* Re: [PATCH 2 of 4] Introduce i386 fibril scheduling
@ 2007-02-06 13:43 Al Boldi
  0 siblings, 0 replies; 92+ messages in thread
From: Al Boldi @ 2007-02-06 13:43 UTC (permalink / raw)
  To: linux-kernel

Linus Torvalds wrote:
> On Mon, 5 Feb 2007, Zach Brown wrote:
> > For syscalls, sure.
> >
> > The kevent work incorporates Uli's desire to have more data per event. 
> > Have you read his OLS stuff?  It's been a while since I did so I've lost
> > the details of why he cares to have more.
>
> You'd still do that as _arguments_ to the system call, not as the return
> value.
>
> Also, quite frankly, I tend to find Uli over-designs things. The whole
> disease of "make things general" is a CS disease that some people take to
> extreme.
>
> The good thing about generic code is not that it solves some generic
> problem. The good thing about generics is that they mean that you can
> _avoid_ solving other problems AND HAVE LESS CODE.

Yes, that would be generic code, in the pure sense.

> But some people seem to
> think that "generic" means that you have to have tons of code to handle
> all the possible cases, and that *completely* misses the point.

That would be generic code too, but by way of functional awareness. This is 
sometimes necessary, as no pure generic code has been found.

What's important is not the generic code, but rather the correct abstraction 
of the problem-domain, regardless of its implementation, as that can be 
conveniently hidden behind the interface.

> We want less code. The whole (and really, the _only_) point of the
> fibrils, at least as far as I'm concerned, is to *not* have special code
> for aio_read/write/whatever.

What we want is correct code, and usually that means less code in the long 
run.  

So, instead of allowing the implementation to dictate the system design, it 
may be advisable to concentrate on the design first, to achieve an abstract 
interface that is realized by an implementation second.


Thanks!

--
Al



* Re: [PATCH 2 of 4] Introduce i386 fibril scheduling
  2007-02-07  9:17                                                     ` Michael K. Edwards
@ 2007-02-07  9:37                                                       ` Michael K. Edwards
  0 siblings, 0 replies; 92+ messages in thread
From: Michael K. Edwards @ 2007-02-07  9:37 UTC (permalink / raw)
  To: Davide Libenzi, Kent Overstreet, Linus Torvalds, Zach Brown,
	Ingo Molnar, Linux Kernel Mailing List, linux-aio,
	Suparna Bhattacharya, Benjamin LaHaise

An idiot using my keyboard wrote:
>     - AIO requests that are serviced from cache ought to immediately
> invoke the callback, in the same thread context as the caller, fixing
> up the stack so that the callback returns to the instruction following
> the syscall.  That way the "immediate completion" path through the
> callback can manipulate data structures, allocate memory, etc. just as
> if it had followed a synchronous call.

Or, of course:
    if (async_stat(entry) == 0) {
        ... immediate completion code path ...
    }

Ugh.  But I think the discussion about the delayed path still holds.

- Michael


* Re: [PATCH 2 of 4] Introduce i386 fibril scheduling
  2007-02-07  6:16                                                   ` Michael K. Edwards
@ 2007-02-07  9:17                                                     ` Michael K. Edwards
  2007-02-07  9:37                                                       ` Michael K. Edwards
  0 siblings, 1 reply; 92+ messages in thread
From: Michael K. Edwards @ 2007-02-07  9:17 UTC (permalink / raw)
  To: Davide Libenzi, Kent Overstreet, Linus Torvalds, Zach Brown,
	Ingo Molnar, Linux Kernel Mailing List, linux-aio,
	Suparna Bhattacharya, Benjamin LaHaise

Man, I should have edited that down before sending it.  Hopefully this
is clearer:

    - The usual programming model for AIO completion in GUIs, media
engines, and the like is an application callback.  Data that is
available immediately may be handled quite differently from data that
arrives after a delay, and usually the only reason for both code paths
to be in the same callback is shared code to maintain counters, etc.
associated with the AIO batch.  These shared operations, and the other
things one might want to do in the delayed path, needn't be able to
block or allocate memory.

    - AIO requests that are serviced from cache ought to immediately
invoke the callback, in the same thread context as the caller, fixing
up the stack so that the callback returns to the instruction following
the syscall.  That way the "immediate completion" path through the
callback can manipulate data structures, allocate memory, etc. just as
if it had followed a synchronous call.

    - AIO requests that need data not in cache should probably be
batched in order to avoid evicting the userspace AIO submission loop,
the immediate completion branch of the callback, and their data
structures from cache on every miss.  If you can use VM copy-on-write
tricks to punt a page of AIO request parameters and closure context
out to another CPU for immediate processing without stomping on your
local caches, great.

    - There's not much point in delivering AIO responses all the way
to userspace until the AIO submission loop is done, because they're
probably going to be handled through some completely different event
queue mechanism in the delayed path through the callback.  Trying to
squeeze a few AIO responses into the same data structure as if they
had been in cache is likely to create race conditions or impose
needless locking overhead on the otherwise serialized immediate
completion branch.

    - The result of the external AIO may arrive on a different CPU
with something else entirely in the foreground; but in real use cases
it's probably a different thread of the same process.  If you can use
the closure context page as the stack page for the kernel bit of the
AIO completion, and then use it again from userspace as the stack page
for the application bit, then the whole ISR -> softirq -> kernel
closure -> application closure path has minimal system impact.

    - The delayed path through the application callback can't block
and can't touch data structures that are thread-local or may be in an
incoherent state at this juncture (called during a more or less
arbitrary ISR exit path, a bit like a signal handler).  That's OK,
because it's probably just massaging the AIO response into fields of a
preallocated object dangling off of a global data structure and doing
a sem_post or some such.  (It might even just drop it if it's stale.)

    - As far as I can tell (knowing little about the scheduler per
se), these kernel closures aren't much like Zach's "fibrils"; they'd
be invoked from a tasklet chained more or less immediately after the
softirq dispatch tasklet.  I have no idea whether the cost of finding
the appropriate kernel closure(s) associated with the data that
arrived in the course of a softirq, pulling them over to the CPU where
the softirq just ran, and popping out to userspace to run the
application closure is exorbitant, or if it's even possible to force a
process switch from inside a tasklet that way.

Hope this helps, and sorry for the noise,
- Michael


* Re: [PATCH 2 of 4] Introduce i386 fibril scheduling
  2007-02-07  0:44                                                 ` Joel Becker
  2007-02-07  1:15                                                   ` Davide Libenzi
@ 2007-02-07  6:16                                                   ` Michael K. Edwards
  2007-02-07  9:17                                                     ` Michael K. Edwards
  1 sibling, 1 reply; 92+ messages in thread
From: Michael K. Edwards @ 2007-02-07  6:16 UTC (permalink / raw)
  To: Davide Libenzi, Kent Overstreet, Linus Torvalds, Zach Brown,
	Ingo Molnar, Linux Kernel Mailing List, linux-aio,
	Suparna Bhattacharya, Benjamin LaHaise

On 2/6/07, Joel Becker <Joel.Becker@oracle.com> wrote:
>         Not everything is in-cache.  Databases will be doing O_DIRECT
> and will expect that 90% of their I/O calls will block.  Why should they
> have to iterate this list every time?  If this is the API, they *have*
> to.  If there's an efficient way to get "just the ones that didn't
> block", then it's not a problem.

It's usually efficient, especially in terms of programmer effort, for
the immediate path to resemble as nearly as possible what you would
have done with the synchronous equivalent.  (If there's some value in
parallelizing the query across multiple CPUs, you probably don't want
the kernel guessing how to partition it.)  But what's efficient for
the delayed path is to be tightly bound to the arrival of the AIO
result, and to do little more than schedule it into the appropriate
event queue or drop it if it is stale.  The immediate and delayed
paths will often share part, but not all, of their implementation, and
most of the shared part is probably data structure setup that can
precede the call itself.  The rest of the delayed path is where the
design effort should go, because it's the part that has the sort of
complex impact on system performance that is hard for application
programmers to think clearly about.

Oracle isn't the only potential userspace user of massively concurrent
AIO with a significant, but not dominant, fraction of cache hits.  I'm
familiar with a similar use case in network monitoring, in which one
would like to implement the attribute tree and query translation more
or less as a userspace filesystem, while leaving both the front-end
caching and the back-end throttling, retries, etc. to in-kernel state
machines.  When 90% of the data requested by the front end (say, a
Python+WxWidgets GUI) is available from the VFS cache, only the other
10% should actually carry the AIO overhead.

Let's look at that immediately available fraction from the GUI
programmer's perspective.  He wants to look up some attributes from a
whole batch of systems, and wants to present all immediately available
results to the user, with the rest grayed out or something.  Each
request for data that is available from cache should result
immediately in a call to his (perhaps bytecode-language) callback,
which fills in a slot in the data structure that he's going to present
wholesale.  There's no reason why the immediate version of the
callback should be unable to allocate memory, poke at thread-local
structures, etc.; and in practice there's little to be gained by
parallelizing this fraction (or even aggressively delivering AIOs that
complete quickly), because you'd need to make that data structure
thread-safe, which probably isn't worth it in performance and certainly
isn't in programmer effort and likelihood of Heisenbugs.

Delayed results, on the other hand, probably have to use the GUI's
event posting mechanism to queue the delivered data (probably in a
massaged form) into a GUI update thread.  Hence the delayed callback
can be delivered in some totally other context if it's VM- and
scheduler-efficient to do so; it's probably just doing a couple of
memcpys and a sem_post or some such.  The only reason it isn't a
totally separate chunk of code is that it uses the same context layout
as the immediate path, and may have to poke at some of the same
pre-allocated places to update completion statistics, etc.
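
To make the two paths concrete, here is a rough C sketch of such a
callback; every name in it (aio_result, view_slot, on_stat_done, the
globals) is made up for illustration and is not part of any proposed
kernel API:

#include <semaphore.h>
#include <stdlib.h>
#include <string.h>

struct aio_result {
	long	result;		/* syscall return value, negative errno on error */
	char	data[256];	/* payload delivered with the completion */
};

struct view_slot {
	struct aio_result res;	/* preallocated before submission */
	int		  ready;
};

static sem_t		g_update_sem;	/* wakes the GUI update thread */
static struct view_slot	*g_pending;	/* global, preallocated table */

/* Called either immediately (same thread as the submitter) or from the
 * delayed completion path, which must not block or allocate. */
static void on_stat_done(int is_delayed, int slot, const struct aio_result *r)
{
	if (!is_delayed) {
		/* Immediate path: ordinary thread context, may allocate,
		 * may touch thread-local GUI structures directly. */
		struct aio_result *copy = malloc(sizeof(*copy));
		if (copy)
			*copy = *r;
		/* ... hand "copy" straight to the presentation code ... */
		free(copy);
		return;
	}

	/* Delayed path: just massage the response into a preallocated
	 * slot hanging off a global structure and poke a semaphore. */
	memcpy(&g_pending[slot].res, r, sizeof(*r));
	g_pending[slot].ready = 1;
	sem_post(&g_update_sem);
}

The immediate branch behaves like ordinary synchronous code; the delayed
branch deliberately limits itself to a memcpy and a sem_post.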

(I implemented something similar to this in userspace using Python
generators for the closure-style callbacks, in the course of rewriting
a GUI that had a separate protocol translator process in place of the
userspace filesystem.  The thread pool that serviced responses from
the protocol translator operated much like Zach's fibrils, and used a
sort of lookup by request cookie to retrieve the closure and feed it
the result, which had the side effect of posting the appropriate
event.  It worked, fast, and it was more or less comprehensible to
later maintainers despite the use of Python's functional features,
because the AIO delivery was kept separate from both the plain-vanilla
immediate-path code and the GUI-idiom event queue processing.)

The broader issue here is that only the application developer really
knows how the AIO results ought to be funneled into the code that
wants them, which could be a database query engine or a GUI update
queue or Murphy knows what.  This "application AIO closure" step is
distinct from the application-neutral closure that needs to run in a
kernel "fibril" (extracting stat() results from an NFS response, or
whatever).  So it seems to me that applications ought to be able to
specify a userspace closure to be executed on async I/O completion (or
timeout, error, etc.), and this closure should be scheduled
efficiently on completion of the kernel bit.

The delayed path through the userspace closure would partly resemble a
signal handler in that it shouldn't touch thread or heap context, just
poke at pre-allocated process-global memory locations and/or
synchronization primitives.  (A closer parallel, for those familiar
with it, would be the "event handlers" of O/S's with cooperative
multitasking and a single foreground application; MacOS 6.x with
MultiFinder and PalmOS 4.x come to mind.)

What if we share a context+stack page between kernel and userspace to
be used by both the kernel "I/O completion" closure and the userspace
"event handler" closure?  After all, these are the pieces that
cooperatively multitask with one another.  Pop the kernel AIO closure
scheduler into the tasklet queue right after the softirq tasklet --
surely 99% of "fibrils" would become runnable due to something that
happens in a softirq, and it would make at least as much sense to run
there as in the task's schedule() path.  The event handler would be
scheduled in most respects like a signal handler in a POSIX threaded
process -- running largely in the context of some application thread
(on syscall exit or by preemption), and limited in the set of APIs it
can call.

In this picture, the ideal peristalsis would usually be ISR exit path
-> softirq -> kernel closure (possibly not thread-like at all, just a
completion scheduled from a tasklet) -> userspace closure ->
application thread.  The kernel and userspace closures could actually
share a stack page which also contains the completion context for
both.  Linus's async_stat() example is a good one, I think.  Here is
somewhat fuller userspace code, without the syntactic sugar that could
easily be used to make the callbacks more closure-ish:

/* linux/aeiou.h */
       typedef void (*aeiou_stat_cb_t) (int, struct aeiou_stat *);

       struct aeiou_stat __ALIGN_ME_PROPERLY__ {
               aeiou_stat_cb_t cb;        /* userspace completion hook */
               struct stat stat_buf;
               union {
                       int filedes;
                       char name[NAME_MAX+1];
               } u;
#ifdef __KERNEL__
               ... completion context for the kernel AIO closure ...
#endif
       };

       /* The returned pointer is the cookie for all */
       /* subsequent aeiou calls in this request group. */
       void *__aeiou_alloc_aeiou_stat(size_t uctx_bytes);

       #define aeiou_begin(ktype, utype, field) \
               ((utype *)__aeiou_alloc_##ktype(offsetof(utype, field)))

/* foo.c */
       struct one_entry {
               ... closure context for the userspace event handler ...
               struct aeiou_stat s;
       };

       static void my_cb(int is_delayed, struct aeiou_stat *as) {
               struct one_entry *my_context = container_of(as, struct one_entry, s);
               ... code that runs in userspace "event handler" context ...
       }

...

       struct one_entry *entry = aeiou_begin(aeiou_stat, struct one_entry, s);
       struct dirent *de;

       entry->s.cb = my_cb;
       /* set up some process-global data structure to hold */
       /* the results of this burst of async_stat calls */

       while ((de = readdir(dir)) != NULL) {
               strcpy(entry->s.u.name, de->d_name);
               /* set up any additional application context */
               /* in *entry for this individual async_stat call */

               aeiou_stat(entry);
       }
       /* application tracks outstanding AIOs using data structure */
       /* there could also be an aeiou_checkprogress(entry) */
       ...
       aeiou_end(entry);

(The use of "aeiou_stat" rather than a more general class of async I/O
calls is for illustration purposes.)

If the stat data is immediately available when aeiou_stat() is called,
the struct stat gets filled in and the callback is run immediately in
the current stack context.  If not, the contents of *entry are copied
to a new page (possibly using COW VM magic), and the syscall returns.
On the next trip through the scheduler (or when a large enough batch
of AIOs have been queued to be worth initiating them at the cost of
shoving the userspace code out of cache), the kernel closures are set
up in the opaque trailer to aeiou_stat in the copies, and the AIOs are
initiated.

The signature of aeiou_stat is deliberately limited to a single
pointer, since all of its arguments are likely to be interesting to
one or both closures.  There is no need to pass the offset to the
kernel parameter sub-struct into calls after the initial aeiou_begin;
the kernel has to check the validity of the "entry" pointer/cookie
anyway, so it had best keep track of the enclosing allocation bounds,
offset to the syscall parameter structure, etc. in a place where
userspace can't alter it.  Both kernel and userspace closures
eventually run with their stack in the shared page, after the closure
context area.  The userspace closure has to respect
signal-handler-like limitations on its powers if is_delayed is true;
it will run in the right process context but has no particular thread
context and can't call anything that could block or allocate memory.

I think this sort of interface might work well for both GUI event
frameworks and real-time streaming media playback/mixing, which are
two common ways for AIO to enter the mere userspace programmer's
sphere of concern (and also happen to be areas where I have some
expertise).  Would it work for the Oracle use case?

Cheers,
- Michael


* Re: [PATCH 2 of 4] Introduce i386 fibril scheduling
  2007-02-07  1:15                                                   ` Davide Libenzi
  2007-02-07  1:24                                                     ` Kent Overstreet
@ 2007-02-07  1:30                                                     ` Joel Becker
  1 sibling, 0 replies; 92+ messages in thread
From: Joel Becker @ 2007-02-07  1:30 UTC (permalink / raw)
  To: Davide Libenzi
  Cc: Kent Overstreet, Linus Torvalds, Zach Brown, Ingo Molnar,
	Linux Kernel Mailing List, linux-aio, Suparna Bhattacharya,
	Benjamin LaHaise

On Tue, Feb 06, 2007 at 05:15:02PM -0800, Davide Libenzi wrote:
> I think ATM the core kernel implementation should be the focus, because 

	Yeah, I was thinking the same thing.  I originally posted just
to make the point :-)

Joel

-- 

Life's Little Instruction Book #99

	"Think big thoughts, but relish small pleasures."

Joel Becker
Principal Software Developer
Oracle
E-mail: joel.becker@oracle.com
Phone: (650) 506-8127


* Re: [PATCH 2 of 4] Introduce i386 fibril scheduling
  2007-02-07  1:15                                                   ` Davide Libenzi
@ 2007-02-07  1:24                                                     ` Kent Overstreet
  2007-02-07  1:30                                                     ` Joel Becker
  1 sibling, 0 replies; 92+ messages in thread
From: Kent Overstreet @ 2007-02-07  1:24 UTC (permalink / raw)
  To: Davide Libenzi
  Cc: Joel Becker, Linus Torvalds, Zach Brown, Ingo Molnar,
	Linux Kernel Mailing List, linux-aio, Suparna Bhattacharya,
	Benjamin LaHaise

> If that's what is wanted, then the async_submit() API can detect the
> synchronous completion early, and drop a result inside the result queue
> immediately. It means that an immediately following async_wait() will find
> some completions already there. Or:
>
> struct async_submit {
>         void *cookie;
>         int sysc_nbr;
>         int nargs;
>         long args[ASYNC_MAX_ARGS];
> };
> struct async_result {
>         void *cookie;
>         long result;
> };
>
> int async_submit(struct async_submit *a, struct async_result *r, int n);
>
> Where "r" will store the ones that completed syncronously. I mean, there
> are really many ways to do this.

That interface (modifying async_submit to pass in the size of the
result array) would work great.


* Re: [PATCH 2 of 4] Introduce i386 fibril scheduling
  2007-02-06 23:04                                       ` Linus Torvalds
@ 2007-02-07  1:22                                         ` Kent Overstreet
  0 siblings, 0 replies; 92+ messages in thread
From: Kent Overstreet @ 2007-02-07  1:22 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Davide Libenzi, Zach Brown, Ingo Molnar,
	Linux Kernel Mailing List, linux-aio, Suparna Bhattacharya,
	Benjamin LaHaise

On 2/6/07, Linus Torvalds <torvalds@linux-foundation.org> wrote:
> On Tue, 6 Feb 2007, Kent Overstreet wrote:
> >
> > The "struct aiocb" isn't something you have to or necessarily want to
> > keep around.
>
> Oh, don't get me wrong - the _only_ reason for "struct aiocb" would be
> backwards compatibility. The point is, we'd need to keep that
> compatibility to be useful - otherwise we just end up having to duplicate
> the work (do _both_ fibrils _and_ the in-kernel AIO).

Bah, I was unclear here, sorry. I was talking about the userspace interface.

Right now, with the aio interface, io_submit passes in an array of
pointers to struct iocb; there's nothing that says the kernel will be
done with the structs when io_submit returns, so while userspace is
free to reuse the array of pointers, it can't free the actual iocbs
until they complete.

This is slightly stupid, for a couple of reasons, and if we're making a
new pair of syscalls it'd be better to do it slightly differently.

What you want is for the async_submit syscall (or whatever it's
called) to pass in an array of structs, and for the kernel to not
reference them after async_submit returns. This is easy; after
async_submit returns, each syscall in the array is either completed
(if it could be without blocking), or in progress, and there's no
reason to need the arguments again.

It also means that the kernel has to copy in only a single userspace
buffer, instead of one buffer per syscall; as Joel mentions, there are
plenty of apps that will be doing 1000s of syscalls at once. From a
userspace perspective it's awesome: it simplifies the coding and
means you have to hit the heap that much less.
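
Purely as a sketch of what I mean (the syscall name, number and struct
layout here are placeholders, not a proposal):

#include <sys/syscall.h>

#define ASYNC_MAX_ARGS	6

struct async_submit {
	void	*cookie;		/* handed back with the completion */
	int	sysc_nbr;		/* which syscall to run */
	int	nargs;
	long	args[ASYNC_MAX_ARGS];
};

/* Assumed semantics: the kernel copies the array in during the call and
 * never looks at it again, so "batch" can be stack or reused storage. */
extern int async_submit(struct async_submit *batch, int n);

int submit_reads(int fd, char (*bufs)[4096], int n)
{
	struct async_submit batch[64];	/* reused for every submission */
	int i;

	if (n > 64)
		n = 64;
	for (i = 0; i < n; i++) {
		batch[i].cookie   = bufs[i];
		batch[i].sysc_nbr = SYS_read;
		batch[i].nargs    = 3;
		batch[i].args[0]  = fd;
		batch[i].args[1]  = (long)bufs[i];
		batch[i].args[2]  = sizeof(bufs[i]);
	}
	/* One copy_from_user for the whole batch; nothing here needs to
	 * stay allocated until the individual operations complete. */
	return async_submit(batch, n);
}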


* Re: [PATCH 2 of 4] Introduce i386 fibril scheduling
  2007-02-07  0:44                                                 ` Joel Becker
@ 2007-02-07  1:15                                                   ` Davide Libenzi
  2007-02-07  1:24                                                     ` Kent Overstreet
  2007-02-07  1:30                                                     ` Joel Becker
  2007-02-07  6:16                                                   ` Michael K. Edwards
  1 sibling, 2 replies; 92+ messages in thread
From: Davide Libenzi @ 2007-02-07  1:15 UTC (permalink / raw)
  To: Joel Becker
  Cc: Kent Overstreet, Linus Torvalds, Zach Brown, Ingo Molnar,
	Linux Kernel Mailing List, linux-aio, Suparna Bhattacharya,
	Benjamin LaHaise

On Tue, 6 Feb 2007, Joel Becker wrote:

> > - Is it more expensive to forcibly have to wait and fetch a result even 
> >   for in-cache syscalls, or is it faster to walk the submission array?
> 
> 	Not everything is in-cache.  Databases will be doing O_DIRECT
> and will expect that 90% of their I/O calls will block.  Why should they
> have to iterate this list every time?  If this is the API, they *have*
> to.  If there's an efficient way to get "just the ones that didn't
> block", then it's not a problem.

If that's what is wanted, then the async_submit() API can detect the 
synchronous completion early, and drop a result inside the result queue 
immediately. It means that an immediately following async_wait() will find 
some completions already there. Or:

struct async_submit {
        void *cookie;
        int sysc_nbr;
        int nargs;
        long args[ASYNC_MAX_ARGS];
};
struct async_result {
        void *cookie;
	long result;
};

int async_submit(struct async_submit *a, struct async_result *r, int n);

Where "r" will store the ones that completed syncronously. I mean, there 
are really many ways to do this.
I think ATM the core kernel implementation should be the focus, because 
IMO we just scratched the surface of the potential problems that something 
like this can raise (scheduling, signaling, cleanup, cancel - just to 
name a few).
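
Userspace consumption could then look something like the sketch below.
Note that the convention I'm assuming here (async_submit() returns the
number of entries it stored in "r", everything else arrives later through
async_wait()) is just one of those many ways:

/* Uses struct async_submit / struct async_result as declared above;
 * handle_completion() is a placeholder for the application's code. */
extern void handle_completion(void *cookie, long result);
extern int async_submit(struct async_submit *a, struct async_result *r, int n);

void submit_batch(struct async_submit *subs, int n)
{
	struct async_result done[128];	/* n assumed <= 128 for the sketch */
	int i, completed;

	completed = async_submit(subs, done, n);

	/* Only the operations that completed synchronously are walked;
	 * the ones that went async never show up in this loop. */
	for (i = 0; i < completed; i++)
		handle_completion(done[i].cookie, done[i].result);
}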



- Davide




* Re: [PATCH 2 of 4] Introduce i386 fibril scheduling
  2007-02-07  0:23                                               ` Davide Libenzi
@ 2007-02-07  0:44                                                 ` Joel Becker
  2007-02-07  1:15                                                   ` Davide Libenzi
  2007-02-07  6:16                                                   ` Michael K. Edwards
  0 siblings, 2 replies; 92+ messages in thread
From: Joel Becker @ 2007-02-07  0:44 UTC (permalink / raw)
  To: Davide Libenzi
  Cc: Kent Overstreet, Linus Torvalds, Zach Brown, Ingo Molnar,
	Linux Kernel Mailing List, linux-aio, Suparna Bhattacharya,
	Benjamin LaHaise

On Tue, Feb 06, 2007 at 04:23:52PM -0800, Davide Libenzi wrote:
> To how many "sessions" do those 1000 *parallel* I/O operations refer? 
> Because, if you batch them in an async fashion, they have to be parallel.

	They're independent.  Of course they have to be parallel, that's
what I/O wants.

> Without the per-async operation status code, you'll need to wait for a result 
> *for each* submitted syscall, even the ones that completed synchronously.

	You are right, but it's more efficient in some cases.

> Open questions are:
> 
> - Is the 1000 *parallel* syscall vectored submission case common?

	Sure is for I/O.  It's the majority of the case.  If you have
1000 blocks to send out, you want them all down at the request queue at
once, where they can merge.

> - Is it more expensive to forcibly have to wait and fetch a result even 
>   for in-cache syscalls, or is it faster to walk the submission array?

	Not everything is in-cache.  Databases will be doing O_DIRECT
and will expect that 90% of their I/O calls will block.  Why should they
have to iterate this list every time?  If this is the API, they *have*
to.  If there's an efficient way to get "just the ones that didn't
block", then it's not a problem.

Joel


-- 

"The real reason GNU ls is 8-bit-clean is so that they can
 start using ISO-8859-1 option characters."
	- Christopher Davis (ckd@loiosh.kei.com)

Joel Becker
Principal Software Developer
Oracle
E-mail: joel.becker@oracle.com
Phone: (650) 506-8127


* Re: [PATCH 2 of 4] Introduce i386 fibril scheduling
  2007-02-07  0:06                                             ` Joel Becker
@ 2007-02-07  0:23                                               ` Davide Libenzi
  2007-02-07  0:44                                                 ` Joel Becker
  0 siblings, 1 reply; 92+ messages in thread
From: Davide Libenzi @ 2007-02-07  0:23 UTC (permalink / raw)
  To: Joel Becker
  Cc: Kent Overstreet, Linus Torvalds, Zach Brown, Ingo Molnar,
	Linux Kernel Mailing List, linux-aio, Suparna Bhattacharya,
	Benjamin LaHaise

On Tue, 6 Feb 2007, Joel Becker wrote:

> On Tue, Feb 06, 2007 at 03:56:14PM -0800, Davide Libenzi wrote:
> > Async syscall submissions are a _one time_ thing. It's not like a live fd 
> > that you can push inside epoll and avoid the multiple O(N) passes.
> > First of all, the number of syscalls that you'd submit in a vectored way 
> > is limited. It does not depend on the total number of connections, but on 
> 
> 	I regularly see apps that want to submit 1000 I/Os at once.
> Every submit.  But it's all against one or two file descriptors.  So, if
> you return to userspace, they have to walk all 1000 async_results every
> time, just to see which completed and which didn't.  And *then* go wait
> for the ones that didn't.  If they just wait for them all, they aren't
> spinning cpu on the -EASYNC operations.
> 	I'm not saying that "don't return a completion if we can
> non-block it" is inherently wrong or not a good idea.  I'm saying that
> we need a way to flag them efficiently.

To how many "sessions" do those 1000 *parallel* I/O operations refer? 
Because, if you batch them in an async fashion, they have to be parallel.
Without the per-async operation status code, you'll need to wait for a result 
*for each* submitted syscall, even the ones that completed synchronously.
Open questions are:

- Is the 1000 *parallel* syscall vectored submission case common?

- Is it more expensive to forcibly have to wait and fetch a result even 
  for in-cache syscalls, or is it faster to walk the submission array?



- Davide




* Re: [PATCH 2 of 4] Introduce i386 fibril scheduling
  2007-02-06 23:56                                           ` Davide Libenzi
@ 2007-02-07  0:06                                             ` Joel Becker
  2007-02-07  0:23                                               ` Davide Libenzi
  0 siblings, 1 reply; 92+ messages in thread
From: Joel Becker @ 2007-02-07  0:06 UTC (permalink / raw)
  To: Davide Libenzi
  Cc: Kent Overstreet, Linus Torvalds, Zach Brown, Ingo Molnar,
	Linux Kernel Mailing List, linux-aio, Suparna Bhattacharya,
	Benjamin LaHaise

On Tue, Feb 06, 2007 at 03:56:14PM -0800, Davide Libenzi wrote:
> Async syscall submissions are a _one time_ thing. It's not like a live fd 
> that you can push inside epoll and avoid the multiple O(N) passes.
> First of all, the number of syscalls that you'd submit in a vectored way 
> is limited. It does not depend on the total number of connections, but on 

	I regularly see apps that want to submit 1000 I/Os at once.
Every submit.  But it's all against one or two file descriptors.  So, if
you return to userspace, they have to walk all 1000 async_results every
time, just to see which completed and which didn't.  And *then* go wait
for the ones that didn't.  If they just wait for them all, they aren't
spinning cpu on the -EASYNC operations.
	I'm not saying that "don't return a completion if we can
non-block it" is inherently wrong or not a good idea.  I'm saying that
we need a way to flag them efficiently.

Joel

-- 

Life's Little Instruction Book #80

	"Slow dance"

Joel Becker
Principal Software Developer
Oracle
E-mail: joel.becker@oracle.com
Phone: (650) 506-8127


* Re: [PATCH 2 of 4] Introduce i386 fibril scheduling
  2007-02-06 23:39                                         ` Joel Becker
@ 2007-02-06 23:56                                           ` Davide Libenzi
  2007-02-07  0:06                                             ` Joel Becker
  0 siblings, 1 reply; 92+ messages in thread
From: Davide Libenzi @ 2007-02-06 23:56 UTC (permalink / raw)
  To: Joel Becker
  Cc: Kent Overstreet, Linus Torvalds, Zach Brown, Ingo Molnar,
	Linux Kernel Mailing List, linux-aio, Suparna Bhattacharya,
	Benjamin LaHaise

On Tue, 6 Feb 2007, Joel Becker wrote:

> On Tue, Feb 06, 2007 at 03:23:47PM -0800, Davide Libenzi wrote:
> > struct async_submit {
> > 	void *cookie;
> > 	int sysc_nbr;
> > 	int nargs;
> > 	long args[ASYNC_MAX_ARGS];
> > 	int async_result;
> > };
> > 
> > int async_submit(struct async_submit *a, int n);
> > 
> > And async_submit() can mark each one ->async_result with -EASYNC (syscall 
> > has been batched), or another code (syscall completed w/out schedule).
> > IMO, once you get a -EASYNC for a syscall, you *have* to retire the result.
> 
> 	There are pains here, though.  On every submit, you have to walk
> the entire vector just to know what did or did not complete.  I've seen
> this in other APIs (eg, async_result would be -EAGAIN for lack of
> resources to start this particular fibril).  Userspace submit ends up
> always walking the array of submissions twice - once to prep them, and
> once to check if they actually went async.  For longer lists of I/Os,
> this is expensive.

Async syscall submissions are a _one time_ thing. It's not like a live fd 
that you can push inside epoll and avoid the multiple O(N) passes.
First of all, the number of syscalls that you'd submit in a vectored way 
is limited. It does not depend on the total number of connections, but on 
the number of syscalls that you are actually able to submit in parallel.
Note that it's not a trivial task to extract a level of parallelism high 
enough to make walking through the submission array painful. Think about 
the trivial web server case. The remote HTTP client asks for one page, and 
you may think of batching a few ops together (like a stat, open, send 
headers, and sendfile for example), but those cannot be vectored since 
they have to complete in order. The stat would even trigger a different 
response to the HTTP client. You need the open() fd to submit the 
send-headers and the sendfile.
IMO there are no scalability problems in a multiple submission/retrieval 
API like the above (or any variation of it).
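
Written out synchronously, that chain looks like the sketch below
(send_headers() is a placeholder, error handling trimmed). Every step
consumes the result of the previous one, which is exactly why it cannot
be vectored:

#include <sys/stat.h>
#include <sys/sendfile.h>
#include <fcntl.h>
#include <unistd.h>

extern int send_headers(int sock, const struct stat *st);

static ssize_t serve_file(int sock, const char *path)
{
	struct stat st;
	ssize_t ret;
	int fd;

	if (stat(path, &st) < 0)		/* decides which response to send */
		return -1;
	fd = open(path, O_RDONLY);		/* needed by everything below */
	if (fd < 0)
		return -1;
	if (send_headers(sock, &st) < 0) {	/* depends on the stat result */
		close(fd);
		return -1;
	}
	ret = sendfile(sock, fd, NULL, st.st_size);	/* depends on the open */
	close(fd);
	return ret;
}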



- Davide




* Re: [PATCH 2 of 4] Introduce i386 fibril scheduling
  2007-02-06 23:23                                       ` Davide Libenzi
@ 2007-02-06 23:39                                         ` Joel Becker
  2007-02-06 23:56                                           ` Davide Libenzi
  0 siblings, 1 reply; 92+ messages in thread
From: Joel Becker @ 2007-02-06 23:39 UTC (permalink / raw)
  To: Davide Libenzi
  Cc: Kent Overstreet, Linus Torvalds, Zach Brown, Ingo Molnar,
	Linux Kernel Mailing List, linux-aio, Suparna Bhattacharya,
	Benjamin LaHaise

On Tue, Feb 06, 2007 at 03:23:47PM -0800, Davide Libenzi wrote:
> struct async_submit {
> 	void *cookie;
> 	int sysc_nbr;
> 	int nargs;
> 	long args[ASYNC_MAX_ARGS];
> 	int async_result;
> };
> 
> int async_submit(struct async_submit *a, int n);
> 
> And async_submit() can mark each one ->async_result with -EASYNC (syscall 
> has been batched), or another code (syscall completed w/out schedule).
> IMO, once you get a -EASYNC for a syscall, you *have* to retire the result.

	There are pains here, though.  On every submit, you have to walk
the entire vector just to know what did or did not complete.  I've seen
this in other APIs (eg, async_result would be -EAGAIN for lack of
resources to start this particular fibril).  Userspace submit ends up
always walking the array of submissions twice - once to prep them, and
once to check if they actually went async.  For longer lists of I/Os,
this is expensive.
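
To spell out the two passes (using struct async_submit with its
->async_result field as in Davide's mail above; the -EASYNC value and the
helper functions here are made up):

#define EASYNC	1025	/* hypothetical errno-style "went async" marker */

extern int async_submit(struct async_submit *a, int n);
extern void prep_io(struct async_submit *s, int idx);
extern void complete_io(void *cookie, long result);

void submit_and_scan(struct async_submit *vec, int n)
{
	int i;

	for (i = 0; i < n; i++)			/* pass 1: prep the requests */
		prep_io(&vec[i], i);

	async_submit(vec, n);

	for (i = 0; i < n; i++) {		/* pass 2: see what went async */
		if (vec[i].async_result == -EASYNC)
			continue;		/* still in flight, reap later */
		complete_io(vec[i].cookie, vec[i].async_result);
	}
}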

Joel

-- 

"Too much walking shoes worn thin.
 Too much trippin' and my soul's worn thin.
 Time to catch a ride it leaves today
 Her name is what it means.
 Too much walking shoes worn thin."

Joel Becker
Principal Software Developer
Oracle
E-mail: joel.becker@oracle.com
Phone: (650) 506-8127


* Re: [PATCH 2 of 4] Introduce i386 fibril scheduling
  2007-02-06 22:45                                     ` Kent Overstreet
  2007-02-06 23:04                                       ` Linus Torvalds
@ 2007-02-06 23:23                                       ` Davide Libenzi
  2007-02-06 23:39                                         ` Joel Becker
  1 sibling, 1 reply; 92+ messages in thread
From: Davide Libenzi @ 2007-02-06 23:23 UTC (permalink / raw)
  To: Kent Overstreet
  Cc: Linus Torvalds, Zach Brown, Ingo Molnar,
	Linux Kernel Mailing List, linux-aio, Suparna Bhattacharya,
	Benjamin LaHaise

On Tue, 6 Feb 2007, Kent Overstreet wrote:

> The trouble with differentiating between calls that block and calls
> that don't is you completely lose the ability to batch syscalls
> together; this is potentially a major win of an asynchronous
> interface.

It doesn't necessarily have to, once you extend the single return code to a 
vector:

struct async_submit {
	void *cookie;
	int sysc_nbr;
	int nargs;
	long args[ASYNC_MAX_ARGS];
	int async_result;
};

int async_submit(struct async_submit *a, int n);

And async_submit() can mark each one ->async_result with -EASYNC (syscall 
has been batched), or another code (syscall completed w/out schedule).
IMO, once you get a -EASYNC for a syscall, you *have* to retire the result.



- Davide




* Re: [PATCH 2 of 4] Introduce i386 fibril scheduling
  2007-02-06 22:45                                     ` Kent Overstreet
@ 2007-02-06 23:04                                       ` Linus Torvalds
  2007-02-07  1:22                                         ` Kent Overstreet
  2007-02-06 23:23                                       ` Davide Libenzi
  1 sibling, 1 reply; 92+ messages in thread
From: Linus Torvalds @ 2007-02-06 23:04 UTC (permalink / raw)
  To: Kent Overstreet
  Cc: Davide Libenzi, Zach Brown, Ingo Molnar,
	Linux Kernel Mailing List, linux-aio, Suparna Bhattacharya,
	Benjamin LaHaise



On Tue, 6 Feb 2007, Kent Overstreet wrote:
> 
> The "struct aiocb" isn't something you have to or necessarily want to
> keep around.

Oh, don't get me wrong - the _only_ reason for "struct aiocb" would be 
backwards compatibility. The point is, we'd need to keep that 
compatibility to be useful - otherwise we just end up having to duplicate 
the work (do _both_ fibrils _and_ the in-kernel AIO). 

> I don't see the point in having a ring for completed events, since
> it's at most two pointers per completion; quite a bit less data being
> sent back than for submissions.

I'm certainly personally perfectly happy with the kernel not remembering 
any completed events at all - once it's done, it's done and forgotten. So 
doing

	async(mycookie)
	wait_for_async(mycookie)

could actually return with -ECHILD (or similar error). 

In other words, if you see it as a "process interface" (instead of as a 
"filedescriptor interface"), I'd suggest automatic reaping of the fibril 
children. I do *not* think we want the equivalent of zombies - if only 
because they are just a lot of work to reap, and potentially a lot of 
memory to keep around.

		Linus


* Re: [PATCH 2 of 4] Introduce i386 fibril scheduling
  2007-02-06 20:46                                   ` Linus Torvalds
  2007-02-06 21:16                                     ` David Miller
@ 2007-02-06 22:45                                     ` Kent Overstreet
  2007-02-06 23:04                                       ` Linus Torvalds
  2007-02-06 23:23                                       ` Davide Libenzi
  1 sibling, 2 replies; 92+ messages in thread
From: Kent Overstreet @ 2007-02-06 22:45 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Davide Libenzi, Zach Brown, Ingo Molnar,
	Linux Kernel Mailing List, linux-aio, Suparna Bhattacharya,
	Benjamin LaHaise

On 2/6/07, Linus Torvalds <torvalds@linux-foundation.org> wrote:
> On Mon, 5 Feb 2007, Kent Overstreet wrote:
> >
> > struct asys_ret {
> >     int ret;
> >     struct thread *p;
> > };
> >
> > struct asys_ret r;
> > r.p = me;
> >
> > async_read(fd, buf, nbytes, &r);
>
> That's horrible. It means that "r" cannot have automatic linkage (since
> the stack will be *gone* by the time we need to fill in "ret"), so now you
> need to track *two* pointers: "me" and "&r".

You'd only allocate r on the stack if that stack is going to be around
later; i.e. if you're using user threads. Otherwise, you just allocate
it in some struct containing your aiocb or whatever.

> And for user space, it means that we pass the _one_ thing around that we
> need for both identifying the async operation to the kernel (the "cookie")
> for wait or cancel, and the place where we expect the return value to be
> found (which in turn can _easily_ represent a whole "struct aiocb *",
> since the return value obviously has to be embedded in there anyway).
>
>                 Linus

The "struct aiocb" isn't something you have to or necessarily want to
keep around. It's the way the current aio interface works (which I've
coded to), but I don't really see the point. All it really contains is
the syscall arguments, but once the syscall's in progress there's no
reason the kernel has to refer back to it; similarly for userspace,
it's just another struct that userspace has to keep track of and free
at some later time.

In fact, that's the only sane way you can have a ring for submitted
system calls, as otherwise elements of the ring are getting freed in
essentially random order.

I don't see the point in having a ring for completed events, since
it's at most two pointers per completion; quite a bit less data being
sent back than for submissions.

-----

The trouble with differentiating between calls that block and calls
that don't is you completely lose the ability to batch syscalls
together; this is potentially a major win of an asynchronous
interface.

An app can have a bunch of cheap, fast user space threads servicing
whatever; as they run, they can push their system calls onto a global
stack. When no more can run, the app does a giant asys_submit (something
similar to io_submit), then the io_getevents equivalent, running the
user threads whose syscalls completed.

This doesn't mean you can't run the syscalls that wouldn't block
synchronously, or that you have to allocate a fibril for every
syscall - but for servers that care more about throughput than
latency, this is potentially a big win, in cache effects if nothing
else.

(And this doesn't prevent you from having a different syscall that
submits an asynchronous syscall, but runs it right away if it was able
to without blocking).
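
Roughly what I have in mind, with asys_submit(), asys_getevents() and the
user-thread scheduler all hand-waved as placeholders:

struct asys_call {
	void	*cookie;	/* here: the user thread to resume */
	int	sysc_nbr;
	long	args[6];
	long	result;
};

extern int asys_submit(struct asys_call *calls, int n);
extern int asys_getevents(struct asys_call **done, int min, int max);
extern int run_user_threads(struct asys_call *pending, int max);
extern void resume_user_thread(void *thread, long result);

void scheduler_loop(void)
{
	struct asys_call pending[1024];
	struct asys_call *done[1024];
	int n, i, got;

	for (;;) {
		/* Run user threads until none can make progress; each one
		 * that blocks pushes its syscall onto the pending array. */
		n = run_user_threads(pending, 1024);
		if (n == 0)
			break;

		asys_submit(pending, n);	/* one trip into the kernel */

		/* Wait for at least one completion, then resume the
		 * user threads whose syscalls finished. */
		got = asys_getevents(done, 1, 1024);
		for (i = 0; i < got; i++)
			resume_user_thread(done[i]->cookie, done[i]->result);
	}
}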


* Re: [PATCH 2 of 4] Introduce i386 fibril scheduling
  2007-02-06 21:50                                           ` Linus Torvalds
@ 2007-02-06 22:28                                             ` Zach Brown
  0 siblings, 0 replies; 92+ messages in thread
From: Zach Brown @ 2007-02-06 22:28 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: David Miller, kent.overstreet, davidel, mingo, linux-kernel,
	linux-aio, suparna, bcrl

> That's not how the patches work right now, but yes, I at least  
> personally
> think that it's something we should aim for (ie the interface  
> shouldn't
> _require_ us to always wait for things even if perhaps an early
> implementation might make everything be delayed at first)

I agree that we shouldn't require a separate syscall just to get the
return code from ops that didn't block.

It doesn't seem like much of a stretch to imagine a setup where we  
can specify completion context as part of the submission itself.

	declare_empty_ring(ring);
	struct submission sub;

	sub.ring = &ring;
	sub.nr = SYS_fstat64;
	sub.args = ...;

	ret = submit(&sub, 1);
	if (ret == 0) {
		wait_for_elements(&ring, 1);
		printf("stat gave %d\n", ring[ring->head].rc);
	}

You get the idea, it's just an outline.

wait_for_elements() could obviously check the ring before falling  
back to kernel sync.  I'm pretty keen on the notion of producer/ 
consumer rings where userspace writes the head as it plucks  
completions and the kernel writes the tail as it adds them.

We might want per-call ring pointers, instead of per submission, to  
help submitters wait for a group of ops to complete without having to  
do their own tracking on event completion.  That only makes sense if  
we have the waiting mechanics let you only be woken as the number of  
events in the ring crosses some threshold.  Which I think we want  
anyway.
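
Something along these lines for the ring itself - a purely illustrative
layout, no barriers, no wrap-count tricks, not an ABI proposal:

#include <stdint.h>

struct ring_entry {
	uint64_t	cookie;
	int64_t		rc;	/* negative errno or result */
};

struct completion_ring {
	uint32_t		head;	/* written by userspace */
	uint32_t		tail;	/* written by the kernel */
	uint32_t		mask;	/* entries - 1, power of two */
	struct ring_entry	entries[];
};

/* Drain whatever the kernel has produced so far; returns how many. */
static int drain_ring(struct completion_ring *ring,
		      void (*fn)(uint64_t cookie, int64_t rc))
{
	uint32_t head = ring->head;
	int n = 0;

	while (head != ring->tail) {
		struct ring_entry *e = &ring->entries[head & ring->mask];

		fn(e->cookie, e->rc);
		head++;
		n++;
	}
	ring->head = head;	/* real code would need memory barriers */
	return n;
}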

We'd be trading building up a specific completion state with syscalls  
for some complexity during submission that pins (and kmaps on  
completion) the user pages.  Submission could return failure if  
pinning these new pages would push us over some rlimit.  We'd have to  
be *awfully* careful not to let userspace corrupt (munmap?) the ring  
and confuse the hell out of the kernel.

Maybe not worth it, but if we *really* cared about making the non- 
blocking case almost identical to the sync case and wanted to use the  
same interface for batch submission and async completion then this  
seems like a possibility.

- z


* Re: [PATCH 2 of 4] Introduce i386 fibril scheduling
  2007-02-06 21:31                                         ` David Miller
  2007-02-06 21:46                                           ` Eric Dumazet
@ 2007-02-06 21:50                                           ` Linus Torvalds
  2007-02-06 22:28                                             ` Zach Brown
  1 sibling, 1 reply; 92+ messages in thread
From: Linus Torvalds @ 2007-02-06 21:50 UTC (permalink / raw)
  To: David Miller
  Cc: kent.overstreet, davidel, zach.brown, mingo, linux-kernel,
	linux-aio, suparna, bcrl



On Tue, 6 Feb 2007, David Miller wrote:
> 
> So the idea is to just run it to completion if it won't block and use
> a fibril if it would?

That's not how the patches work right now, but yes, I at least personally 
think that it's something we should aim for (ie the interface shouldn't 
_require_ us to always wait for things even if perhaps an early 
implementation might make everything be delayed at first)

			Linus


* Re: [PATCH 2 of 4] Introduce i386 fibril scheduling
  2007-02-06 21:31                                         ` David Miller
@ 2007-02-06 21:46                                           ` Eric Dumazet
  2007-02-06 21:50                                           ` Linus Torvalds
  1 sibling, 0 replies; 92+ messages in thread
From: Eric Dumazet @ 2007-02-06 21:46 UTC (permalink / raw)
  To: David Miller
  Cc: torvalds, kent.overstreet, davidel, zach.brown, mingo,
	linux-kernel, linux-aio, suparna, bcrl

David Miller wrote:
> From: Linus Torvalds <torvalds@linux-foundation.org>
> Date: Tue, 6 Feb 2007 13:28:34 -0800 (PST)
> 
>> Yeah, in 1% of all cases it will block, and you'll want to wait for them. 
>> Maybe the kevent queue works then, but if it needs any more setup than the 
>> nonblocking case, that's a big no.
> 
> So the idea is to just run it to completion if it won't block and use
> a fibril if it would?
> 
> kevent could support something like that too.

It seems to me that kevent was designed to handle many event sources on a 
single endpoint, like epoll (but with different internals). A typical load 
is thousands of socket/pipe providers glued into one queue.

In the fibril case, I guess a thread won't have many fibrils lying around...

Also, kevent needs an fd lookup/fput to retrieve some queued events, and that 
may be a performance hit for the AIO case (fget/fput in a multi-threaded 
program costs some atomic ops).



* Re: [PATCH 2 of 4] Introduce i386 fibril scheduling
  2007-02-06 21:28                                       ` Linus Torvalds
@ 2007-02-06 21:31                                         ` David Miller
  2007-02-06 21:46                                           ` Eric Dumazet
  2007-02-06 21:50                                           ` Linus Torvalds
  0 siblings, 2 replies; 92+ messages in thread
From: David Miller @ 2007-02-06 21:31 UTC (permalink / raw)
  To: torvalds
  Cc: kent.overstreet, davidel, zach.brown, mingo, linux-kernel,
	linux-aio, suparna, bcrl

From: Linus Torvalds <torvalds@linux-foundation.org>
Date: Tue, 6 Feb 2007 13:28:34 -0800 (PST)

> Yeah, in 1% of all cases it will block, and you'll want to wait for them. 
> Maybe the kevent queue works then, but if it needs any more setup than the 
> nonblocking case, that's a big no.

So the idea is to just run it to completion if it won't block and use
a fibril if it would?

kevent could support something like that too.


* Re: [PATCH 2 of 4] Introduce i386 fibril scheduling
  2007-02-06 21:16                                     ` David Miller
@ 2007-02-06 21:28                                       ` Linus Torvalds
  2007-02-06 21:31                                         ` David Miller
  0 siblings, 1 reply; 92+ messages in thread
From: Linus Torvalds @ 2007-02-06 21:28 UTC (permalink / raw)
  To: David Miller
  Cc: kent.overstreet, davidel, zach.brown, mingo, linux-kernel,
	linux-aio, suparna, bcrl



On Tue, 6 Feb 2007, David Miller wrote:
> 
> I really think that Evgeniy's kevent is a good event notification
> mechanism for anything, including AIO.
> 
> Events are events, applications want a centralized way to receive and
> process them.

Don't be silly. AIO isn't an event. AIO is an *action*.

The event part is hopefully something that doesn't even *happen*.

Why do people ignore this? Look at a web server: I can pretty much 
guarantee that 99% of all filesystem accesses are cached, and doing them 
as "events" would be a total and utter waste of time.

You want to do them synchronously, as fast as possible, and you do NOT 
want to see them as any kind of asynchronous events.

Yeah, in 1% of all cases it will block, and you'll want to wait for them. 
Maybe the kevent queue works then, but if it needs any more setup than the 
nonblocking case, that's a big no.

		Linus


* Re: [PATCH 2 of 4] Introduce i386 fibril scheduling
  2007-02-06 20:46                                   ` Linus Torvalds
@ 2007-02-06 21:16                                     ` David Miller
  2007-02-06 21:28                                       ` Linus Torvalds
  2007-02-06 22:45                                     ` Kent Overstreet
  1 sibling, 1 reply; 92+ messages in thread
From: David Miller @ 2007-02-06 21:16 UTC (permalink / raw)
  To: torvalds
  Cc: kent.overstreet, davidel, zach.brown, mingo, linux-kernel,
	linux-aio, suparna, bcrl

From: Linus Torvalds <torvalds@linux-foundation.org>
Date: Tue, 6 Feb 2007 12:46:11 -0800 (PST)

> And for user space, it means that we pass the _one_ thing around that we 
> need for both identifying the async operation to the kernel (the "cookie") 
> for wait or cancel, and the place where we expect the return value to be 
> found (which in turn can _easily_ represent a whole "struct aiocb *", 
> since the return value obviously has to be embedded in there anyway).

I really think that Evgeniy's kevent is a good event notification
mechanism for anything, including AIO.

Events are events, applications want a centralized way to receive and
process them.

It's already implemented, and if there are tangible problems with it,
Evgeniy has been excellent at responding to criticism and implementing
suggested changes to the interfaces.


* Re: [PATCH 2 of 4] Introduce i386 fibril scheduling
  2007-02-05 21:31                                 ` Kent Overstreet
  2007-02-06 20:25                                   ` Davide Libenzi
@ 2007-02-06 20:46                                   ` Linus Torvalds
  2007-02-06 21:16                                     ` David Miller
  2007-02-06 22:45                                     ` Kent Overstreet
  1 sibling, 2 replies; 92+ messages in thread
From: Linus Torvalds @ 2007-02-06 20:46 UTC (permalink / raw)
  To: Kent Overstreet
  Cc: Davide Libenzi, Zach Brown, Ingo Molnar,
	Linux Kernel Mailing List, linux-aio, Suparna Bhattacharya,
	Benjamin LaHaise



On Mon, 5 Feb 2007, Kent Overstreet wrote:
> 
> You don't need an explicit cookie if you're passing in a pointer to
> the return code, it doesn't really save you anything to do so. Say
> you've got a bunch of user threads (with or without stacks, it doesn't
> matter).
> 
> struct asys_ret {
>     int ret;
>     struct thread *p;
> };
> 
> struct asys_ret r;
> r.p = me;
> 
> async_read(fd, buf, nbytes, &r);

That's horrible. It means that "r" cannot have automatic linkage (since 
the stack will be *gone* by the time we need to fill in "ret"), so now you 
need to track *two* pointers: "me" and "&r".

Wouldn't it be much better to just track one (both in user space and in 
kernel space)?

In kernel space, the "one pointer" would be the fibril pointer (which 
needs to have all the information necessary for completing the operation 
anyway), and in user space, it would be better to have just the cookie be 
a pointer to the place where you expect the return value (since you need 
both anyway).

I think the point here (for *both* the kernel and user space) would be to 
try to keep the interfaces really easy to use. For the kernel, it means 
that we don't ever pass anything new around: the "fibril" pointer is 
basically defined by the current execution thread.

And for user space, it means that we pass the _one_ thing around that we 
need for both identifying the async operation to the kernel (the "cookie") 
for wait or cancel, and the place where we expect the return value to be 
found (which in turn can _easily_ represent a whole "struct aiocb *", 
since the return value obviously has to be embedded in there anyway).
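
In other words, something like the following sketch from the user side
(async_read() and async_wait() are stand-ins for whatever the syscalls end
up being called, and the struct is the application's own):

struct my_request {
	long	result;		/* the kernel writes this on completion */
	/* ... whatever else the application wants to hang off it ... */
	char	*buf;
};

extern long async_read(int fd, void *buf, unsigned long nbytes, void *cookie);
extern long async_wait(void *cookie);

long read_later(int fd, struct my_request *req)
{
	/* The same pointer identifies the operation (for wait/cancel) and
	 * tells the kernel where to drop the return value. */
	async_read(fd, req->buf, 4096, &req->result);
	/* ... do other work ... */
	return async_wait(&req->result);
}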

		Linus


* Re: [PATCH 2 of 4] Introduce i386 fibril scheduling
  2007-02-05 21:31                                 ` Kent Overstreet
@ 2007-02-06 20:25                                   ` Davide Libenzi
  2007-02-06 20:46                                   ` Linus Torvalds
  1 sibling, 0 replies; 92+ messages in thread
From: Davide Libenzi @ 2007-02-06 20:25 UTC (permalink / raw)
  To: Kent Overstreet
  Cc: Linus Torvalds, Zach Brown, Ingo Molnar,
	Linux Kernel Mailing List, linux-aio, Suparna Bhattacharya,
	Benjamin LaHaise

On Mon, 5 Feb 2007, Kent Overstreet wrote:

> > > HOWEVER, they get returned differently. The cookie gets returned
> > > immediately, the system call result gets returned in-memory only after the
> > > async thing has actually completed.
> > >
> > > I would actually argue that it's not the kernel that should generate any
> > > cookie, but that user-space should *pass*in* the cookie it wants to, and
> > > the kernel should consider it a pointer to a 64-bit entity which is the
> > > return code.
> > 
> > Yes. Let's have the userspace to "mark" the async operation. IMO the
> > cookie should be something transparent to the kernel.
> > Like you said though, that'd require compat-code (unless we fix the size).
> 
> You don't need an explicit cookie if you're passing in a pointer to
> the return code, it doesn't really save you anything to do so. Say
> you've got a bunch of user threads (with or without stacks, it doesn't
> matter).
> 
> struct asys_ret {
>     int ret;
>     struct thread *p;
> };
> 
> struct asys_ret r;
> r.p = me;
> 
> async_read(fd, buf, nbytes, &r);

Hmm, are you working for Symbian? Because that's exactly how they track 
pending async operations (the address of a status variable - wrapped in a 
class, of course, this being them) ;)
That's another way of doing it, IMO no better and no worse than allowing 
explicit cookie selection from userspace. You still have to have the 
compat code though, either way.



- Davide




* Re: [PATCH 2 of 4] Introduce i386 fibril scheduling
  2007-02-06  0:27                   ` Scot McKinley
  2007-02-06  0:48                     ` David Miller
@ 2007-02-06  0:48                     ` Joel Becker
  1 sibling, 0 replies; 92+ messages in thread
From: Joel Becker @ 2007-02-06  0:48 UTC (permalink / raw)
  To: Scot McKinley
  Cc: Linus Torvalds, bert hubert, Davide Libenzi, Ingo Molnar,
	Zach Brown, Linux Kernel Mailing List, linux-aio,
	Suparna Bhattacharya, Benjamin LaHaise

On Mon, Feb 05, 2007 at 04:27:44PM -0800, Scot McKinley wrote:
> Finally, it is agreed that neg-errno is a much better approach for the 
> return code. The threading/concurrency issues associated w/ the current 
> unix errno have always been a buggy area for Oracle Networking code.

	As Scot knows, when Oracle started using the current io_submit(2)
and io_getevents(2), -errno was a big win.

Joel

-- 

"Born under a bad sign.
 I been down since I began to crawl.
 If it wasn't for bad luck,
 I wouldn't have no luck at all."

Joel Becker
Principal Software Developer
Oracle
E-mail: joel.becker@oracle.com
Phone: (650) 506-8127

^ permalink raw reply	[flat|nested] 92+ messages in thread

* Re: [PATCH 2 of 4] Introduce i386 fibril scheduling
  2007-02-06  0:27                   ` Scot McKinley
@ 2007-02-06  0:48                     ` David Miller
  2007-02-06  0:48                     ` Joel Becker
  1 sibling, 0 replies; 92+ messages in thread
From: David Miller @ 2007-02-06  0:48 UTC (permalink / raw)
  To: scot.mckinley
  Cc: torvalds, bert.hubert, davidel, mingo, zach.brown, linux-kernel,
	linux-aio, suparna, bcrl

From: Scot McKinley <scot.mckinley@oracle.com>
Date: Mon, 05 Feb 2007 16:27:44 -0800

> As Joel mentioned earlier, from an Oracle perspective, one of the key 
> things we are looking for is a nice clean *common* wait point.

How much investigation have the Oracle folks (besides Zach :-) done
into Evgeniy's kevent interfaces and how much feedback have they given
to him.

I know it sounds like I'm being a pain in the ass, but it saddens
me that there is this whole large body of work implemented to solve
a problem, the maintainer keeps posting patch sets and the whole
discussion has gone silent.

I'd be quiet if there were some well formulated objections to his work
being posted, but people are posting nothing.  So either it's a
perfect API or people aren't giving it the attention and consideration
it deserves.

^ permalink raw reply	[flat|nested] 92+ messages in thread

* Re: [PATCH 2 of 4] Introduce i386 fibril scheduling
  2007-02-05 21:09                               ` Davide Libenzi
  2007-02-05 21:31                                 ` Kent Overstreet
@ 2007-02-06  0:32                                 ` Davide Libenzi
  1 sibling, 0 replies; 92+ messages in thread
From: Davide Libenzi @ 2007-02-06  0:32 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Zach Brown, Ingo Molnar, Linux Kernel Mailing List, linux-aio,
	Suparna Bhattacharya, Benjamin LaHaise

On Mon, 5 Feb 2007, Davide Libenzi wrote:

> On Mon, 5 Feb 2007, Linus Torvalds wrote:
> 
> > Indeed. One word is *exactly* what a normal system call returns too.
> > 
> > That said, normally we have a user-space library layer to turn that into 
> > the "errno + return value" thing, and in the case of async() calls we 
> > very basically wouldn't have that. So either:
> > 
> >  - we'd need to do it in the kernel (which is actually nasty, since 
> >    different system calls have slightly different semantics - some don't 
> >    return any error value at all, and negative numbers are real numbers)
> > 
> >  - we'd have to teach user space about the "negative errno" mechanism, in 
> >    which case one word really is always enough.
> > 
> > Quite frankly, I much prefer the second alternative. The "negative errno" 
> > thing has not only worked really really well inside the kernel, it's so 
> > obviously 100% superior to the standard UNIX "-1 + errno" approach that 
> > it's not even funny. 
> 
> Currently it's in the syscall wrapper. Couldn't we have it in the 
> asys_teardown_stack() stub?

Eeeek, that was something *really* stupid I said :D



- Davide



^ permalink raw reply	[flat|nested] 92+ messages in thread

* Re: [PATCH 2 of 4] Introduce i386 fibril scheduling
  2007-02-05 21:57                 ` Linus Torvalds
  2007-02-05 22:07                   ` bert hubert
  2007-02-05 22:34                   ` Davide Libenzi
@ 2007-02-06  0:27                   ` Scot McKinley
  2007-02-06  0:48                     ` David Miller
  2007-02-06  0:48                     ` Joel Becker
  2 siblings, 2 replies; 92+ messages in thread
From: Scot McKinley @ 2007-02-06  0:27 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: bert hubert, Davide Libenzi, Ingo Molnar, Zach Brown,
	Linux Kernel Mailing List, linux-aio, Suparna Bhattacharya,
	Benjamin LaHaise


As Joel mentioned earlier, from an Oracle perspective, one of the key 
things we are looking for is a nice clean *common* wait point. We don't 
really care whether this common wait point is the old libaio:async-poll, 
epoll, or "wait_for_async". And if "wait_for_async" has the added 
benefit of scaling, all the better.

However, it is desirable for that common wait-routine to have the 
ability to return explicit completions, instead of requiring a follow-on 
call to some other query/wait for events/completions for each of the 
different types of async submissions done (poll, pid, i/o, ...). 
Obviously not a "must-have", but desirable.

It is also desirable (if possible) to have immediate completions (either 
immediate errs or async submissions that complete synchronously) 
communicated at submission time, instead of via the common wait-routine.

Finally, it is agreed that neg-errno is a much better approach for the 
return code. The threading/concurrency issues associated w/ the current 
unix errno have always been a buggy area for Oracle Networking code.

Regards, -Scot 

Linus Torvalds wrote:

>On Mon, 5 Feb 2007, bert hubert wrote:
>  
>
>>From my end as an application developer, yes please. Either make it
>>perfectly ok to have thousands of outstanding asynchronous system calls (I
>>work with thousands of separate sockets), or allow me to select/poll/epoll
>>on the "async fd".
>>    
>>
>
>No can do.
>
>Allocating an fd is actually too expensive, exactly because a lot of these 
>operations are supposed to be a few hundred ns, and taking locks is simply 
>a bad idea.
>
>But if you want to, we could have a *separate* "convert async cookie to 
>fd" so that you can poll for it, or something.
>
>I doubt very many people want to do that. It would tend to simply be nicer 
>to do
>
>	async(poll);
>	async(waitpid);
>	async(.. wait for anything else ..)
>
>followed by a
>
>	wait_for_async();
>
>That's just a much NICER approach, I would argue. And it automatically 
>and very naturally solves the "wait for different kinds of events" 
>question, in a way that "poll()" never did (except by turning all events 
>into file descriptors or signals).
>
>  
>
>>Alternatively, something like SIGIO ('SIGASYS'?) might be considered, but,
>>well, the fd might be easier.
>>    
>>
>
>Again. NO WAY. Signals are just damn expensive. At most, it would be an 
>option again, but if you want high performance, signals simply aren't very 
>good. They are also a nice way to make your user-space code very racy.
>
>  
>
>>In fact, perhaps the communication channel might simply *be* an fd. Queueing
>>up syscalls sounds remarkably like sending datagrams. 
>>    
>>
>
>I'm the first to say that file descriptors are the UNIX way, but so are 
>processes, and I think this is MUCH better done as a "process" interface. 
>In other words, instead of doing it as a filedescriptor, do it as a 
>"micro-fork/exec", and have the "wait()" equivalent. It's just that we 
>don't fork a "real process", and we don't exec a "real program", we just 
>exec a single system call.
>
>If you think of it in those terms, it all makes sense *without* any file 
>descriptors what-so-ever, and the "wait_for_async()" interface also makes 
>a ton of sense (it really *is* "waitpid()" for the system call).
>
>		Linus
>


^ permalink raw reply	[flat|nested] 92+ messages in thread

* Re: [PATCH 2 of 4] Introduce i386 fibril scheduling
  2007-02-05 21:44                   ` David Miller
@ 2007-02-06  0:15                     ` Davide Libenzi
  0 siblings, 0 replies; 92+ messages in thread
From: Davide Libenzi @ 2007-02-06  0:15 UTC (permalink / raw)
  To: David Miller
  Cc: zach.brown, Ingo Molnar, torvalds, Linux Kernel Mailing List,
	linux-aio, suparna, bcrl

On Mon, 5 Feb 2007, David Miller wrote:

> From: Davide Libenzi <davidel@xmailserver.org>
> Date: Mon, 5 Feb 2007 10:24:34 -0800 (PST)
> 
> > Yes, no need for the above. We can just host a poll/epoll in an async() 
> > operation, and demultiplex once that gets ready.
> 
> I can hear Evgeniy crying 8,000 miles away.
> 
> I strongly encourage a lot of folks commenting in this thread to
> familiarize themselves with kevent and how it handles this stuff.  I
> see a lot of suggestions for things he has totally implemented and
> solved already in kevent.

David, I'm sorry but I only briefly looked at the work Evgeniy did on 
kevent. So excuse me if I say something broken in the next few sentences.
Zab's async syscall interface is a pretty simple one. It accepts the 
syscall number, the parameters for the syscall, and a cookie. It returns a 
syscall result code, and your cookie (that's the meat of it, at least). 
IMO its interface should be optimized for what it does.
Could this submission/retrieval be folded into a "generic" 
submission/retrieval API? Sure you can. But then you end up having 
submission/event structures with 17 members, 3 of which are valid at any 
one time. The API becomes more difficult to use IMO, because suddenly you 
have to know which fields are valid for each event you're submitting/fetching.
IMHO, genericity can be built in userspace, *if* one really wants it and, 
of course, provided that the OS gives you the tools to build it.
The problem before was that it was hard to bridge something like poll/epoll 
with other "by nature" sync operations. Evgeniy's kevent is one attempt, 
Zab's async is another one. IMO the async syscall is a *very powerful* one, 
since it allows for *total coverage* async support w/out "plugs" all over 
the kernel paths.
But as Zab said, the kernel implementation is more important ATM.




- Davide


^ permalink raw reply	[flat|nested] 92+ messages in thread

* Re: [PATCH 2 of 4] Introduce i386 fibril scheduling
  2007-02-05 21:57                 ` Linus Torvalds
  2007-02-05 22:07                   ` bert hubert
@ 2007-02-05 22:34                   ` Davide Libenzi
  2007-02-06  0:27                   ` Scot McKinley
  2 siblings, 0 replies; 92+ messages in thread
From: Davide Libenzi @ 2007-02-05 22:34 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: bert hubert, Ingo Molnar, Zach Brown, Linux Kernel Mailing List,
	linux-aio, Suparna Bhattacharya, Benjamin LaHaise

On Mon, 5 Feb 2007, Linus Torvalds wrote:

> On Mon, 5 Feb 2007, bert hubert wrote:
> > 
> > From my end as an application developer, yes please. Either make it
> > perfectly ok to have thousands of outstanding asynchronous system calls (I
> > work with thousands of separate sockets), or allow me to select/poll/epoll
> > on the "async fd".
> 
> No can do.
> 
> Allocating an fd is actually too expensive, exactly because a lot of these 
> operations are supposed to be a few hundred ns, and taking locks is simply 
> a bad idea.
> 
> But if you want to, we could have a *separate* "convert async cookie to 
> fd" so that you can poll for it, or something.
> 
> I doubt very many people want to do that. It would tend to simply be nicer 
> to do
> 
> 	async(poll);
> 	async(waitpid);
> 	async(.. wait for anything else ..)
> 
> followed by a
> 
> 	wait_for_async();
> 
> That's just a much NICER approach, I would argue. And it automatically 
> and very naturally solves the "wait for different kinds of events" 
> question, in a way that "poll()" never did (except by turning all events 
> into file descriptors or signals).

Bert, that was the first suggestion I gave to Zab. But then I realized 
that a multiplexed poll/epoll can be "hosted" in an async op, just like 
Linus showed above. Will work just fine IMO.
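
(Roughly, from the application's point of view -- async() and wait_for_async() below are placeholder names for whatever the submission and wait calls end up being:)

	/* Placeholder declarations, not a real interface. */
	extern long async(long nr, long *result, ...);
	extern int wait_for_async(void);

	struct epoll_event events[64];	/* <sys/epoll.h> */
	long nready;

	/* The readiness syscall itself becomes just another async op... */
	async(__NR_epoll_wait, &nready, epfd, events, 64, -1);

	/* ...so the one common wait point covers it and everything else. */
	wait_for_async();

	for (int i = 0; i < nready; i++)
		;	/* demultiplex the ready fds as usual */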




- Davide



^ permalink raw reply	[flat|nested] 92+ messages in thread

* Re: [PATCH 2 of 4] Introduce i386 fibril scheduling
  2007-02-05 22:07                   ` bert hubert
@ 2007-02-05 22:15                     ` Zach Brown
  0 siblings, 0 replies; 92+ messages in thread
From: Zach Brown @ 2007-02-05 22:15 UTC (permalink / raw)
  To: bert hubert
  Cc: Linus Torvalds, Davide Libenzi, Ingo Molnar,
	Linux Kernel Mailing List, linux-aio, Suparna Bhattacharya,
	Benjamin LaHaise

> It has me excited in any case. Once anything even remotely testable appears
> (Zach tells me not to try the current code), I'll work it into MTasker
> (http://ds9a.nl/mtasker) and make it power a nameserver that does async i/o,
> for use with very very large zones that aren't preloaded.

I'll be sure to let you know :)

- z

^ permalink raw reply	[flat|nested] 92+ messages in thread

* Re: [PATCH 2 of 4] Introduce i386 fibril scheduling
  2007-02-05 21:57                 ` Linus Torvalds
@ 2007-02-05 22:07                   ` bert hubert
  2007-02-05 22:15                     ` Zach Brown
  2007-02-05 22:34                   ` Davide Libenzi
  2007-02-06  0:27                   ` Scot McKinley
  2 siblings, 1 reply; 92+ messages in thread
From: bert hubert @ 2007-02-05 22:07 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Davide Libenzi, Ingo Molnar, Zach Brown,
	Linux Kernel Mailing List, linux-aio, Suparna Bhattacharya,
	Benjamin LaHaise

On Mon, Feb 05, 2007 at 01:57:15PM -0800, Linus Torvalds wrote:

> I doubt very many people want to do that. It would tend to simply be nicer 
> to do
> 
> 	async(poll);

Yeah - I saw that technique being mentioned later on in the thread, and it
would work, I think.

To make up for the waste of time, some other news. I asked Matt Dillon of
DragonflyBSD why he removed asynchronous system calls from his OS, and he
told me it was because of the problems he had implementing them in the
kernel:

    There were two basic problems:  First, it added a lot of overhead when
    most system calls are either non-blocking anyway (like getpid()). 
    Second and more importantly it was very, very difficult to recode the
    system calls that COULD block to actually be asynchronous in the kernel. 
    I spent some time recoding nanosleep() to operate asynchronously and it
    was a huge mess.

Aside from that, they did not discover any skeletons hidden in the closet,
although from mailing list traffic, I gather the asynchronous system calls
didn't see a lot of use. If I understand it correctly, for a number of years
they emulated asynchronous system calls using threads.

We'd be sidestepping the need to update all syscalls via 'fibrils' of
course.

> If you think of it in those terms, it all makes sense *without* any file 
> descriptors what-so-ever, and the "wait_for_async()" interface also makes 
> a ton of sense (it really *is* "waitpid()" for the system call).

It has me excited in any case. Once anything even remotely testable appears
(Zach tells me not to try the current code), I'll work it into MTasker
(http://ds9a.nl/mtasker) and make it power a nameserver that does async i/o,
for use with very very large zones that aren't preloaded.

	Bert

-- 
http://www.PowerDNS.com      Open source, database driven DNS Software 
http://netherlabs.nl              Open and Closed source services

^ permalink raw reply	[flat|nested] 92+ messages in thread

* Re: [PATCH 2 of 4] Introduce i386 fibril scheduling
  2007-02-05 21:36               ` bert hubert
@ 2007-02-05 21:57                 ` Linus Torvalds
  2007-02-05 22:07                   ` bert hubert
                                     ` (2 more replies)
  0 siblings, 3 replies; 92+ messages in thread
From: Linus Torvalds @ 2007-02-05 21:57 UTC (permalink / raw)
  To: bert hubert
  Cc: Davide Libenzi, Ingo Molnar, Zach Brown,
	Linux Kernel Mailing List, linux-aio, Suparna Bhattacharya,
	Benjamin LaHaise



On Mon, 5 Feb 2007, bert hubert wrote:
> 
> From my end as an application developer, yes please. Either make it
> perfectly ok to have thousands of outstanding asynchronous system calls (I
> work with thousands of separate sockets), or allow me to select/poll/epoll
> on the "async fd".

No can do.

Allocating an fd is actually too expensive, exactly because a lot of these 
operations are supposed to be a few hundred ns, and taking locks is simply 
a bad idea.

But if you want to, we could have a *separate* "convert async cookie to 
fd" so that you can poll for it, or something.

I doubt very many people want to do that. It would tend to simply be nicer 
to do

	async(poll);
	async(waitpid);
	async(.. wait for anything else ..)

followed by a

	wait_for_async();

That's just a much NICER approach, I would argue. And it automatically 
and very naturally solves the "wait for different kinds of events" 
question, in a way that "poll()" never did (except by turning all events 
into file descriptors or signals).

> Alternatively, something like SIGIO ('SIGASYS'?) might be considered, but,
> well, the fd might be easier.

Again. NO WAY. Signals are just damn expensive. At most, it would be an 
option again, but if you want high performance, signals simply aren't very 
good. They are also a nice way to make your user-space code very racy.

> In fact, perhaps the communication channel might simply *be* an fd. Queueing
> up syscalls sounds remarkably like sending datagrams. 

I'm the first to say that file descriptors are the UNIX way, but so are 
processes, and I think this is MUCH better done as a "process" interface. 
In other words, instead of doing it as a filedescriptor, do it as a 
"micro-fork/exec", and have the "wait()" equivalent. It's just that we 
don't fork a "real process", and we don't exec a "real program", we just 
exec a single system call.

If you think of it in those terms, it all makes sense *without* any file 
descriptors what-so-ever, and the "wait_for_async()" interface also makes 
a ton of sense (it really *is* "waitpid()" for the system call).

		Linus

^ permalink raw reply	[flat|nested] 92+ messages in thread

* Re: [PATCH 2 of 4] Introduce i386 fibril scheduling
  2007-02-05 18:24                 ` Davide Libenzi
@ 2007-02-05 21:44                   ` David Miller
  2007-02-06  0:15                     ` Davide Libenzi
  0 siblings, 1 reply; 92+ messages in thread
From: David Miller @ 2007-02-05 21:44 UTC (permalink / raw)
  To: davidel
  Cc: zach.brown, mingo, torvalds, linux-kernel, linux-aio, suparna, bcrl

From: Davide Libenzi <davidel@xmailserver.org>
Date: Mon, 5 Feb 2007 10:24:34 -0800 (PST)

> Yes, no need for the above. We can just host a poll/epoll in an async() 
> operation, and demultiplex once that gets ready.

I can hear Evgeniy crying 8,000 miles away.

I strongly encourage a lot of folks commenting in this thread to
familiarize themselves with kevent and how it handles this stuff.  I
see a lot of suggestions for things he has totally implemented and
solved already in kevent.

I'm not talking about Zach's fibril's, I'm talking about the interface
aspects of these discussions.

^ permalink raw reply	[flat|nested] 92+ messages in thread

* Re: [PATCH 2 of 4] Introduce i386 fibril scheduling
  2007-02-02 23:37             ` Davide Libenzi
  2007-02-03  0:02               ` Davide Libenzi
  2007-02-05 17:12               ` Zach Brown
@ 2007-02-05 21:36               ` bert hubert
  2007-02-05 21:57                 ` Linus Torvalds
  2 siblings, 1 reply; 92+ messages in thread
From: bert hubert @ 2007-02-05 21:36 UTC (permalink / raw)
  To: Davide Libenzi
  Cc: Ingo Molnar, Linus Torvalds, Zach Brown,
	Linux Kernel Mailing List, linux-aio, Suparna Bhattacharya,
	Benjamin LaHaise

On Fri, Feb 02, 2007 at 03:37:09PM -0800, Davide Libenzi wrote:

> Since I still think that the many-thousands potential async operations 
> coming from network sockets are better handled with a classical event 
> mechanism [1], and since smooth integration of new async syscall into the 
> standard POSIX infrastructure is IMO a huge win, I think we need to have a 
> "bridge" to allow async completions being detectable through a pollable 
> (by the mean of select/poll/epoll whatever) device.

> [1] Unless you really want to have thousands of kthreads/fibrils lingering 
>     on the system.

From my end as an application developer, yes please. Either make it
perfectly ok to have thousands of outstanding asynchronous system calls (I
work with thousands of separate sockets), or allow me to select/poll/epoll
on the "async fd".

Alternatively, something like SIGIO ('SIGASYS'?) might be considered, but,
well, the fd might be easier.

In fact, perhaps the communication channel might simply *be* an fd. Queueing
up syscalls sounds remarkably like sending datagrams. 

	Bert

-- 
http://www.PowerDNS.com      Open source, database driven DNS Software 
http://netherlabs.nl              Open and Closed source services

^ permalink raw reply	[flat|nested] 92+ messages in thread

* Re: [PATCH 2 of 4] Introduce i386 fibril scheduling
  2007-02-05 21:09                               ` Davide Libenzi
@ 2007-02-05 21:31                                 ` Kent Overstreet
  2007-02-06 20:25                                   ` Davide Libenzi
  2007-02-06 20:46                                   ` Linus Torvalds
  2007-02-06  0:32                                 ` Davide Libenzi
  1 sibling, 2 replies; 92+ messages in thread
From: Kent Overstreet @ 2007-02-05 21:31 UTC (permalink / raw)
  To: Davide Libenzi
  Cc: Linus Torvalds, Zach Brown, Ingo Molnar,
	Linux Kernel Mailing List, linux-aio, Suparna Bhattacharya,
	Benjamin LaHaise

> > HOWEVER, they get returned differently. The cookie gets returned
> > immediately, the system call result gets returned in-memory only after the
> > async thing has actually completed.
> >
> > I would actually argue that it's not the kernel that should generate any
> > cookie, but that user-space should *pass*in* the cookie it wants to, and
> > the kernel should consider it a pointer to a 64-bit entity which is the
> > return code.
>
> Yes. Let's have userspace "mark" the async operation. IMO the
> cookie should be something transparent to the kernel.
> Like you said though, that'd require compat-code (unless we fix the size).

You don't need an explicit cookie if you're passing in a pointer to
the return code, it doesn't really save you anything to do so. Say
you've got a bunch of user threads (with or without stacks, it doesn't
matter).

struct asys_ret {
     int ret;
     struct thread *p;
};

struct asys_ret r;
r.p = me;

async_read(fd, buf, nbytes, &r);

Then you just have your async_getevents return the same pointers you
passed in, and your main event loop gets pointers to its threads for
free.
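
(Sketching that collection side -- async_getevents() and resume_thread() are placeholders for whatever the final interface and the userspace threading package provide:)

	/* Assumed: the kernel hands back the same asys_ret pointers that
	 * were passed in at submission time. */
	extern int async_getevents(struct asys_ret **events, int max);

	struct asys_ret *done[64];
	int n = async_getevents(done, 64);

	for (int i = 0; i < n; i++) {
		/* done[i]->ret holds the syscall's return value,
		 * done[i]->p is the user thread waiting on it. */
		resume_thread(done[i]->p);
	}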

It seems cleaner to do it this way vs. returning structs with the
actual return code and a cookie, as threads get the return code
exactly where they want it.

Keep in mind that the epoll way (while great for epoll, I do love it)
makes sense because it doesn't have to deal with any sort of return
codes.

My only other point is that you really do want a bulk asys_submit
instead of doing a syscall per async syscall; one of the great wins of
this approach is that heavily IO-driven apps can batch up syscalls.
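
(For illustration, a batched submit could be as simple as the following; struct asys_input is the name from Zach's patches, but the layout and the asys_submit() prototype here are guesses:)

	struct asys_input {
		long		syscall_nr;
		unsigned long	cookie;
		unsigned long	args[6];
	};

	extern long asys_submit(struct asys_input *calls, unsigned long nr);

	struct asys_input batch[128];
	int queued = build_batch(batch, 128);	/* app-specific: reads, polls, ... */

	asys_submit(batch, queued);		/* one trap submits them all */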

^ permalink raw reply	[flat|nested] 92+ messages in thread

* Re: [PATCH 2 of 4] Introduce i386 fibril scheduling
  2007-02-05 20:39                             ` Linus Torvalds
  2007-02-05 21:09                               ` Davide Libenzi
@ 2007-02-05 21:21                               ` Zach Brown
  1 sibling, 0 replies; 92+ messages in thread
From: Zach Brown @ 2007-02-05 21:21 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Davide Libenzi, Ingo Molnar, Linux Kernel Mailing List,
	linux-aio, Suparna Bhattacharya, Benjamin LaHaise

> - we'd need to do it in the kernel (which is actually nasty, since 
>   different system calls have slightly different semantics - some don't 
>   return any error value at all, and negative numbers are real numbers)
>
> - we'd have to teach user space about the "negative errno" mechanism, in 
>   which case one word really is always enough.
>
> Quite frankly, I much prefer the second alternative. The "negative errno" 
> thing has not only worked really really well inside the kernel, it's so 
> obviously 100% superior to the standard UNIX "-1 + errno" approach that 
> it's not even funny.

I agree, and I imagine you'd have a hard time finding someone who  
actually *likes* the errno convention :)

> I would actually argue that it's not the kernel that should generate any 
> cookie, but that user-space should *pass*in* the cookie it wants to, and 
> the kernel should consider it a pointer to a 64-bit entity which is the 
> return code.

Yup.  That's how the current code (and epoll, and fs/aio.c, and..) work.

Cancelation comes into this discussion, I think.  Hopefully it's 
reasonable to expect userspace to be able to manage cookies well  
enough that they can use them to issue cancels and only hit the ops  
they intend to.  It means we have to give them the tools to  
differentiate between a racing completion and cancelation so they can  
reuse a cookie at the right time, but that doesn't sound fatal.

>  - make everything use 64-bit values.

This would be my preference.

> Now, making an architecture-independent system call enumeration may 
> actually make sense regardless, because it would allow sys_async() to have 
> its own system call table and put the limitations and rules for those 
> system calls there, instead of depending on the per-architecture system 
> call table that tends to have some really architecture-specific details.

Maybe, sure.  I don't have a lot of insight into this.  Hopefully  
some arch maintainers can jump in?

- z

^ permalink raw reply	[flat|nested] 92+ messages in thread

* Re: [PATCH 2 of 4] Introduce i386 fibril scheduling
  2007-02-05 20:39                             ` Linus Torvalds
@ 2007-02-05 21:09                               ` Davide Libenzi
  2007-02-05 21:31                                 ` Kent Overstreet
  2007-02-06  0:32                                 ` Davide Libenzi
  2007-02-05 21:21                               ` Zach Brown
  1 sibling, 2 replies; 92+ messages in thread
From: Davide Libenzi @ 2007-02-05 21:09 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Zach Brown, Ingo Molnar, Linux Kernel Mailing List, linux-aio,
	Suparna Bhattacharya, Benjamin LaHaise

On Mon, 5 Feb 2007, Linus Torvalds wrote:

> Indeed. One word is *exactly* what a normal system call returns too.
> 
> That said, normally we have a user-space library layer to turn that into 
> the "errno + return value" thing, and in the case of async() calls we 
> very basically wouldn't have that. So either:
> 
>  - we'd need to do it in the kernel (which is actually nasty, since 
>    different system calls have slightly different semantics - some don't 
>    return any error value at all, and negative numbers are real numbers)
> 
>  - we'd have to teach user space about the "negative errno" mechanism, in 
>    which case one word really is always enough.
> 
> Quite frankly, I much prefer the second alternative. The "negative errno" 
> thing has not only worked really really well inside the kernel, it's so 
> obviously 100% superior to the standard UNIX "-1 + errno" approach that 
> it's not even funny. 

Currently it's in the syscall wrapper. Couldn't we have it in the 
asys_teardown_stack() stub?



> HOWEVER, they get returned differently. The cookie gets returned 
> immediately, the system call result gets returned in-memory only after the 
> async thing has actually completed.
> 
> I would actually argue that it's not the kernel that should generate any 
> cookie, but that user-space should *pass*in* the cookie it wants to, and 
> the kernel should consider it a pointer to a 64-bit entity which is the 
> return code.

Yes. Let's have userspace "mark" the async operation. IMO the 
cookie should be something transparent to the kernel.
Like you said though, that'd require compat-code (unless we fix the size).



- Davide



^ permalink raw reply	[flat|nested] 92+ messages in thread

* Re: [PATCH 2 of 4] Introduce i386 fibril scheduling
  2007-02-05 20:21                             ` Zach Brown
@ 2007-02-05 20:42                               ` Linus Torvalds
  0 siblings, 0 replies; 92+ messages in thread
From: Linus Torvalds @ 2007-02-05 20:42 UTC (permalink / raw)
  To: Zach Brown
  Cc: Davide Libenzi, Ingo Molnar, Linux Kernel Mailing List,
	linux-aio, Suparna Bhattacharya, Benjamin LaHaise



On Mon, 5 Feb 2007, Zach Brown wrote:
> 
> For syscalls, sure.
> 
> The kevent work incorporates Uli's desire to have more data per event.  Have
> you read his OLS stuff?  It's been a while since I did so I've lost the
> details of why he cares to have more.

You'd still do that as _arguments_ to the system call, not as the return 
value.

Also, quite frankly, I tend to find Uli over-designs things. The whole 
disease of "make things general" is a CS disease that some people take to 
extreme.

The good thing about generic code is not that it solves some generic 
problem. The good thing about generics is that they mean that you can 
_avoid_ solving other problems AND HAVE LESS CODE. But some people seem to 
think that "generic" means that you have to have tons of code to handle 
all the possible cases, and that *completely* misses the point.

We want less code. The whole (and really, the _only_) point of the 
fibrils, at least as far as I'm concerned, is to *not* have special code 
for aio_read/write/whatever.

		Linus

^ permalink raw reply	[flat|nested] 92+ messages in thread

* Re: [PATCH 2 of 4] Introduce i386 fibril scheduling
  2007-02-05 20:10                           ` Davide Libenzi
  2007-02-05 20:21                             ` Zach Brown
@ 2007-02-05 20:39                             ` Linus Torvalds
  2007-02-05 21:09                               ` Davide Libenzi
  2007-02-05 21:21                               ` Zach Brown
  1 sibling, 2 replies; 92+ messages in thread
From: Linus Torvalds @ 2007-02-05 20:39 UTC (permalink / raw)
  To: Davide Libenzi
  Cc: Zach Brown, Ingo Molnar, Linux Kernel Mailing List, linux-aio,
	Suparna Bhattacharya, Benjamin LaHaise



On Mon, 5 Feb 2007, Davide Libenzi wrote:
>
> No, that's *really* it ;)
>
> The cookie you pass, and the return code of the syscall.
> Is there other data transferred? Sure, but that data is transferred during the 
> syscall processing, and handled by the syscall (filling up a sys_read 
> buffer, just for example).

Indeed. One word is *exactly* what a normal system call returns too.

That said, normally we have a user-space library layer to turn that into 
the "errno + return value" thing, and in the case of async() calls we 
very basically wouldn't have that. So either:

 - we'd need to do it in the kernel (which is actually nasty, since 
   different system calls have slightly different semantics - some don't 
   return any error value at all, and negative numbers are real numbers)

 - we'd have to teach user space about the "negative errno" mechanism, in 
   which case one word really is always enough.

Quite frankly, I much prefer the second alternative. The "negative errno" 
thing has not only worked really really well inside the kernel, it's so 
obviously 100% superior to the standard UNIX "-1 + errno" approach that 
it's not even funny. 

To see why "negative errno" is better, just look at any threaded program, 
or look at any program that does multiple calls and needs to save the 
errno not from the last one, but from some earlier one (eg, it does a 
"close()" in between returning *its* error, and the real operation that 
we care about).
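
(A plain C illustration of that pitfall, nothing async-specific about it; needs <errno.h>, <stdio.h>, <string.h>, <unistd.h>:)

	static void read_and_close(int fd, char *buf, size_t len)
	{
		ssize_t n = read(fd, buf, len);
		int saved_errno = errno;	/* has to be saved by hand... */

		close(fd);			/* ...because this can clobber errno */

		if (n < 0)
			fprintf(stderr, "read: %s\n", strerror(saved_errno));
	}

	/* With the in-kernel convention the result is one self-describing
	 * word: n is either the byte count or -EIO/-EAGAIN/..., and no
	 * later call can stomp on it. */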

> Did I miss something? The async() syscall will allow (with a few 
> restrictions) executing whatever syscall in an async fashion. A syscall 
> returns a result code (long). Plus, you need to pass back the 
> userspace-provided cookie of course.

HOWEVER, they get returned differently. The cookie gets returned 
immediately, the system call result gets returned in-memory only after the 
async thing has actually completed.

I would actually argue that it's not the kernel that should generate any 
cookie, but that user-space should *pass*in* the cookie it wants to, and 
the kernel should consider it a pointer to a 64-bit entity which is the 
return code.

In other words, the only "cookie" we need is literally the pointer to the 
results. And that's obviously something that the user space has to set up 
anyway.

So how about making the async interface be:

	// returns negative for error
	// zero for "synchronous"
	// positive kernel "wait for me" cookie for success
	long sys_async_submit(
		unsigned long flags,
		long *user_result_ptr,
		long syscall,
		unsigned long *args);

and the "args" thing would literally just fill up the registers.
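
(To make that concrete, a call site might look something like this; the declaration and the argument packing are illustrative assumptions, not a settled ABI:)

	extern long sys_async_submit(unsigned long flags, long *user_result_ptr,
				     long syscall, unsigned long *args);

	long result;
	unsigned long args[6] = { fd, (unsigned long)buf, nbytes };

	long cookie = sys_async_submit(0, &result, __NR_read, args);

	if (cookie < 0)
		;	/* submission failed outright */
	else if (cookie == 0)
		;	/* ran synchronously: result already holds value or -errno */
	else
		;	/* stash the cookie, collect the completion later */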

The real downside here is that it's very architecture-specific this way, 
and that means that x86-64 (and other 64-bit ones) would need to have 
emulation layers for the 32-bit ones, but they likely need to do that 
*anyway*, so it's probably not a huge downside. The alternative is to:

 - make a new architecture-independent system call enumeration for the 
   async interface

 - make everything use 64-bit values.
		
Now, making an architecture-independent system call enumeration may 
actually make sense regardless, because it would allow sys_async() to have 
its own system call table and put the limitations and rules for those 
system calls there, instead of depending on the per-architecture system 
call table that tends to have some really architecture-specific details.

Hmm?

		Linus

^ permalink raw reply	[flat|nested] 92+ messages in thread

* Re: [PATCH 2 of 4] Introduce i386 fibril scheduling
  2007-02-05 20:10                           ` Davide Libenzi
@ 2007-02-05 20:21                             ` Zach Brown
  2007-02-05 20:42                               ` Linus Torvalds
  2007-02-05 20:39                             ` Linus Torvalds
  1 sibling, 1 reply; 92+ messages in thread
From: Zach Brown @ 2007-02-05 20:21 UTC (permalink / raw)
  To: Davide Libenzi
  Cc: Ingo Molnar, Linus Torvalds, Linux Kernel Mailing List,
	linux-aio, Suparna Bhattacharya, Benjamin LaHaise

> No, that's *really* it ;)

For syscalls, sure.

The kevent work incorporates Uli's desire to have more data per  
event.  Have you read his OLS stuff?  It's been a while since I did  
so I've lost the details of why he cares to have more.

Let me say it again, maybe a little louder this time:  I'm not  
interested in worrying about this aspect of the API until the  
scheduler mechanics are more solidified.

- z

^ permalink raw reply	[flat|nested] 92+ messages in thread

* Re: [PATCH 2 of 4] Introduce i386 fibril scheduling
  2007-02-05 19:41                         ` Zach Brown
@ 2007-02-05 20:10                           ` Davide Libenzi
  2007-02-05 20:21                             ` Zach Brown
  2007-02-05 20:39                             ` Linus Torvalds
  0 siblings, 2 replies; 92+ messages in thread
From: Davide Libenzi @ 2007-02-05 20:10 UTC (permalink / raw)
  To: Zach Brown
  Cc: Ingo Molnar, Linus Torvalds, Linux Kernel Mailing List,
	linux-aio, Suparna Bhattacharya, Benjamin LaHaise

On Mon, 5 Feb 2007, Zach Brown wrote:

> > The "result" of one async operation is basically a cookie and a result
> > code. Eight or sixteen bytes at most.
> 
> s/basically/minimally/
> 
> Well, yeah.  The patches I sent had:
> 
> struct asys_completion {
>        long            return_code;
>        unsigned long   cookie;
> };
> 
> That's as stupid as it gets.

No, that's *really* it ;)
The cookie you pass, and the return code of the syscall.
Is there other data transferred? Sure, but that data is transferred during the 
syscall processing, and handled by the syscall (filling up a sys_read 
buffer, just for example).




> > IMO, before going wacko designing
> > complex shared userspace-kernel result buffers, I think it'd be better
> > measuring the worth-value of the thing ;)
> 
> Obviously, yes.
> 
> The potential win is to be able to have one place to wait for collection from
> multiple sources.  Some of them might want more data per event.  They can
> always indirect out via a cookie pointer,  sure, but at insanely high message
> rates (10gige small messages) one might not want that.

Did I miss something? The async() syscall will allow (with a few 
restrictions) executing whatever syscall in an async fashion. A syscall 
returns a result code (long). Plus, you need to pass back the 
userspace-provided cookie of course. A cookie is very likely a direct 
pointer to the userspace session the async syscall applies to, so a
"(my_session *) results[i].cookie" will bring you directly on topic.
Collection of multiple sources? What do you mean? What's wrong with:

int async_wait(struct asys_completion *results, int nresults);

Is saving an 8/16-byte double copy worth going wacko designing shared 
userspace/kernel buffers, when the syscall that lies behind an 
asys_completion is prolly touching KBs of RAM during its execution?
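
(In code, the collection loop that implies is about this much; my_session and handle_completion() are made-up application names:)

	struct asys_completion results[64];
	int n, i;

	n = async_wait(results, 64);	/* prototype as above */

	for (i = 0; i < n; i++) {
		/* the cookie is simply the session pointer we submitted with */
		struct my_session *s = (struct my_session *)results[i].cookie;

		handle_completion(s, results[i].return_code);	/* value or -errno */
	}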




- Davide



^ permalink raw reply	[flat|nested] 92+ messages in thread

* Re: [PATCH 2 of 4] Introduce i386 fibril scheduling
  2007-02-05 19:26                       ` Davide Libenzi
@ 2007-02-05 19:41                         ` Zach Brown
  2007-02-05 20:10                           ` Davide Libenzi
  0 siblings, 1 reply; 92+ messages in thread
From: Zach Brown @ 2007-02-05 19:41 UTC (permalink / raw)
  To: Davide Libenzi
  Cc: Ingo Molnar, Linus Torvalds, Linux Kernel Mailing List,
	linux-aio, Suparna Bhattacharya, Benjamin LaHaise

> The "result" of one async operation is basically a cookie and a result
> code. Eight or sixteen bytes at most.

s/basically/minimally/

Well, yeah.  The patches I sent had:

struct asys_completion {
         long            return_code;
         unsigned long   cookie;
};

That's as stupid as it gets.

> IMO, before going wacko designing
> complex shared userspace-kernel result buffers, I think it'd be better
> measuring the worth-value of the thing ;)

Obviously, yes.

The potential win is to be able to have one place to wait for  
collection from multiple sources.  Some of them might want more data  
per event.  They can always indirect out via a cookie pointer,  sure,  
but at insanely high message rates (10gige small messages) one might  
not want that.

See also: the kevent thread.

- z

^ permalink raw reply	[flat|nested] 92+ messages in thread

* Re: [PATCH 2 of 4] Introduce i386 fibril scheduling
  2007-02-05 19:20                 ` Zach Brown
@ 2007-02-05 19:38                   ` Davide Libenzi
  0 siblings, 0 replies; 92+ messages in thread
From: Davide Libenzi @ 2007-02-05 19:38 UTC (permalink / raw)
  To: Zach Brown
  Cc: Ingo Molnar, Linus Torvalds, Linux Kernel Mailing List,
	linux-aio, Suparna Bhattacharya, Benjamin LaHaise

On Mon, 5 Feb 2007, Zach Brown wrote:

> > Or we need some sort of enter_context()/leave_context() (adopt mm, files,
> > ...) to have a per-CPU kthread to be able to execute the syscall from the
> > async() caller context.
> 
> I believe that's what Ingo is hoping for, yes.

Ok, but then we should ask ourselves if it's really worth having a 
per-CPU pool (which will require quite a few changes to the current way 
of doing things), or a per-process pool (which would basically work as is). 
What advantage does a per-CPU pool give us?
Setup cost? Not really IMO. Thread creation is pretty cheap, and a typical 
process using async will have a pretty huge lifespan (compared to the pool 
creation cost).
Configurability scores for a per-process pool, because it may allow each 
process (eventually) to size its own.
What's the real point in favour of a per-CPU pool that justifies all the 
changes that will have to be done in order to adopt such a concept?



- Davide



^ permalink raw reply	[flat|nested] 92+ messages in thread

* Re: [PATCH 2 of 4] Introduce i386 fibril scheduling
  2007-02-05 17:44                     ` Zach Brown
@ 2007-02-05 19:26                       ` Davide Libenzi
  2007-02-05 19:41                         ` Zach Brown
  0 siblings, 1 reply; 92+ messages in thread
From: Davide Libenzi @ 2007-02-05 19:26 UTC (permalink / raw)
  To: Zach Brown
  Cc: Ingo Molnar, Linus Torvalds, Linux Kernel Mailing List,
	linux-aio, Suparna Bhattacharya, Benjamin LaHaise

On Mon, 5 Feb 2007, Zach Brown wrote:

> > The normal and most optimal workflow should be a user-space ring-buffer
> > of these constant-size struct async_syscall entries:
> > 
> >  struct async_syscall ringbuffer[1024];
> > 
> >  LIST_HEAD(submitted);
> >  LIST_HEAD(pending);
> >  LIST_HEAD(completed);
> 
> I strongly disagree here, and I'm hoping you're not as keen on this now --
> your reply to Matt gives me hope.
> 
> As mentioned, that they complete out-of-order leads, at least, to having
> separate submission and completion rings.  I'm not sure a submission ring
> makes any sense given the goal of processing the calls in submission and only
> creating threads if it blocks.  A simple copy of an array of these input
> structs sounds fine to me.

The "result" of one async operation is basically a cookie and a result 
code. Eight or sixteen bytes at most. IMO, before going wacko designing 
complex shared userspace-kernel result buffers, I think it'd be better 
measuring the worth-value of the thing ;)



- Davide



^ permalink raw reply	[flat|nested] 92+ messages in thread

* Re: [PATCH 2 of 4] Introduce i386 fibril scheduling
  2007-02-05 18:52               ` Davide Libenzi
@ 2007-02-05 19:20                 ` Zach Brown
  2007-02-05 19:38                   ` Davide Libenzi
  0 siblings, 1 reply; 92+ messages in thread
From: Zach Brown @ 2007-02-05 19:20 UTC (permalink / raw)
  To: Davide Libenzi
  Cc: Ingo Molnar, Linus Torvalds, Linux Kernel Mailing List,
	linux-aio, Suparna Bhattacharya, Benjamin LaHaise

> Or we need some sort of enter_context()/leave_context() (adopt mm, files, 
> ...) to have a per-CPU kthread to be able to execute the syscall from the 
> async() caller context.

I believe that's what Ingo is hoping for, yes.

- z

^ permalink raw reply	[flat|nested] 92+ messages in thread

* Re: [PATCH 2 of 4] Introduce i386 fibril scheduling
  2007-02-05 17:02             ` Zach Brown
@ 2007-02-05 18:52               ` Davide Libenzi
  2007-02-05 19:20                 ` Zach Brown
  0 siblings, 1 reply; 92+ messages in thread
From: Davide Libenzi @ 2007-02-05 18:52 UTC (permalink / raw)
  To: Zach Brown
  Cc: Ingo Molnar, Linus Torvalds, Linux Kernel Mailing List,
	linux-aio, Suparna Bhattacharya, Benjamin LaHaise

On Mon, 5 Feb 2007, Zach Brown wrote:

> > The 'pool' of kernel threads doesn't even have to be per-task, it can be
> > a natural per-CPU thing
> 
> Yeah, absolutely.

Hmmm, so we issue an async sys_read(): what will a get_file(fd) return for 
a per-CPU kthread executing such a syscall? Unless we teach context_switch() 
to do an inherit-trick for "files" (even in that case, it won't work if 
we switch from another context). And is that all there is to it?
IMO it's got to be either a per-process thread pool or a fibril approach.
Or we need some sort of enter_context()/leave_context() (adopt mm, files, 
...) to have a per-CPU kthread to be able to execute the syscall from the 
async() caller context. Hmmm?
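
(Very roughly, and only as a sketch of the idea -- enter_context()/leave_context() don't exist, and while the helpers borrowed here (use_mm(), get_files_struct(), ...) do, the locking and lifetime details are glossed over:)

	/* Sketch: a per-CPU worker temporarily adopting the submitter's context. */
	static struct files_struct *enter_context(struct task_struct *caller)
	{
		struct files_struct *old = current->files;

		use_mm(caller->mm);			/* run in the caller's address space */
		current->files = get_files_struct(caller); /* so fget(fd) sees its table */
		return old;
	}

	static void leave_context(struct task_struct *caller, struct files_struct *old)
	{
		put_files_struct(current->files);
		current->files = old;
		unuse_mm(caller->mm);
	}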



- Davide



^ permalink raw reply	[flat|nested] 92+ messages in thread

* Re: [PATCH 2 of 4] Introduce i386 fibril scheduling
  2007-02-05 17:12               ` Zach Brown
@ 2007-02-05 18:24                 ` Davide Libenzi
  2007-02-05 21:44                   ` David Miller
  0 siblings, 1 reply; 92+ messages in thread
From: Davide Libenzi @ 2007-02-05 18:24 UTC (permalink / raw)
  To: Zach Brown
  Cc: Ingo Molnar, Linus Torvalds, Linux Kernel Mailing List,
	linux-aio, Suparna Bhattacharya, Benjamin LaHaise

On Mon, 5 Feb 2007, Zach Brown wrote:

> > Since I still think that the many-thousands potential async operations
> > coming from network sockets are better handled with a classical event
> > mechanism [1], and since smooth integration of new async syscall into the
> > standard POSIX infrastructure is IMO a huge win, I think we need to have a
> > "bridge" to allow async completions being detectable through a pollable
> > (by the mean of select/poll/epoll whatever) device.
> 
> Ugh, I'd rather not if we don't have to.
> 
> It seems like you could get this behaviour from issuing a
> poll/select(really?)/epoll as one of the async calls to complete.  (And you
> mention this in a later email? :))

Yes, no need for the above. We can just host a poll/epoll in an async() 
operation, and demultiplex once that gets ready.



- Davide



^ permalink raw reply	[flat|nested] 92+ messages in thread

* Re: [PATCH 2 of 4] Introduce i386 fibril scheduling
  2007-02-04  5:12   ` Davide Libenzi
@ 2007-02-05 17:54     ` Zach Brown
  0 siblings, 0 replies; 92+ messages in thread
From: Zach Brown @ 2007-02-05 17:54 UTC (permalink / raw)
  To: Davide Libenzi
  Cc: Linux Kernel Mailing List, linux-aio, Suparna Bhattacharya,
	Benjamin LaHaise, Linus Torvalds

>
>
>> +	current->per_call = next->per_call;
>
> Pointer instead of structure copy?

Sure, there are lots of trade-offs there, but the story changes if we  
keep the 1:1 relationship between task_struct and thread_info.

- z

^ permalink raw reply	[flat|nested] 92+ messages in thread

* Re: [PATCH 2 of 4] Introduce i386 fibril scheduling
  2007-02-03  8:23                   ` Ingo Molnar
  2007-02-03  9:25                     ` Matt Mackall
@ 2007-02-05 17:44                     ` Zach Brown
  2007-02-05 19:26                       ` Davide Libenzi
  1 sibling, 1 reply; 92+ messages in thread
From: Zach Brown @ 2007-02-05 17:44 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Linus Torvalds, linux-kernel, linux-aio, Suparna Bhattacharya,
	Benjamin LaHaise

> But really, being a scheduler guy i was much more concerned about the
> duplication and problems caused by the fibril concept itself - which
> duplication and complexity makes up 80% of Zach's submitted patchset.
> For example this bit:
>
>    [PATCH 3 of 4] Teach paths to wake a specific void * target
>
> would totally go away if we used kernel threads for this.

Uh, would it?  Are you talking about handing off the *task_struct*  
that it was submitted under to each worker thread that inherits the  
stack?

I guess I hadn't considered going that far.  I had somehow  
constructed a block in my mind that we couldn't release the  
task_struct from the submitting task.  But maybe we can be clever  
enough with the task_struct updating that userspace wouldn't notice a  
significant change.

Hmm.

> i totally agree that the API /should/ be the main focus - but i didnt
> pick the topic and most of the patchset's current size is due to  
> the IMO
> avoidable fibril concept.

I, too, totally agree.  I didn't even approach the subject for  
exactly the reason you allude to -- I wanted to get the hard parts of  
the kernel side right first.

> regarding the API, i dont really agree with the current form and  
> design
> of Zach's interface.

Haha, well, yes, of course.  You couldn't have thought that the dirt-stupid 
sys_asys_wait_for_completion() was anything more than simple 
scaffolding to test the kernel bits.

> fundamentally, the basic entity of this thing should be a /system call/, 
> not the artificial fibril thing:
>
>   +struct asys_call {
>   +       struct asys_result      *result;
>   +       struct fibril           fibril;
>   +};

You picked a weird struct to highlight here.  struct asys_input seems  
more related to the stuff you go on to discuss below.  This asys_call  
struct is a relatively irrelevant internal detail of how  
asys_teardown_stack() gets from a fibril to the pre-allocated  
completion state once the call has returned.

> The normal and most optimal workflow should be a user-space ring-buffer 
> of these constant-size struct async_syscall entries:
>
>   struct async_syscall ringbuffer[1024];
>
>   LIST_HEAD(submitted);
>   LIST_HEAD(pending);
>   LIST_HEAD(completed);

I strongly disagree here, and I'm hoping you're not as keen on this  
now -- your reply to Matt gives me hope.

As mentioned, that they complete out-of-order leads, at least, to  
having separate submission and completion rings.  I'm not sure a  
submission ring makes any sense given the goal of processing the  
calls in submission and only creating threads if it blocks.  A simple  
copy of an array of these input structs sounds fine to me.

When I think about the completion side I tend to hope we can end up  
with something like what VJ talked about in his net channels work: 
producer/consumer rings with head and tail pointers in different  
cache lines.  AFAIK the kevent work has headed in that direction, but  
I haven't kept up.  Uli has certainly mentioned it in his 'ec' (event  
channels) proposals.
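
(For reference, the sort of layout being alluded to -- a generic single-producer/single-consumer ring, not something from the posted patches:)

	struct completion_ring {
		unsigned int		head ____cacheline_aligned;	/* producer-written */
		unsigned int		tail ____cacheline_aligned;	/* consumer-written */
		struct asys_completion	entries[1024];
	};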

The posix AIO list completion and, sadly, signals on completion need  
to be considered, too.

Honestly, though, I'm not worried about this part.  We'll easily come  
to an agreement.  I'm just not going to distract myself with it until  
we're happy with the scheduler side.

> Looks like we can hit many birds with this single stone: AIO, vectored
> syscalls, finegrained system-call parallelism. Hm?

Hmm, indeed.  Some flags could let userspace tell the kernel not to  
bother with all this threading/concurrency/aio nonsense and just  
issue them serially.  It'll sound nuts in these days of cheap  
syscalls and vsyscall helpers, but some Oracle folks might love this  
for issuing a gettimeofday() pair around syscalls they want to profile.

I hadn't considered that as a potential property of this interface.

- z

^ permalink raw reply	[flat|nested] 92+ messages in thread

* Re: [PATCH 2 of 4] Introduce i386 fibril scheduling
  2007-02-02 23:37             ` Davide Libenzi
  2007-02-03  0:02               ` Davide Libenzi
@ 2007-02-05 17:12               ` Zach Brown
  2007-02-05 18:24                 ` Davide Libenzi
  2007-02-05 21:36               ` bert hubert
  2 siblings, 1 reply; 92+ messages in thread
From: Zach Brown @ 2007-02-05 17:12 UTC (permalink / raw)
  To: Davide Libenzi
  Cc: Ingo Molnar, Linus Torvalds, Linux Kernel Mailing List,
	linux-aio, Suparna Bhattacharya, Benjamin LaHaise

> Since I still think that the many-thousands potential async operations
> coming from network sockets are better handled with a classical event
> mechanism [1], and since smooth integration of new async syscall into the
> standard POSIX infrastructure is IMO a huge win, I think we need to have a
> "bridge" to allow async completions being detectable through a pollable
> (by the mean of select/poll/epoll whatever) device.

Ugh, I'd rather not if we don't have to.

It seems like you could get this behaviour from issuing a poll/select 
(really?)/epoll as one of the async calls to complete.  (And you  
mention this in a later email? :))

Part of my thinking on this is that we might want it to be really  
trivial to create and wait on groups of ops.. maybe as a context.   
One of the things posix AIO wants is the notion of submitting and  
waiting on a group of ops as a "list".  That sounds like we might be  
able to implement it by issuing ops against a context, created as  
part of the submission, and then waiting for it to drain.

Being able to wait on that with file->poll() obviously requires  
juggling file-> associations which sounds like more weight than we  
might want.  Or it'd be optional and we'd get more moving parts and  
divergent paths to test.

So, sure, it's possible and not terribly difficult, but I'd rather  
avoid it if people can be convinced to get the same behaviour by  
issuing an async instance of their favourite readiness syscall.

- z

^ permalink raw reply	[flat|nested] 92+ messages in thread

* Re: [PATCH 2 of 4] Introduce i386 fibril scheduling
  2007-02-02 22:21           ` Ingo Molnar
  2007-02-02 22:49             ` Linus Torvalds
  2007-02-02 23:37             ` Davide Libenzi
@ 2007-02-05 17:02             ` Zach Brown
  2007-02-05 18:52               ` Davide Libenzi
  2 siblings, 1 reply; 92+ messages in thread
From: Zach Brown @ 2007-02-05 17:02 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Linus Torvalds, linux-kernel, linux-aio, Suparna Bhattacharya,
	Benjamin LaHaise

> ok, i think i noticed another misunderstanding. The kernel thread based
> scheme i'm suggesting would /not/ 'switch' to another kernel thread in
> the cached case, by default. It would just execute in the original
> context (as if it were a synchronous syscall), and the switch to a
> kernel thread from the pool would only occur /if/ the context is about
> to block. (this 'switch' thing would be done by the scheduler)

Yeah, this is what I imagined when you described doing this with  
threads instead of these 'fibril' things.

It sounds like you're suggesting that we keep the 1:1 relationship  
between task_struct and thread_info.  That would avoid the risks that  
the current fibril approach brings.  It insists that all of  
task_struct is shared between concurrent fibrils (even if only  
between blocking points).  As I understand what Ingo is suggesting,  
we'd instead only explicitly share the fields that we migrate (copy  
or get a reference) as we move the stack from the submitting  
task_struct to a waiting_task struct as the submission blocks.

We trade initial effort to make things safe in the presence of  
universal sharing for effort to introduce sharing as people notice  
deficient behaviour.  If that's the way we prefer to go, I'm cool  
with that.  I might have gone slightly nuts in preferring *identical*  
sync and async behaviour.

The fast path would look almost identical to the existing fibril  
switch.  We'd just have a few more fields to sync up between the two  
task_structs.

Ingo, am I getting this right?  This sounds pretty straight forward  
to prototype from the current patches.  I can certainly give it a try.

> it's quite cheap to 'flip' it to under any arbitrary user-space context:
> change its thread_info->task pointer to the user-space context's task
> struct, copy the mm pointer, the fs pointer to the "worker thread",
> switch the thread_info, update ptregs - done. Hm?

Or maybe you're talking about having concurrent executing  
thread_info's pointing to the user-space submitting task_struct?   
That really does sound like the current fibril approach, with even  
more sharing of thread_info's that might be executing on other cpus?

Either way, I want to give it a try.  If we can measure it performing  
reasonably in the cached case then I think everyone's happy?

> is not part of the signal set. (Although it might make sense to make
> such async syscalls interruptible, just like any syscall.)

I think we all agree that they have to be interruptible by now,  
right?  If for no other reason than to interrupt pending poll with no  
timeout, say, as the task exits..

> The 'pool' of kernel threads doesn't even have to be per-task, it can be
> a natural per-CPU thing

Yeah, absolutely.

- z

^ permalink raw reply	[flat|nested] 92+ messages in thread

* Re: [PATCH 2 of 4] Introduce i386 fibril scheduling
  2007-02-02 19:59           ` Alan
  2007-02-02 20:14             ` Linus Torvalds
@ 2007-02-05 16:44             ` Zach Brown
  1 sibling, 0 replies; 92+ messages in thread
From: Zach Brown @ 2007-02-05 16:44 UTC (permalink / raw)
  To: Alan
  Cc: Linus Torvalds, Ingo Molnar, linux-kernel, linux-aio,
	Suparna Bhattacharya, Benjamin LaHaise

> Other questions really relate to the scheduling - Zach do you intend
> schedule_fibrils() to be a call code would make or just from  
> schedule() ?

I'd much rather keep the current sleeping API in as much as is  
possible.  So, yeah, if we can get schedule() to notice and behave  
accordingly I'd prefer that.  In the current code it's keyed off  
finding a stack allocation hanging off of current->.  If the caller  
didn't care about guaranteeing non-blocking submission then we  
wouldn't need that.. we could use a thread_info flag bit, or  
something.  Avoiding that allocation in the cached case would be nice.

> Alan (who used to use Co-routines in real languages on 36bit
> computers with 9bit bytes before learning C)

Yes, don't despair, I'm not co-routine ignorant.  In fact, I'm almost  
positive it was you who introduced them to me at some point in the  
previous millennium ;).

- z


^ permalink raw reply	[flat|nested] 92+ messages in thread

* Re: [PATCH 2 of 4] Introduce i386 fibril scheduling
  2007-01-30 20:39 ` [PATCH 2 of 4] Introduce i386 fibril scheduling Zach Brown
  2007-02-01  8:36   ` Ingo Molnar
@ 2007-02-04  5:12   ` Davide Libenzi
  2007-02-05 17:54     ` Zach Brown
  1 sibling, 1 reply; 92+ messages in thread
From: Davide Libenzi @ 2007-02-04  5:12 UTC (permalink / raw)
  To: Zach Brown
  Cc: Linux Kernel Mailing List, linux-aio, Suparna Bhattacharya,
	Benjamin LaHaise, Linus Torvalds

On Tue, 30 Jan 2007, Zach Brown wrote:

> +	/* 
> +	 * XXX The idea is to copy all but the actual call stack.  Obviously
> +	 * this is wildly arch-specific and belongs abstracted out.
> +	 */
> +	*next->ti = *ti;
> +	*thread_info_pt_regs(next->ti) = *thread_info_pt_regs(ti);

arch copy_thread_info()?



> +	current->per_call = next->per_call;

Pointer instead of structure copy? percall_clone()/percall_free()?



> +	/* always switch to a runnable fibril if we aren't being preempted */
> +	if (unlikely(!(preempt_count() & PREEMPT_ACTIVE) &&
> +		     !list_empty(&prev->runnable_fibrils))) {
> +		schedule_fibril(prev);
> +		/* 
> +		 * finish_fibril_switch() drops the rq lock and enables
> +		 * preemption, but the popfl disables interrupts again.  Watch
> +		 * me learn how context switch locking works before your very
> +		 * eyes!  XXX This will need to be fixed up by throwing
> +		 * together something like the prepare_lock_switch() path the
> +		 * scheduler does.  Guidance appreciated!
> +		 */
> +		local_irq_enable();
> +		return;
> +	}

Yes, please (prepare/finish) ... ;)



- Davide



^ permalink raw reply	[flat|nested] 92+ messages in thread

* Re: [PATCH 2 of 4] Introduce i386 fibril scheduling
@ 2007-02-03 14:05 linux
  0 siblings, 0 replies; 92+ messages in thread
From: linux @ 2007-02-03 14:05 UTC (permalink / raw)
  To: linux-kernel; +Cc: linux, mingo, zach.brown

First of all, may I say, this is a wonderful piece of work.
It absolutely reeks of The Right Thing.  Well done!

However, while I need to study it in a lot more detail, I think Ingo's
implementation ideas make a lot more immediate sense.  It's the same
idea that I thought up.

Let me make it concrete.  When you start an async system call:
- Preallocate a second kernel stack, but don't do anything
  with it.  There should probably be a per-CPU pool of
  preallocated threads to avoid too much allocation and
  deallocation.
- Also at this point, do any resource limiting.
- Set the (normally NULL) "thread blocked" hook pointer to
  point to a handler, as explained below.
- Start down the regular system call path.
- In the fast-path case, the system call completes without blocking and
  we set up the completion structure and return to user space.
  We may want to return a special value to user space to tell it that
  there's no need to call asys_await_completion.  I think of it as the
  Amiga's IOF_QUICK.
- Also, when returning, check and clear the thread-blocked hook.

Note that we use one (cache-hot) stack for everything and do as little
setup as possible on the fast path.
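
To make the fast-path contract concrete, the caller side could look
roughly like this (a sketch only: asys_submit(), asys_await_completion(),
the completion structure and the ASYS_DONE_QUICK return value are made-up
names for the interface described above, nothing that exists today):

/* Hypothetical user-space view of the interface sketched above. */
#include <sys/syscall.h>

#define ASYS_DONE_QUICK	1	/* completed synchronously, a la IOF_QUICK */
#define ASYS_PENDING	0	/* went async; completion arrives later */

struct asys_completion {	/* filled in when the call finishes */
	long cookie;
	long ret;
};

extern long asys_submit(unsigned long nr, const long args[6],
			struct asys_completion *comp);
extern void asys_await_completion(struct asys_completion *comp);

static long read_maybe_async(int fd, void *buf, long len)
{
	long args[6] = { fd, (long) buf, len, 0, 0, 0 };
	struct asys_completion comp;

	if (asys_submit(__NR_read, args, &comp) == ASYS_DONE_QUICK)
		return comp.ret;	/* fast path: no wait needed at all */

	/* slow path: the call really went asynchronous */
	asys_await_completion(&comp);
	return comp.ret;
}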

However, if something blocks, it hits the slow path:
- If something would block the thread, the scheduler invokes the
  thread-blocked hook before scheduling a new thread.
- The hook copies the necessary state to a new (preallocated) kernel
  stack, which takes over the original caller's identity, so it can return
  immediately to user space with an "operation in progress" indicator.
- The scheduler hook is also cleared.
- The original thread is blocked.
- The new thread returns to user space and execution continues.

- The original thread completes the system call.  It may block again,
  but as its block hook is now clear, no more scheduler magic happens.

- When the operation completes and returns to sys_sys_submit(), it
  notices that its scheduler hook is no longer set.  Thus, this is a
  kernel-only worker thread, and it fills in the completion structure,
  places itself back in the available pool, and commits suicide.

Now, there is no chance that we will ever implement kernel state machines
for every little ioctl.  However, there may be some "async fast paths"
state machines that we can use.  If we're in a situation where we can
complete the operation without a kernel thread at all, then we can
detect the "would block" case (probably in-line, but you could
use a different scheduler hook function) and set up the state machine
structure.  Then return "operation in progress" and let the I/O
complete in its own good time.

Note that you don't need to implement all of a system call as an explicit
state machine; only its completion.  So, for example, you could do
indirect block lookups via an implicit (stack-based) state machine,
but the final I/O via an explicit one.  And you could do this only for
normal block devices and not NFS.  You only need to convert the hot
paths to the explicit state machine form; the bulk of the kernel code
can use separate kernel threads to do async system calls.

I'm also in the "why do we need fibrils?" camp.  I'm studying the code,
and looking for a reason, but using the existing thread abstraction
seems best.  If you encountered some fundamental reason why kernel threads
were Really Really Hard, then maybe it's worth it, but it's a new entity,
and entia non sunt multiplicanda praeter necessitatem ("entities must not
be multiplied beyond necessity").


One thing you can do for real-time tasks is, in addition to the
non-blocking flag (return EAGAIN from asys_submit rather than blocking),
you could have an "atomic" flag that would avoid blocking to preallocate
the additional kernel thread!  Then you'd really be guaranteed no
long delays, ever.

^ permalink raw reply	[flat|nested] 92+ messages in thread

* Re: [PATCH 2 of 4] Introduce i386 fibril scheduling
  2007-02-03  9:25                     ` Matt Mackall
@ 2007-02-03 10:03                       ` Ingo Molnar
  0 siblings, 0 replies; 92+ messages in thread
From: Ingo Molnar @ 2007-02-03 10:03 UTC (permalink / raw)
  To: Matt Mackall
  Cc: Linus Torvalds, Zach Brown, linux-kernel, linux-aio,
	Suparna Bhattacharya, Benjamin LaHaise


* Matt Mackall <mpm@selenic.com> wrote:

> On Sat, Feb 03, 2007 at 09:23:08AM +0100, Ingo Molnar wrote:
> > The normal and most optimal workflow should be a user-space ring-buffer 
> > of these constant-size struct async_syscall entries:
> > 
> >   struct async_syscall ringbuffer[1024];
> > 
> >   LIST_HEAD(submitted);
> >   LIST_HEAD(pending);
> >   LIST_HEAD(completed);
> 
> It's wrong to call this a ring buffer as things won't be completed in
> any particular order. [...]

yeah, i realized this when i sent the mail. I wanted to say 'array of 
elements' - and it's clear from these list heads that it's fully out of 
order. (it should be an array so that the pages of those entries can be 
pinned and that completion can be manipulated from any context, 
anytime.)

(the queueing i described closely resembles Tux's "Tux syscall request" 
handling scheme.)

> [...] So you'll need a fourth list head for which buffer elements are 
> free. At which point, you might as well leave it entirely up to the 
> application to manage the allocation of async_syscall structs. It may 
> know it only needs two, or ten thousand, or five per client...

sure - it should be variable but still the array should be compact, and 
should be registered with the kernel. That way security checks can be 
done once, the pages can be pinned, accessed anytime, etc.

	Ingo

^ permalink raw reply	[flat|nested] 92+ messages in thread

* Re: [PATCH 2 of 4] Introduce i386 fibril scheduling
  2007-02-03  8:23                   ` Ingo Molnar
@ 2007-02-03  9:25                     ` Matt Mackall
  2007-02-03 10:03                       ` Ingo Molnar
  2007-02-05 17:44                     ` Zach Brown
  1 sibling, 1 reply; 92+ messages in thread
From: Matt Mackall @ 2007-02-03  9:25 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Linus Torvalds, Zach Brown, linux-kernel, linux-aio,
	Suparna Bhattacharya, Benjamin LaHaise

On Sat, Feb 03, 2007 at 09:23:08AM +0100, Ingo Molnar wrote:
> The normal and most optimal workflow should be a user-space ring-buffer 
> of these constant-size struct async_syscall entries:
> 
>   struct async_syscall ringbuffer[1024];
> 
>   LIST_HEAD(submitted);
>   LIST_HEAD(pending);
>   LIST_HEAD(completed);

It's wrong to call this a ring buffer as things won't be completed in
any particular order. So you'll need a fourth list head for which
buffer elements are free. At which point, you might as well leave it
entirely up to the application to manage the allocation of
async_syscall structs. It may know it only needs two, or ten thousand,
or five per client...
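
Something like the following would do on the application side (struct
async_syscall here is just a stand-in for whatever descriptor the
interface ends up defining):

struct async_syscall;			/* kernel-defined descriptor */

struct req_slot {
	struct req_slot *next_free;
	struct async_syscall *call;	/* points into the registered array */
};

static struct req_slot *free_slots;

static struct req_slot *get_slot(void)
{
	struct req_slot *s = free_slots;

	if (s)
		free_slots = s->next_free;
	return s;	/* NULL: the app decides whether to grow or wait */
}

static void put_slot(struct req_slot *s)
{
	s->next_free = free_slots;
	free_slots = s;
}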

-- 
Mathematics is the supreme nostalgia of our time.

^ permalink raw reply	[flat|nested] 92+ messages in thread

* Re: [PATCH 2 of 4] Introduce i386 fibril scheduling
  2007-02-03  0:56                 ` Linus Torvalds
  2007-02-03  7:15                   ` Suparna Bhattacharya
@ 2007-02-03  8:23                   ` Ingo Molnar
  2007-02-03  9:25                     ` Matt Mackall
  2007-02-05 17:44                     ` Zach Brown
  1 sibling, 2 replies; 92+ messages in thread
From: Ingo Molnar @ 2007-02-03  8:23 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Zach Brown, linux-kernel, linux-aio, Suparna Bhattacharya,
	Benjamin LaHaise


* Linus Torvalds <torvalds@linux-foundation.org> wrote:

> On Sat, 3 Feb 2007, Ingo Molnar wrote:
> > 
> > Well, in my picture, 'only if you block' is a pure thread 
> > utilization decision: bounce a piece of work to another thread if 
> > this thread cannot complete it. (if the kernel is lucky enough that 
> > the user context told it "it's fine to do that".)
> 
> Sure, you can do it that way too. But at that point, your argument 
> that we shouldn't do it with fibrils is wrong: you'd still need 
> basically the exact same setup that Zach does in his fibril stuff, and 
> the exact same hook in the scheduler, testing the exact same value 
> ("do we have a pending queue of work").

did i ever lose a single word of complaint about those bits? Those are 
not an issue to me. They can be applied to kernel threads just as much.

As i babbled in the very first email about this topic:

| 1) improve our basic #1 design gradually. If something is a
|    bottleneck, if the scheduler has grown too fat, cut some slack. If 
|    micro-threads or fibrils offer anything nice for our basic thread 
|    model: integrate it into the kernel.

i should have said explicitly that to flip user-space from one kernel 
thread to another one (upon blocking or per request) is a nice thing and 
we should integrate that into the kernel's thread model.

But really, being a scheduler guy i was much more concerned about the 
duplication and problems caused by the fibril concept itself - which 
duplication and complexity makes up 80% of Zach's submitted patchset.
For example this bit:

   [PATCH 3 of 4] Teach paths to wake a specific void * target

would totally go away if we used kernel threads for this. In the fibril 
approach this is where the mess starts. Either a 'normal' wakeup has to 
wake up all fibrils, or we have to make damn sure that a wakeup that in 
reality goes to a fibril is never woken via wake_up/wake_up_process.

( Furthermore, i tried to include user-space micro-threads in the 
  argument as well, which Evgeniy Polyakov raised not so long ago related 
  to the kevent patchset. All these micro-thread things are of a similar 
  genre. )

i totally agree that the API /should/ be the main focus - but i didnt 
pick the topic and most of the patchset's current size is due to the IMO 
avoidable fibril concept.

regarding the API, i dont really agree with the current form and design 
of Zach's interface.

fundamentally, the basic entity of this thing should be a /system call/, 
not the artificial fibril thing:

  +struct asys_call {
  +       struct asys_result      *result;
  +       struct fibril           fibril;
  +};

i.e. the basic entity should be something that represents a system call, 
with its up to 6 arguments, the later return code, state, flags and two 
list entries:

  struct async_syscall {
	unsigned long nr;
	unsigned long args[6];
	long err;
	unsigned long state;
	unsigned long flags;
	struct list_head list;
	struct list_head wait_list;
	unsigned long __pad[2];
  };

(64 bytes on 32-bit, 128 bytes on 64-bit)
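
for illustration, filling in one such entry for an async read() would be 
trivial (sketch only; it assumes the struct above is exported to 
user-space, including user-visible definitions of its list_head members):

  #include <string.h>
  #include <sys/syscall.h>

  static void async_fill_read(struct async_syscall *a, int fd,
                              void *buf, unsigned long len)
  {
          memset(a, 0, sizeof(*a));
          a->nr      = __NR_read;          /* which syscall to execute */
          a->args[0] = fd;
          a->args[1] = (unsigned long) buf;
          a->args[2] = len;
          /* ->err and ->state are owned by the kernel from here on */
  }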

furthermore, i think this API should be fundamentally vectored and 
fundamentally async, and hence could solve another issue as well: 
submitting many little pieces of work of different IO domains in one go.

[ detail: there should be no traditional signals used at all (Zach's 
  stuff doesnt use them, and correctly so), only if the async syscall 
  that is performed generates a signal. ]

The normal and most optimal workflow should be a user-space ring-buffer 
of these constant-size struct async_syscall entries:

  struct async_syscall ringbuffer[1024];

  LIST_HEAD(submitted);
  LIST_HEAD(pending);
  LIST_HEAD(completed);

the 3 list heads are both known to the kernel and to user-space, and are 
actively managed by both. The kernel drives the execution of the async 
system calls based on the 'submitted' list head (until it empties it) 
and moves them over to the 'pending' list. User-space can complete async 
syscalls based on the 'completed' list. (but a syscall can optionally be 
marked as 'autocomplete' as well via the 'flags' field, in that case 
it's not moved to the 'completed' list but simply removed from the 
'pending' list. This can be useful for system calls that have some 
implicit notification effect.)

( Note: optionally, a helper kernel-thread, when it finishes processing 
  a syscall, could also asynchronously check the 'submitted' list and 
  pick up new work. That would allow the submission of new syscalls 
  without any entry into the kernel. So for example on an SMT system, 
  this could in essence result in one CPU running in pure user-space 
  submitting async syscalls via the ringbuffer, while another CPU would
  in essence be running pure kernel-space, executing those entries. )

another crucial bit is the waiting on pending work. But because every 
pending syscall entity is either already completed or has a real kernel 
thread associated with it, that bit is mostly trivial: user-space can 
wait on 'any' pending syscall to complete, or it could wait for a 
specific list of syscalls to complete (using the ->wait_list). It could 
also wait on 'a minimum number of N syscalls to complete' - to create 
batching of execution. And of course it can periodically check the 
'completed' list head if it has a constant and highly parallel flow of 
workload - that way the 'waiting' does not actually have to happen most 
of the time.
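
in code, the harvesting side would be something like this (pure sketch: 
async_wait_min(), completed_pop() and handle_result() are hypothetical 
helpers standing in for the shared-list management described above):

  extern int async_wait_min(int min_complete);       /* block until >= N done */
  extern struct async_syscall *completed_pop(void);  /* detach from 'completed' */
  extern void handle_result(unsigned long nr, long err);

  static void harvest_completions(void)
  {
          struct async_syscall *a;

          async_wait_min(16);              /* batch wakeups, as described */
          while ((a = completed_pop()) != NULL)
                  handle_result(a->nr, a->err);
  }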

Looks like we can hit many birds with this single stone: AIO, vectored 
syscalls, finegrained system-call parallelism. Hm?

	Ingo

^ permalink raw reply	[flat|nested] 92+ messages in thread

* Re: [PATCH 2 of 4] Introduce i386 fibril scheduling
  2007-02-03  0:56                 ` Linus Torvalds
@ 2007-02-03  7:15                   ` Suparna Bhattacharya
  2007-02-03  8:23                   ` Ingo Molnar
  1 sibling, 0 replies; 92+ messages in thread
From: Suparna Bhattacharya @ 2007-02-03  7:15 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Ingo Molnar, Zach Brown, linux-kernel, linux-aio, Benjamin LaHaise

On Fri, Feb 02, 2007 at 04:56:22PM -0800, Linus Torvalds wrote:
> 
> On Sat, 3 Feb 2007, Ingo Molnar wrote:
> > 
> > Well, in my picture, 'only if you block' is a pure thread utilization 
> > decision: bounce a piece of work to another thread if this thread cannot 
> > complete it. (if the kernel is lucky enough that the user context told 
> > it "it's fine to do that".)
> 
> Sure, you can do it that way too. But at that point, your argument that we 
> shouldn't do it with fibrils is wrong: you'd still need basically the 
> exact same setup that Zach does in his fibril stuff, and the exact same 
> hook in the scheduler, testing the exact same value ("do we have a pending 
> queue of work").
> 
> So at that point, you really are arguing about a rather small detail in 
> the implementation, I think.
> 
> Which is fair enough. 
> 
> But I actually think the *bigger* argument and problems are elsewhere, 
> namely in the interface details. Notably, I think the *real* issues end up 
> how we handle synchronization, and how we handle signalling. Those are in 
> many ways (I think) more important than whether we actually can schedule 
> these trivial things on multiple CPU's concurrently or not.
> 
> For example, I think serialization is potentially a much more expensive 
> issue. Could we, for example, allow users to serialize with these things 
> *without* having to go through the expense of doing a system call? Again, 
> I'm thinking of the case of no IO happening, in which case there also 
> won't be any actual threading taking place, in which case it's a total 
> waste of time to do a system call at all.
> 
> And trying to do that actually has implications for the interfaces (like 
> possibly returning a zero cookie for the async() system call if it was 
> doable totally synchronously?)

This would be useful - the application wouldn't have to set up state
to remember for handling completions of operations that complete synchronously.
I know the Samba folks would like that.

The laio_syscall implementation (Lazy asynchronous IO) seems to have
experimented with such an interface
http://www.usenix.org/events/usenix04/tech/general/elmeleegy.html

Regards
Suparna

> 
> Signal handling is similar: I actually think that a "async()" system call 
> should be interruptible within the context of the caller, since we would 
> want to *try* to execute it synchronously. That automatically means that 
> we have semantic meaning for fibrils and signal handling.
> 
> Finally, can we actually get POSIX aio semantics with this? Can we 
> implement the current aio_xyzzy() system calls using this same feature? 
> And most importantly - does it perform well enough that we really can do 
> that?
> 
> THOSE are to me bigger questions than what happens inside the kernel, and 
> whether we actually end up using another thread if we end up doing it 
> non-synchronously.
> 
> 					Linus
> 
> --
> To unsubscribe, send a message with 'unsubscribe linux-aio' in
> the body to majordomo@kvack.org.  For more info on Linux AIO,
> see: http://www.kvack.org/aio/
> Don't email: <a href=mailto:"aart@kvack.org">aart@kvack.org</a>

-- 
Suparna Bhattacharya (suparna@in.ibm.com)
Linux Technology Center
IBM Software Lab, India


^ permalink raw reply	[flat|nested] 92+ messages in thread

* Re: [PATCH 2 of 4] Introduce i386 fibril scheduling
  2007-02-02 23:55               ` Ingo Molnar
@ 2007-02-03  0:56                 ` Linus Torvalds
  2007-02-03  7:15                   ` Suparna Bhattacharya
  2007-02-03  8:23                   ` Ingo Molnar
  0 siblings, 2 replies; 92+ messages in thread
From: Linus Torvalds @ 2007-02-03  0:56 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Zach Brown, linux-kernel, linux-aio, Suparna Bhattacharya,
	Benjamin LaHaise



On Sat, 3 Feb 2007, Ingo Molnar wrote:
> 
> Well, in my picture, 'only if you block' is a pure thread utilization 
> decision: bounce a piece of work to another thread if this thread cannot 
> complete it. (if the kernel is lucky enough that the user context told 
> it "it's fine to do that".)

Sure, you can do it that way too. But at that point, your argument that we 
shouldn't do it with fibrils is wrong: you'd still need basically the 
exact same setup that Zach does in his fibril stuff, and the exact same 
hook in the scheduler, testing the exact same value ("do we have a pending 
queue of work").

So at that point, you really are arguing about a rather small detail in 
the implementation, I think.

Which is fair enough. 

But I actually think the *bigger* argument and problems are elsewhere, 
namely in the interface details. Notably, I think the *real* issues end up 
how we handle synchronization, and how we handle signalling. Those are in 
many ways (I think) more important than whether we actually can schedule 
these trivial things on multiple CPU's concurrently or not.

For example, I think serialization is potentially a much more expensive 
issue. Could we, for example, allow users to serialize with these things 
*without* having to go through the expense of doing a system call? Again, 
I'm thinking of the case of no IO happening, in which case there also 
won't be any actual threading taking place, in which case it's a total 
waste of time to do a system call at all.

And trying to do that actually has implications for the interfaces (like 
possibly returning a zero cookie for the async() system call if it was 
doable totally synchronously?)

Signal handling is similar: I actually think that a "async()" system call 
should be interruptible within the context of the caller, since we would 
want to *try* to execute it synchronously. That automatically means that 
we have semantic meaning for fibrils and signal handling.

Finally, can we actually get POSIX aio semantics with this? Can we 
implement the current aio_xyzzy() system calls using this same feature? 
And most importantly - does it perform well enough that we really can do 
that?

THOSE are to me bigger questions than what happens inside the kernel, and 
whether we actually end up using another thread if we end up doing it 
non-synchronously.

					Linus

^ permalink raw reply	[flat|nested] 92+ messages in thread

* Re: [PATCH 2 of 4] Introduce i386 fibril scheduling
  2007-02-02 23:17                       ` Linus Torvalds
  2007-02-03  0:04                         ` Alan
@ 2007-02-03  0:23                         ` bert hubert
  1 sibling, 0 replies; 92+ messages in thread
From: bert hubert @ 2007-02-03  0:23 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Ingo Molnar, Alan, Zach Brown, linux-kernel, linux-aio,
	Suparna Bhattacharya, Benjamin LaHaise

On Fri, Feb 02, 2007 at 03:17:57PM -0800, Linus Torvalds wrote:

> threads. But you need to look at what it is we parallelize here, and ask 
> yourself why we're doing what we're doing, and why people aren't *already* 
> just using a separate thread for it.

Partially this is for the bad reason that creating "i/o threads" (or even
processes) has a bad stigma to it, and additionally has always felt crummy.

On the first reason, the 'pain' of creating threads is actually rather
minor, so this feeling may have been wrong. The main thing is that you don't
wantonly create a thousand i/o threads, whereas you conceivably might want
to have a thousand outstanding i/o requests. At least I know I want to have
that ability.

Secondly, the actual mechanics of i/o processes aren't trivial, and feel
wasteful with lots of additional copying, or in the case of threads,
queueing and posting.

	Bert

-- 
http://www.PowerDNS.com      Open source, database driven DNS Software 
http://netherlabs.nl              Open and Closed source services

^ permalink raw reply	[flat|nested] 92+ messages in thread

* Re: [PATCH 2 of 4] Introduce i386 fibril scheduling
  2007-02-02 23:17                       ` Linus Torvalds
@ 2007-02-03  0:04                         ` Alan
  2007-02-03  0:23                         ` bert hubert
  1 sibling, 0 replies; 92+ messages in thread
From: Alan @ 2007-02-03  0:04 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Ingo Molnar, Zach Brown, linux-kernel, linux-aio,
	Suparna Bhattacharya, Benjamin LaHaise

> When parallelising "real work", I absolutely agree with you: we should use 
> threads. But you need to look at what it is we parallelize here, and ask 
> yourself why we're doing what we're doing, and why people aren't *already* 
> just using a separate thread for it.

Because it's a pain in the arse and because it's very hard to self-tune. If
you've got async_anything then the thread/fibril/synchronous/whatever
decision can be made kernel side based upon expected cost and other
tradeoffs, even if its as dumb as per syscall or per syscall/filp type
guessing.

Alan

^ permalink raw reply	[flat|nested] 92+ messages in thread

* Re: [PATCH 2 of 4] Introduce i386 fibril scheduling
  2007-02-02 23:37             ` Davide Libenzi
@ 2007-02-03  0:02               ` Davide Libenzi
  2007-02-05 17:12               ` Zach Brown
  2007-02-05 21:36               ` bert hubert
  2 siblings, 0 replies; 92+ messages in thread
From: Davide Libenzi @ 2007-02-03  0:02 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Linus Torvalds, Zach Brown, Linux Kernel Mailing List, linux-aio,
	Suparna Bhattacharya, Benjamin LaHaise

On Fri, 2 Feb 2007, Davide Libenzi wrote:

> On Fri, 2 Feb 2007, Ingo Molnar wrote:
> 
> > in fact an app could also /trigger/ the execution of a syscall in a 
> > different context - to create parallelism artificially - without any 
> > blocking event. So we could do:
> > 
> >   cookie1 = sys_async(sys_read, params);
> >   cookie2 = sys_async(sys_write, params);
> > 
> >   [ ... calculation loop ... ]
> > 
> >   wait_on_async_syscall(cookie1);
> >   wait_on_async_syscall(cookie2);
> > 
> > or something like that. Without user-space having to create threads 
> > itself, etc. So basically, we'd make kernel threads more useful, and 
> > we'd make threading safer - by only letting syscalls thread.
> 
> Since I still think that the many-thousands potential async operations 
> coming from network sockets are better handled with a classical event 
> machanism [1], and since smooth integration of new async syscall into the 
> standard POSIX infrastructure is IMO a huge win, I think we need to have a 
> "bridge" to allow async completions being detectable through a pollable 
> (by the mean of select/poll/epoll whatever) device.
> In that way you can handle async operations with the best mechanism that 
> is fit for them, and gather them in a single async scheduler.

To clarify further, below are the API and the use case of my userspace 
implementation. The guasi_fd() gives you back a pollable (POLLIN) fd to be 
integrated in your preferred event retrieval interface. Once the fd is 
signaled, you can fetch your completed requests using guasi_fetch() and 
schedule work based on that.
The GUASI implementation uses pthreads, but it is clear that an in-kernel 
async syscall implementation can take wiser decisions, and optimize the 
heck out of it (locks, queues, ...).




- Davide



/*
 * Example of async pread using GUASI
 */
static long guasi_wrap__pread(void *priv, long const *params) {

        return (long) pread((int) params[0], (void *) params[1],
                            (size_t) params[2], (off_t) params[3]);
}
 
guasi_req_t guasi__pread(guasi_t hctx, void *priv, void *asid, int prio,
                         int fd, void *buf, size_t size, off_t off) {

        return guasi_submit(hctx, priv, asid, prio, guasi_wrap__pread, 4,
                            (long) fd, (long) buf, (long) size, (long) off);
}


---
/*
 *  guasi by Davide Libenzi (generic userspace async syscall implementation)
 *  Copyright (C) 2003  Davide Libenzi
 *
 *  This program is free software; you can redistribute it and/or modify
 *  it under the terms of the GNU General Public License as published by
 *  the Free Software Foundation; either version 2 of the License, or
 *  (at your option) any later version.
 *
 *  This program is distributed in the hope that it will be useful,
 *  but WITHOUT ANY WARRANTY; without even the implied warranty of
 *  MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
 *  GNU General Public License for more details.
 *
 *  You should have received a copy of the GNU General Public License
 *  along with this program; if not, write to the Free Software
 *  Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA  02111-1307  USA
 *
 *  Davide Libenzi <davidel@xmailserver.org>
 *
 */

#if !defined(_GUASI_H)
#define _GUASI_H


#define GUASI_MAX_PARAMS 16

#define GUASI_STATUS_PENDING 1
#define GUASI_STATUS_ACTIVE 2
#define GUASI_STATUS_COMPLETE 3


typedef long (*guasi_syscall_t)(void *, long const *);

typedef struct s_guasi { } *guasi_t;
typedef struct s_guasi_req { } *guasi_req_t;

struct guasi_reqinfo {
	void *priv;   /* Call private data. Passed to guasi_submit */
	void *asid;   /* Async request ID. Passed to guasi_submit */
	long result;  /* Return code of "proc" passed to guasi_submit */
	long error;   /* errno */
	int status;   /* GUASI_STATUS_* */
};



guasi_t guasi_create(int min_threads, int max_threads, int max_priority);
void guasi_free(guasi_t hctx);
int guasi_fd(guasi_t hctx);
guasi_req_t guasi_submit(guasi_t hctx, void *priv, void *asid, int prio,
			 guasi_syscall_t proc, int nparams, ...);
int guasi_fetch(guasi_t hctx, guasi_req_t *reqs, int nreqs);
int guasi_req_info(guasi_req_t hreq, struct guasi_reqinfo *rinf);
void guasi_req_free(guasi_req_t hreq);

#endif
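
For completeness, here is roughly how the pieces plug together (untested
sketch; it assumes the guasi__pread() wrapper from the example above is in
scope, and that guasi_fetch() returns the number of completed requests it
dequeued):

/* Example: one async pread driven through the pollable completion fd. */
#include <stdio.h>
#include <fcntl.h>
#include <poll.h>
#include <sys/types.h>
#include "guasi.h"

extern guasi_req_t guasi__pread(guasi_t hctx, void *priv, void *asid, int prio,
                                int fd, void *buf, size_t size, off_t off);

int main(void)
{
	guasi_t ctx = guasi_create(2, 8, 1);
	int fd = open("/etc/hosts", O_RDONLY);
	char buf[256];
	guasi_req_t reqs[8];
	struct guasi_reqinfo inf;
	struct pollfd pfd;
	int i, n;

	guasi__pread(ctx, NULL, NULL, 0, fd, buf, sizeof(buf), 0);

	pfd.fd = guasi_fd(ctx);		/* pollable completion fd */
	pfd.events = POLLIN;
	poll(&pfd, 1, -1);		/* select/epoll would work just as well */

	n = guasi_fetch(ctx, reqs, 8);
	for (i = 0; i < n; i++) {
		guasi_req_info(reqs[i], &inf);
		printf("result=%ld status=%d\n", inf.result, inf.status);
		guasi_req_free(reqs[i]);
	}
	guasi_free(ctx);
	return 0;
}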



^ permalink raw reply	[flat|nested] 92+ messages in thread

* Re: [PATCH 2 of 4] Introduce i386 fibril scheduling
  2007-02-02 22:49             ` Linus Torvalds
@ 2007-02-02 23:55               ` Ingo Molnar
  2007-02-03  0:56                 ` Linus Torvalds
  0 siblings, 1 reply; 92+ messages in thread
From: Ingo Molnar @ 2007-02-02 23:55 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Zach Brown, linux-kernel, linux-aio, Suparna Bhattacharya,
	Benjamin LaHaise


* Linus Torvalds <torvalds@linux-foundation.org> wrote:

> THAT'S THE POINT. That's what makes fibrils cooperative. The "only if 
> you block" is really what makes a fibril be something else than a 
> regular thread.

Well, in my picture, 'only if you block' is a pure thread utilization 
decision: bounce a piece of work to another thread if this thread cannot 
complete it. (if the kernel is lucky enough that the user context told 
it "it's fine to do that".)

it is 'incidental parallelism' instead of 'intentional parallelism', but 
the random and unpredictable nature of it doesnt change anything about 
the fundamental fact: we start a new thread of execution in essence.

Typically it will be rare in a workload as it will be driven by 
cachemisses, but for example in DB workloads the 'cachemiss' will be the 
/common case/ - because the DB manages the cache itself.

And how to run a thread of execution is a fundamental /scheduling/ 
decision: it is the acceptance of and the adoption to the cost of work 
migration - if no forced wait happens then often it's cheaper to execute 
all work locally and serially.

[ in fact, such a mechanism doesnt even always have to be driven from
  the scheduler itself: such a 'bounce current work to another thread'
  event could occur when we detect that a pagecache page is missing and
  that we have to do a ->readpage, etc. Tux has done that since 1999: the
  cutoff for 'bounce work' was when a soft cache (the pagecache or the
  dentry cache) was missed - not when we went into the IO path. This has
  the advantage that the Tux cachemiss threads could do /all/ the IO
  preparation and IO completion on the same CPU and in one go - while
  the user context was able to continue executing. ]

But this is also a function of hardware: for example on a Transputer i'd 
bounce off all such work immediately (even if it's a sys_time() 
syscall), all the time, even if fully cached, no exceptions, because the 
hardware is such that another CPU can pick it up in the next cycle.

while we definitely dont want to bounce short-lived cached syscalls to 
another thread, for longer ones or ones which we /expect/ to block we 
might want to do it like that straight away. [Especially on a multi-core 
CPU that has a shared L2 cache (and doubly so on a HT/SMT CPU that has a 
shared L1 cache).]

i dont see anything here that mandates (or even strongly supports) the 
notion of cooperative scheduling. The moment a context sees a 'cache 
miss', it is totally fair to potentially distribute it to other CPUs. It 
wont run for a long time and it will be totally cache-cold when the 'IO 
done' event occurs - hence we should schedule it where the IO event 
occured. Which might easily be the same CPU where the user context is 
running right now (we prefer last-run CPUs on wakeups), but not 
necessarily - it's a general scheduling decision.

> > Note: such a 'flip' would only occur when the original context 
> > blocks, /not/ on every async syscall.
> 
> Right.
> 
> So can you take a look at Zach's fibril idea again? Because that's 
> exactly what it does. It basically sets a flag, saying "flip to this 
> when you block or yield". Of course, it's a bit bigger than just a 
> flag, since it needs to describe what to flip to, but that's the basic 
> idea.

i know Zach's code ... i really do. Even if i didnt look at the code 
(which i did), Jonathon Corbet did a very nice writeup about fibrils on 
LWN.net two days ago, which i've read as well:

  http://lwn.net/Articles/219954/

So there's no misunderstanding on my side i think.

> Now, if you want to make fibrils *also* then actually use a separate 
> thread, that's an extension.

oh please, Linus. I /did/ suggest this as an extension to Zach's idea! 
Look at the Subject line - i'm reacting to the specific fibril code of 
Zach. I wrote this:

| as per my other email, i dont really like this concept. This is the 
| killer:
|
| > [...]  There can be multiple of them in the process of executing for 
| > a given task_struct, but only one can every be actively running at a 
| > time. [...]
|
| there's almost no scheduling cost from being able to arbitrarily 
| schedule a kernel thread - but there are /huge/ benefits in it.
|
| would it be hard to redo your AIO patches based on a pool of plain 
| simple kernel threads?

see http://lkml.org/lkml/2007/2/1/40.

	Ingo

^ permalink raw reply	[flat|nested] 92+ messages in thread

* Re: [PATCH 2 of 4] Introduce i386 fibril scheduling
  2007-02-02 22:21           ` Ingo Molnar
  2007-02-02 22:49             ` Linus Torvalds
@ 2007-02-02 23:37             ` Davide Libenzi
  2007-02-03  0:02               ` Davide Libenzi
                                 ` (2 more replies)
  2007-02-05 17:02             ` Zach Brown
  2 siblings, 3 replies; 92+ messages in thread
From: Davide Libenzi @ 2007-02-02 23:37 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Linus Torvalds, Zach Brown, Linux Kernel Mailing List, linux-aio,
	Suparna Bhattacharya, Benjamin LaHaise

On Fri, 2 Feb 2007, Ingo Molnar wrote:

> in fact an app could also /trigger/ the execution of a syscall in a 
> different context - to create parallelism artificially - without any 
> blocking event. So we could do:
> 
>   cookie1 = sys_async(sys_read, params);
>   cookie2 = sys_async(sys_write, params);
> 
>   [ ... calculation loop ... ]
> 
>   wait_on_async_syscall(cookie1);
>   wait_on_async_syscall(cookie2);
> 
> or something like that. Without user-space having to create threads 
> itself, etc. So basically, we'd make kernel threads more useful, and 
> we'd make threading safer - by only letting syscalls thread.

Since I still think that the many-thousands potential async operations 
coming from network sockets are better handled with a classical event 
mechanism [1], and since smooth integration of new async syscalls into the 
standard POSIX infrastructure is IMO a huge win, I think we need to have a 
"bridge" to allow async completions being detectable through a pollable 
(by means of select/poll/epoll, whatever) device.
In that way you can handle async operations with the best mechanism that 
is fit for them, and gather them in a single async scheduler.



[1] Unless you really want to have thousands of kthreads/fibrils lingering 
    on the system.
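
Just to show how little the application side would need: with such a
bridge, hooking async completions into an existing event loop is one
epoll_ctl() away (async_fd below is the hypothetical pollable device fd):

#include <sys/epoll.h>

static int add_async_bridge(int epfd, int async_fd)
{
	struct epoll_event ev;

	ev.events = EPOLLIN;		/* readable == async completions pending */
	ev.data.fd = async_fd;
	return epoll_ctl(epfd, EPOLL_CTL_ADD, async_fd, &ev);
}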



- Davide



^ permalink raw reply	[flat|nested] 92+ messages in thread

* Re: [PATCH 2 of 4] Introduce i386 fibril scheduling
  2007-02-02 23:01                     ` Linus Torvalds
@ 2007-02-02 23:17                       ` Linus Torvalds
  2007-02-03  0:04                         ` Alan
  2007-02-03  0:23                         ` bert hubert
  0 siblings, 2 replies; 92+ messages in thread
From: Linus Torvalds @ 2007-02-02 23:17 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Alan, Zach Brown, linux-kernel, linux-aio, Suparna Bhattacharya,
	Benjamin LaHaise



On Fri, 2 Feb 2007, Linus Torvalds wrote: 
> On Fri, 2 Feb 2007, Ingo Molnar wrote:
> > 
> > but if the application /has/ identified fundamental parallelism, we 
> > /must not/ shut that parallelism off by /designing/ this interface to 
> > use the fibril thing which is a limited cooperative, single-CPU entity. 
> 
> Right. We should for example encourage people to use some kind of 
> paralellizing construct. 
> 
> I know! We could even call them "threads", so to give people the idea that 
> they are independent smaller entities in a thicker "rope", and we could 
> call that bigger entity a "task" or "process", since it "processes" data.
> 
> Or is that just too far out?

So the above was obviously tongue-in-cheek, but you should really think 
about the context here.

We're discussing doing *single* system calls. There is absolutely zero 
point to try to parallelize the work over multiple CPU's or threads. We're 
literally talking about doing things where the actual CPU cost is in the 
hundreds of nanoseconds, and where traditionally a rather noticeable part 
of the cost is not the code itself, but the high cost of taking a system 
call trap, and saving all the register state.

When parallelising "real work", I absolutely agree with you: we should use 
threads. But you need to look at what it is we parallelize here, and ask 
yourself why we're doing what we're doing, and why people aren't *already* 
just using a separate thread for it.

			Linus

^ permalink raw reply	[flat|nested] 92+ messages in thread

* Re: [PATCH 2 of 4] Introduce i386 fibril scheduling
  2007-02-02 22:42                   ` Ingo Molnar
@ 2007-02-02 23:01                     ` Linus Torvalds
  2007-02-02 23:17                       ` Linus Torvalds
  0 siblings, 1 reply; 92+ messages in thread
From: Linus Torvalds @ 2007-02-02 23:01 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Alan, Zach Brown, linux-kernel, linux-aio, Suparna Bhattacharya,
	Benjamin LaHaise



On Fri, 2 Feb 2007, Ingo Molnar wrote:
> 
> but if the application /has/ identified fundamental parallelism, we 
> /must not/ shut that parallelism off by /designing/ this interface to 
> use the fibril thing which is a limited cooperative, single-CPU entity. 

Right. We should for example encourage people to use some kind of 
paralellizing construct. 

I know! We could even call them "threads", so to give people the idea that 
they are independent smaller entities in a thicker "rope", and we could 
call that bigger entity a "task" or "process", since it "processes" data.

Or is that just too far out?

			Linus

^ permalink raw reply	[flat|nested] 92+ messages in thread

* Re: [PATCH 2 of 4] Introduce i386 fibril scheduling
  2007-02-02 22:21           ` Ingo Molnar
@ 2007-02-02 22:49             ` Linus Torvalds
  2007-02-02 23:55               ` Ingo Molnar
  2007-02-02 23:37             ` Davide Libenzi
  2007-02-05 17:02             ` Zach Brown
  2 siblings, 1 reply; 92+ messages in thread
From: Linus Torvalds @ 2007-02-02 22:49 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Zach Brown, linux-kernel, linux-aio, Suparna Bhattacharya,
	Benjamin LaHaise



On Fri, 2 Feb 2007, Ingo Molnar wrote:
> 
> Note: such a 'flip' would only occur when the original context blocks, 
> /not/ on every async syscall.

Right.

So can you take a look at Zach's fibril idea again? Because that's exactly 
what it does. It basically sets a flag, saying "flip to this when you 
block or yield". Of course, it's a bit bigger than just a flag, since it 
needs to describe what to flip to, but that's the basic idea.

Now, if you want to make fibrils *also* then actually use a separate 
thread, that's an extension. But you were arguing as if they should use 
threads to begin with, and that sounds stupid. Now you seem to retract it, 
since you say "only if you need to block".

THAT'S THE POINT. That's what makes fibrils cooperative. The "only if you 
block" is really what makes a fibril be something else than a regular 
thread. 

		Linus

^ permalink raw reply	[flat|nested] 92+ messages in thread

* Re: [PATCH 2 of 4] Introduce i386 fibril scheduling
  2007-02-02 21:30                 ` Linus Torvalds
  2007-02-02 22:42                   ` Ingo Molnar
@ 2007-02-02 22:48                   ` Alan
  1 sibling, 0 replies; 92+ messages in thread
From: Alan @ 2007-02-02 22:48 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Ingo Molnar, Zach Brown, linux-kernel, linux-aio,
	Suparna Bhattacharya, Benjamin LaHaise

> > The brown and sticky will hit the rotating air impeller pretty hard if you
> > are not very careful about how that ends up scheduled
> 
> Why do you think that?
> 
> With cooperative scheduling (like the example Zach posted), there is 
> absolutely no "brown and sticky" wrt any CPU usage. Which is why 
> cooperative scheduling is a *good* thing. If you want to blow up your 
> 1024-node CPU cluster, you'd do it with "real threads".

You end up with a lot more things running asynchronously. In the current
world we see a series of requests for attributes and hopefully we do
readahead and all is neatly ordered. If fibrils are not ordered the same
way then we could make it worse as we might not pick the right readahead
for example.

> Since we'd only create fibrils on a system call entry level, and system 
> calls are independent, how would you do that anyway?

If we stick to that limit it ought to be ok. We've been busy slapping
people who call sys_*, except for internal magic like kernel_thread

^ permalink raw reply	[flat|nested] 92+ messages in thread

* Re: [PATCH 2 of 4] Introduce i386 fibril scheduling
  2007-02-02 21:30                 ` Linus Torvalds
@ 2007-02-02 22:42                   ` Ingo Molnar
  2007-02-02 23:01                     ` Linus Torvalds
  2007-02-02 22:48                   ` Alan
  1 sibling, 1 reply; 92+ messages in thread
From: Ingo Molnar @ 2007-02-02 22:42 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Alan, Zach Brown, linux-kernel, linux-aio, Suparna Bhattacharya,
	Benjamin LaHaise


* Linus Torvalds <torvalds@linux-foundation.org> wrote:

> With cooperative scheduling (like the example Zach posted), there is 
> absolutely no "brown and sticky" wrt any CPU usage. Which is why 
> cooperative scheduling is a *good* thing. If you want to blow up your 
> 1024-node CPU cluster, you'd do it with "real threads".

i'm not worried about the 1024-node cluster case.

i also fully agree that in some cases /not/ going parallel and having a 
cooperative relationship between execution contexts can be good.

but if the application /has/ identified fundamental parallelism, we 
/must not/ shut that parallelism off by /designing/ this interface to 
use the fibril thing which is a limited cooperative, single-CPU entity. 
I cannot over-emphasise how wrong that feels to me. Cooperativeness 
isnt bad, but it should be an /optional/ thing, not hardcoded into the 
design!

If the application tells us: "gee, you can execute this syscall in 
parallel!" (which AIO /is/ about after all), and if we have idle 
cores/hardware-threads nearby, it would be the worst thing to not 
execute that in parallel if the syscall blocks or if the app asks for 
that syscall to be executed in parallel right away, even in the cached 
case.

if we were in the 1.2 days i might agree that fibrils are perhaps easier 
on the kernel, but today the Linux kernel doesnt even use this 
cooperativeness anywhere. We have all the hard work done already. The 
full kernel is threaded. We can execute arbitrary number of kernel 
contexts off a single user context, we can execute parallel syscalls and 
we scale very well doing so.

all that is needed is this new facility and some scheduler hacking to 
enable "transparent, kernel-side threading". That enables AIO, 
coroutines and more. It brings threading to a whole new level, because 
it makes it readily and gradually accessible to single-threaded apps 
too.

[ and if we are worried about the 1024 CPU cluster (or about memory use) 
  then we could limit such threads to only overlap in a limited number, 
  etc. Just like we'd have to do with fibrils anyway. But with fibrils 
  we /force/ single-threadedness, which, i'm quite sure, is just about 
  the worst thing we can do. ]

	Ingo

^ permalink raw reply	[flat|nested] 92+ messages in thread

* Re: [PATCH 2 of 4] Introduce i386 fibril scheduling
  2007-02-02 15:56         ` Linus Torvalds
  2007-02-02 19:59           ` Alan
@ 2007-02-02 22:21           ` Ingo Molnar
  2007-02-02 22:49             ` Linus Torvalds
                               ` (2 more replies)
  1 sibling, 3 replies; 92+ messages in thread
From: Ingo Molnar @ 2007-02-02 22:21 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Zach Brown, linux-kernel, linux-aio, Suparna Bhattacharya,
	Benjamin LaHaise


* Linus Torvalds <torvalds@linux-foundation.org> wrote:

> On Fri, 2 Feb 2007, Ingo Molnar wrote:
> > 
> > My only suggestion was to have a couple of transparent kernel threads 
> > (not fibrils) attached to a user context that does asynchronous 
> > syscalls! Those kernel threads would be 'switched in' if the current 
> > user-space thread blocks - so instead of having to 'create' any of them 
> > - the fast path would be to /switch/ them to under the current 
> > user-space, so that user-space processing can continue under that other 
> > thread!
> 
> But in that case, you really do end up with "fibrils" anyway.
> 
> Because those fibrils are what would be the state for the blocked 
> system calls when they aren't scheduled.
> 
> We may have a few hundred thousand system calls a second (maybe that's 
> not actually reasonable, but it should be what we *aim* for), and 99% 
> of them will hopefully hit the cache and never need any separate IO, 
> but even if it's just 1%, we're talking about thousands of threads.
> 
> I do _not_ think that it's reasonable to have thousands of threads 
> state around just "in case". Especially if all those threadlets are 
> then involved in signals etc - something that they are totally 
> uninterested in.
> 
> I think it's a lot more reasonable to have just the kernel stack page 
> for "this was where I was when I blocked". IOW, a fibril-like thing. 

ok, i think i noticed another misunderstanding. The kernel thread based 
scheme i'm suggesting would /not/ 'switch' to another kernel thread in 
the cached case, by default. It would just execute in the original 
context (as if it were a synchronous syscall), and the switch to a 
kernel thread from the pool would only occur /if/ the context is about 
to block. (this 'switch' thing would be done by the scheduler) 
User-space gets back an -EAIO error code immediately and transparently - 
but already running under the new kernel thread.

i.e. in the fully cached case there would be no scheduling at all - in 
fact no thread pool is needed at all.

regarding cost:

the biggest memory resource cost of a kernel thread (assuming it has no 
real user-space context) /is/ its kernel stack page, which is 4K or 8K. 
The task struct takes ~1.5K. Once we have a ready kernel thread around, 
it's quite cheap to 'flip' it to under any arbitrary user-space context: 
change its thread_info->task pointer to the user-space context's task 
struct, copy the mm pointer, the fs pointer to the "worker thread", 
switch the thread_info, update ptregs - done. Hm?

Note: such a 'flip' would only occur when the original context blocks, 
/not/ on every async syscall.

regarding CPU resource costs, i dont think there should be significant 
signal overhead, because the original task is still only one instance, 
and the kernel thread that is now running with the blocked kernel stack 
is not part of the signal set. (Although it might make sense to make 
such async syscalls interruptible, just like any syscall.)

The 'pool' of kernel threads doesnt even have to be per-task, it can be 
a natural per-CPU thing - and its size will grow/shrink [with a low 
update frequency] depending on how much AIO parallelism there is in the 
workload. (But it can also be strictly per-user-context - to make sure 
that a proper ->mm ->fs, etc. is set up and that when the async system 
calls execute they have all the right context info.)

and note the immediate scheduling benefits: if an app (say like 
OpenOffice) is single-threaded but has certain common ops coded as async 
syscalls, then if any of those syscalls blocks then it could utilize 
/more than one/ CPU. I.e. we could 'spread' a single-threaded app's 
processing to multiple cores/hardware-threads /without/ having to 
multi-thread the app in an intrusive way. I.e. this would be a 
finegrained threading of syscalls, executed as coroutines in essence. 
With fibrils all sorts of scheduling limitations occur and no 
parallelism is possible.

in fact an app could also /trigger/ the execution of a syscall in a 
different context - to create parallelism artificially - without any 
blocking event. So we could do:

  cookie1 = sys_async(sys_read, params);
  cookie2 = sys_async(sys_write, params);

  [ ... calculation loop ... ]

  wait_on_async_syscall(cookie1);
  wait_on_async_syscall(cookie2);

or something like that. Without user-space having to create threads 
itself, etc. So basically, we'd make kernel threads more useful, and 
we'd make threading safer - by only letting syscalls thread.

> What I like about fibrils is that they should be able to handle the 
> cached case well: the case where no "real" scheduling (just the fibril 
> stack switches) takes place.

the cached case (when a system call would not block at all) would not 
necessiate any switch to another kernel thread at all - the task just 
executes its system call as if it were synchronous!

that's the nice thing: we can do this switch-to-another-kernel-thread 
magic thing right in the scheduler when we block - and the switched-to 
thread will magically return to user-space (with a -EAIO return code) as 
if nothing happened (while the original task blocks). I.e. under this 
scheme i'm suggesting we have /zero/ setup cost in the cached case. The 
optimistic case just falls through and switches to nothing else. Any 
switching cost only occurs in the slowpath - and even that cost is very 
low.

once a kernel thread that ran off with the original stack finishes the 
async syscall and wants to return the return code, this can be gathered 
via a special return-code ringbuffer that notifies finished syscalls. (A 
magic cookie is associated to every async syscall.)

> So the state machine argument is totally bogus - it results in a 
> programming model that simply doesn't match the *normal* setup. You 
> want the kernel programming model to appear "linear" even when it 
> isn't, because it's too damn hard to think nonlinearly.
>
> Yes, we could do pathname lookup with that kind of insane setup too. 
> But it would be HORRID!

yeah, but i guess not nearly as horrid as writing a new OS from scratch 
;-)

seriously, i very much think and agree that programming state machines 
is hard and not desired in most of the kernel. But it can be done, and 
sometimes (definitely not in the common case) it's /cleaner/ than 
functional programming. I've programmed an HTTP and an FTP in-kernel 
server via a state machine and it worked better than i initially 
expected. It needs different thinking but there /are/ people around with 
that kind of thinking, so we just cannot exclude the possibility. [ It's 
just that such people usually dedicate their brain to mental 
  fantasies^H^H^Hexercises called 'Higher Mathematics' :-) ]

> [...] The fact that networking (and TCP in particular) has state 
> machines is because it is a packetized environment.

rough ballpark figures: for things like webserving or fileserving (or 
mailserving), networking sockets are the reason for context-blocking 
events in 90% of the cases (mostly due to networking latency). 9% of the 
blocking happens due to plain block IO, and 1% happens due to VFS 
metadata (inode, directory, etc.) blocking.

( in Tux i had to handle /all/ of these sources of blocking because even
  1% kills your performance if you do a hundred thousand requests per
  second - but in terms of design weight, networking is pretty damn
  important. )

and interestingly, modern IO frameworks tend to gravitate towards a 
packetized environment as well. I.e. i dont think state machines are 
/that/ unimportant.

	Ingo

^ permalink raw reply	[flat|nested] 92+ messages in thread

* Re: [PATCH 2 of 4] Introduce i386 fibril scheduling
  2007-02-02 20:14             ` Linus Torvalds
  2007-02-02 20:58               ` Davide Libenzi
@ 2007-02-02 21:30               ` Alan
  2007-02-02 21:30                 ` Linus Torvalds
  1 sibling, 1 reply; 92+ messages in thread
From: Alan @ 2007-02-02 21:30 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Ingo Molnar, Zach Brown, linux-kernel, linux-aio,
	Suparna Bhattacharya, Benjamin LaHaise

> They are NOT the generic co-routine that some languages support natively. 
> So I think trying to call them coroutines would be even more misleading 
> than calling them fibrils.

It's actually pretty damned close to the Honeywell B co-routine package, with
a kernel twist to be honest.

> ends up looking like. There's a good reason why people program mostly in 
> linear flow: that's how people think consciously - even if it's obviously 
> not how the brain actually works).

The IRQ example below is an example of how it linearizes - so it cuts
both ways like most tools, admittedly one of the blades is at the handle
end in this case ...

> Basically, what I'm hoping can come out of this (and this is a simplistic 
> example, but perhaps exactly *because* of that it hopefully also shows 
> that we canactually make *simple* interfaces for complex asynchronous 
> things):
> 
> 	struct one_entry *prev = NULL;
> 	struct dirent *de;
> 
> 	while ((de = readdir(dir)) != NULL) {
> 		struct one_entry *entry = malloc(..);
> 
> 		/* Add it to the list, fill in the name */
> 		entry->next = prev;
> 		prev = entry;
> 		strcpy(entry->name, de->d_name);
> 
> 		/* Do the stat lookup async */
> 		async_stat(de->d_name, &entry->stat_buf);
> 	}
> 	wait_for_async();

The brown and sticky will hit the rotating air impeller pretty hard if you
are not very careful about how that ends up scheduled. It's one thing to
exploit the ability to pull all the easy lookups out in advance, and
another, having created all that parallelism, to turn it into sane disk
scheduling and wakeups without a scaling hit. But you do at least have the
opportunity to exploit it I guess.

> > You get some other funny things from co-routines which are very powerful,
> > very dangerous, or plain insane
> 
> You forgot "very hard to think about". 

I'm not sure handing a fibril off to another task is that hard to think
about. It's not easy to turn it around as an async_exit() keeping the
other fibrils around because of the mass of rules and behaviours tied to
process exit, but it's perhaps not impossible.

Other minor evil. If we use fibrils we need to be careful we
know in advance how many fibrils an operation needs so we don't deadlock
on them in critical places like writeout paths when we either hit the per
task limit or we have no page for another stack.

Alan

^ permalink raw reply	[flat|nested] 92+ messages in thread

* Re: [PATCH 2 of 4] Introduce i386 fibril scheduling
  2007-02-02 21:30               ` Alan
@ 2007-02-02 21:30                 ` Linus Torvalds
  2007-02-02 22:42                   ` Ingo Molnar
  2007-02-02 22:48                   ` Alan
  0 siblings, 2 replies; 92+ messages in thread
From: Linus Torvalds @ 2007-02-02 21:30 UTC (permalink / raw)
  To: Alan
  Cc: Ingo Molnar, Zach Brown, linux-kernel, linux-aio,
	Suparna Bhattacharya, Benjamin LaHaise



On Fri, 2 Feb 2007, Alan wrote:
> 
> The brown and sticky will hit the rotating air impeller pretty hard if you
> are not very careful about how that ends up scheduled

Why do you think that?

With cooperative scheduling (like the example Zach posted), there is 
absolutely no "brown and sticky" wrt any CPU usage. Which is why 
cooperative scheduling is a *good* thing. If you want to blow up your 
1024-node CPU cluster, you'd do it with "real threads".

Also, with sane default limits of fibrils per process (say, in the 5-10 range), 
it also ends up being good for IO. No "insane" IO bombs, but an easy way 
for users to just get a reasonable amount of IO parallelism without 
having to use threading (which is hard).

So, best of both worlds.

Yes, *of*course* you want to have limits on outstanding work. And yes, a 
database server would set those limits much higher ("Only a thousand 
outstanding IO requests? Can we raise that to ten thousand, please?") than 
a regular process ("default: 5, and the super-user can raise it for you if 
you're good").

But there really shouldn't be any downsides.

(Of course, there will be downsides. I'm sure there will be. But I don't 
see any really serious and obvious ones).

> Other minor evil. If we use fibrils we need to be careful we
> know in advance how many fibrils an operation needs so we don't deadlock
> on them in critical places like writeout paths when we either hit the per
> task limit or we have no page for another stack.

Since we'd only create fibrils on a system call entry level, and system 
calls are independent, how would you do that anyway?

Once a fibril has been created, it will *never* depend on any other fibril 
resources ever again. At least not in any way that any normal non-fibril 
call wouldn't already do as far as I can see.

		Linus

^ permalink raw reply	[flat|nested] 92+ messages in thread

* Re: [PATCH 2 of 4] Introduce i386 fibril scheduling
  2007-02-02 20:58               ` Davide Libenzi
@ 2007-02-02 21:09                 ` Linus Torvalds
  0 siblings, 0 replies; 92+ messages in thread
From: Linus Torvalds @ 2007-02-02 21:09 UTC (permalink / raw)
  To: Davide Libenzi
  Cc: Alan, Ingo Molnar, Zach Brown, Linux Kernel Mailing List,
	linux-aio, Suparna Bhattacharya, Benjamin LaHaise



On Fri, 2 Feb 2007, Davide Libenzi wrote:
> 
> Actually, coroutines are not too bad to program once you have a 
> total-coverage async scheduler to run them.

No, no, I don't disagree at all. In fact, I agree emphatically.

It's just that you need the scheduler to run them, in order to not "see" 
them as coroutines. Then, you can program everything *as if* it was just a 
regular declarative linear language with multiple threads.

And that gets us the same programming interface as we always have, and 
people can forget about the fact that in a very real sense, they are using 
coroutines with the scheduler just keeping track of it all for them.

After all, that's what we do between processes *anyway*. You can 
technically see the kernel as one big program that uses coroutines and the 
scheduler just keeping track of every coroutine instance. It's just that I 
doubt that any kernel programmer really thinks in those terms. You *think* 
in terms of "threads".

			Linus

^ permalink raw reply	[flat|nested] 92+ messages in thread

* Re: [PATCH 2 of 4] Introduce i386 fibril scheduling
  2007-02-02 20:14             ` Linus Torvalds
@ 2007-02-02 20:58               ` Davide Libenzi
  2007-02-02 21:09                 ` Linus Torvalds
  2007-02-02 21:30               ` Alan
  1 sibling, 1 reply; 92+ messages in thread
From: Davide Libenzi @ 2007-02-02 20:58 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Alan, Ingo Molnar, Zach Brown, Linux Kernel Mailing List,
	linux-aio, Suparna Bhattacharya, Benjamin LaHaise

[-- Attachment #1: Type: TEXT/PLAIN, Size: 1807 bytes --]

On Fri, 2 Feb 2007, Linus Torvalds wrote:

> > You get some other funny things from co-routines which are very powerful,
> > very dangerous, or plain insane
> 
> You forgot "very hard to think about". 
> 
> We DO NOT want coroutines in general. It's clever, but it's
>  (a) impossible to do without language support that C doesn't have, or 
>      some really really horrid macro constructs that really only work for 
>      very specific and simple cases.
>  (b) very non-intuitive unless you've worked with coroutines a lot (and 
>      almost nobody has)

Actually, coroutines are not too bad to program once you have a 
total-coverage async scheduler to run them. The attached (very sketchy) 
example uses libpcl ( http://www.xmailserver.org/libpcl.html ) and epoll 
as scheduler (but here you can really use anything). You can implement 
coroutines in many way, from C preprocessor macros up to anything, but in 
the libpcl case they are simply switched stacks. Like fibrils are supposed 
to be. The problem is that in order to make a real-life example of 
coroutine-based application work, you need everything that can put you at 
sleep (syscalls or any external library call you have no control on) 
implemented in an async way. And what I ended up doing is exactly what Zab 
did inside the kernel. In my case a dynamic pool of (userspace) threads 
servicing any non-native potentially pre-emptive call, and signaling the 
result to a pollable fd (pipe in my case) that is integrated in the epoll 
(poll/select whatever) scheduler.
I personally find Zab's idea a really good one, since it allows for generic 
kernel async implementation, w/out the burden of dirtying kernel code 
paths with AIO knowledge. Be it fibrils or real kthreads, it is IMO 
definitely worth a very close look.
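
In sketch form, the pool-thread half of that (a blocking call pushed off
to a worker, with the result signalled through a pipe that the epoll
scheduler watches like any other fd) looks roughly like this - only a
sketch, all of the names below are made up:

	#include <pthread.h>
	#include <unistd.h>

	struct blocking_req {
		long (*fn)(void *arg);	/* the potentially blocking call */
		void *arg;
		long result;
		int done_fd;		/* write side of the completion pipe */
	};

	/* worker: run the blocking call, then wake the epoll loop via the pipe */
	static void *blocking_worker(void *p)
	{
		struct blocking_req *req = p;

		req->result = req->fn(req->arg);
		write(req->done_fd, &req, sizeof(req));	/* pipe turns readable */

		return NULL;
	}

	/*
	 * submit: hand the call to a detached thread; the read side of the
	 * pipe is registered with the epoll scheduler just like a socket,
	 * so the submitting coroutine simply co_resume()s until it fires.
	 */
	static int submit_blocking(struct blocking_req *req)
	{
		pthread_t tid;

		if (pthread_create(&tid, NULL, blocking_worker, req))
			return -1;
		pthread_detach(tid);
		return 0;
	}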




- Davide


[-- Attachment #2: Type: TEXT/x-csrc, Size: 2147 bytes --]


#include <stdlib.h>
#include <errno.h>
#include <unistd.h>
#include <sys/types.h>
#include <sys/socket.h>
#include <sys/epoll.h>
#include <pcl.h>	/* libpcl: coroutine_t, co_create(), co_call(), co_resume(), co_exit() */

#define STACKSIZE	(8 * 1024)		/* per-coroutine stack size */

/* assumed to be set up at init time (epoll_create() etc.) */
static int kdpfd;				/* the epoll instance */
static struct epoll_event *events;		/* epoll_wait() result buffer */
static int maxfds;				/* size of that buffer */

struct eph_conn {
	int sfd;
	unsigned int events, revents;
	coroutine_t co;
};



int eph_new_conn(int sfd, void (*func)(void *)) {
	struct eph_conn *conn;
	struct epoll_event ev;

	conn = (struct eph_conn *) malloc(sizeof(struct eph_conn));
	if (conn == NULL)
		return -1;

	conn->sfd = sfd;
	conn->co = co_create(func, conn, NULL, STACKSIZE);

	ev.events = 0;
	ev.data.ptr = conn;
	epoll_ctl(kdpfd, EPOLL_CTL_ADD, sfd, &ev);

	co_call(conn->co);

	return 0;
}

void eph_exit_conn(struct eph_conn *conn) {
	struct epoll_event ev;

	epoll_ctl(kdpfd, EPOLL_CTL_DEL, conn->sfd, &ev);
	co_exit();
}

int eph_connect(struct eph_conn *conn, const struct sockaddr *serv_addr, socklen_t addrlen) {

	if (connect(conn->sfd, serv_addr, addrlen) == -1) {
		if (errno != EWOULDBLOCK && errno != EINPROGRESS)
			return -1;
		co_resume();
		if (conn->revents & (EPOLLERR | EPOLLHUP))
			return -1;
	}
	return 0;
}

int eph_read(struct eph_conn *conn, void *buf, int nbyte) {
	int n;

	while ((n = read(conn->sfd, buf, nbyte)) < 0) {
		if (errno == EINTR)
			continue;
		if (errno != EAGAIN && errno != EWOULDBLOCK)
			return -1;
		co_resume();
	}
	return n;
}

int eph_write(struct eph_conn *conn, void const *buf, int nbyte) {
	int n;

	while ((n = write(conn->sfd, buf, nbyte)) < 0) {
		if (errno == EINTR)
			continue;
		if (errno != EAGAIN && errno != EWOULDBLOCK)
			return -1;
		co_resume();
	}
	return n;
}

int eph_accept(struct eph_conn *conn, struct sockaddr *addr, int *addrlen) {
	int sfd;

	while ((sfd = accept(conn->sfd, addr, (socklen_t *) addrlen)) < 0) {
		if (errno == EINTR)
			continue;
		if (errno != EAGAIN && errno != EWOULDBLOCK)
			return -1;
		co_resume();
	}
	return sfd;
}

int eph_scheduler(int loop, long timeout) {
	int i, nfds;
	struct eph_conn *conn;
	struct epoll_event *cevents;

	do {
		nfds = epoll_wait(kdpfd, events, maxfds, timeout);

		for (i = 0, cevents = events; i < nfds; i++, cevents++) {
			conn = cevents->data.ptr;
			conn->revents = cevents->events;
			if (conn->revents & conn->events)
				co_call(conn->co);
		}
	} while (loop);

	return 0;
}


^ permalink raw reply	[flat|nested] 92+ messages in thread

* Re: [PATCH 2 of 4] Introduce i386 fibril scheduling
  2007-02-02 19:59           ` Alan
@ 2007-02-02 20:14             ` Linus Torvalds
  2007-02-02 20:58               ` Davide Libenzi
  2007-02-02 21:30               ` Alan
  2007-02-05 16:44             ` Zach Brown
  1 sibling, 2 replies; 92+ messages in thread
From: Linus Torvalds @ 2007-02-02 20:14 UTC (permalink / raw)
  To: Alan
  Cc: Ingo Molnar, Zach Brown, linux-kernel, linux-aio,
	Suparna Bhattacharya, Benjamin LaHaise



On Fri, 2 Feb 2007, Alan wrote:
>
> This one got shelved while I sorted other things out as it warranted a
> longer look. Some comments follow, but firstly can we please bury this
> "fibril" name. The constructs Zach is using appear to be identical to
> co-routines, and they've been called that in computer science literature
> for fifty years. They are one of the great and somehow forgotten ideas.
> (and I admit I've used them extensively in past things where it's
> wonderful for multi-player gaming so I'm a convert already).

Well, they are indeed coroutines, but they are coroutines in the same 
sense any "CPU scheduler" ends up being a coroutine.

They are NOT the generic co-routine that some languages support natively. 
So I think trying to call them coroutines would be even more misleading 
than calling them fibrils.

In other words, the whole *point* of the fibril is that you can do 
"coroutine-like stuff" while using a "normal functional linear programming 
paradigm".

Wouldn't you agree?

(I love the concept of coroutines, but I absolutely detest what the code 
ends up looking like. There's a good reason why people program mostly in 
linear flow: that's how people think consciously - even if it's obviously 
not how the brain actually works).

And we *definitely* don't want to have a coroutine programming interface 
in the kernel. Not in C.

> The stuff however isn't as free as you make out. Current kernel logic
> knows about various things being "safe" but with fibrils you have to
> address additional questions such as "What happens if I issue an I/O and
> change priority". You also have an 800lb gorilla hiding behind a tree
> waiting for you in privilege and permission checking.

This is why I think it should be 100% clear that things happen in process 
context. That just answers everything. If you want to synchronize with 
async events and change IO priority, you should do exactly that:

	wait_for_async();
	ioprio(newpriority);

and that "solves" that problem. Leave it to user space.

> Right now current->*u/gid is safe across a syscall start to end, with an
> asynchronous setuid all hell breaks loose. I'm not saying we shouldn't do
> this, in fact we'd be able to do some of the utterly moronic POSIX thread
> uid handling in kernel space if we did, just that it isn't free. We have
> locking rules defined by the magic serializing construct called
> "the syscall" and you break those.

I agree. As mentioned, we probably will have fallout. 

> The number of co-routines and stacks can be dealt with two ways - you use
> small stacks allocated when you create a fibril, or you grab a page, use
> separate IRQ stacks and either fail creation with -ENOBUFS etc which
> drops work on user space, or block (for which cases ??) which also means
> an overhead on co-routine exits. That can be tunable, for embedded easily
> tuned right down.

Right. It should be possible to just say "use a max parallelism factor of 
5", and if somebody submits a hundred AIO calls and they all block, when 
it hits #6, it will just do it synchronously.
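
In sketch form (none of these names exist anywhere - they are only meant
to illustrate the intended fallback behaviour):

	/*
	 * Hypothetical submission path: once too many fibrils for this task
	 * are already blocked, just run the call synchronously in the
	 * caller, exactly like a plain syscall behaves today.
	 */
	long submit_async_call(struct task_struct *tsk, struct async_call *call)
	{
		if (tsk->nr_busy_fibrils >= tsk->max_fibrils)
			return run_synchronously(call);	/* caller blocks */

		return launch_on_fibril(tsk, call);	/* queued, returns early */
	}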

Basically, what I'm hoping can come out of this (and this is a simplistic 
example, but perhaps exactly *because* of that it hopefully also shows 
that we can actually make *simple* interfaces for complex asynchronous 
things):

	struct one_entry *prev = NULL;
	struct dirent *de;

	while ((de = readdir(dir)) != NULL) {
		struct one_entry *entry = malloc(..);

		/* Add it to the list, fill in the name */
		entry->next = prev;
		prev = entry;
		strcpy(entry->name, de->d_name);

		/* Do the stat lookup async */
		async_stat(de->d_name, &entry->stat_buf);
	}
	wait_for_async();
	.. Ta-daa! All done ..

and it *should* allow us to do all the stat lookup asynchronously.

Done right, this should basically be no slower than doing it with a real 
stat() if everything was cached. That would kind of be the holy grail 
here.
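
(And for completeness: once wait_for_async() returns, the consumer side of
that sketch is just a plain walk of the list - same hypothetical
structures and calls as above:)

	struct one_entry *p;

	for (p = prev; p != NULL; p = p->next)
		printf("%8lld %s\n",
			(long long) p->stat_buf.st_size, p->name);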

> You get some other funny things from co-routines which are very powerful,
> very dangerous, or plain insane

You forgot "very hard to think about". 

We DO NOT want coroutines in general. It's clever, but it's
 (a) impossible to do without language support that C doesn't have, or 
     some really really horrid macro constructs that really only work for 
     very specific and simple cases.
 (b) very non-intuitive unless you've worked with coroutines a lot (and 
     almost nobody has)

> Linus will now tell me I'm out of my tree...

I don't think you're wrong in theory, I just think that in practice, 
within the confines of (a) existing code, (b) existing languages, and (c) 
existing developers, we really REALLY don't want to expose coroutines as 
such.

But if you wanted to point out that what we want to do is get the 
ADVANTAGES of coroutines, without actually having to program them as such, 
then yes, I agree 100%. But we shouldn't call them coroutines, because the 
whole point is that as far as the user interface is concerned, they don't 
look like that. In the kernel, they just look like normal linear 
programming.

		Linus

^ permalink raw reply	[flat|nested] 92+ messages in thread

* Re: [PATCH 2 of 4] Introduce i386 fibril scheduling
  2007-02-02 15:56         ` Linus Torvalds
@ 2007-02-02 19:59           ` Alan
  2007-02-02 20:14             ` Linus Torvalds
  2007-02-05 16:44             ` Zach Brown
  2007-02-02 22:21           ` Ingo Molnar
  1 sibling, 2 replies; 92+ messages in thread
From: Alan @ 2007-02-02 19:59 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Ingo Molnar, Zach Brown, linux-kernel, linux-aio,
	Suparna Bhattacharya, Benjamin LaHaise

This one got shelved while I sorted other things out as it warranted a
longer look. Some comments follow, but firstly can we please bury this
"fibril" name. The constructs Zach is using appear to be identical to
co-routines, and they've been called that in computer science literature
for fifty years. They are one of the great and somehow forgotten ideas.
(and I admit I've used them extensively in past things where it's
wonderful for multi-player gaming so I'm a convert already).

The stuff however isn't as free as you make out. Current kernel logic
knows about various things being "safe" but with fibrils you have to
address additional questions such as "What happens if I issue an I/O and
change priority". You also have an 800lb gorilla hiding behind a tree
waiting for you in privilege and permission checking.

Right now current->*u/gid is safe across a syscall start to end, with an
asynchronous setuid all hell breaks loose. I'm not saying we shouldn't do
this, in fact we'd be able to do some of the utterly moronic POSIX thread
uid handling in kernel space if we did, just that it isn't free. We have
locking rules defined by the magic serializing construct called
"the syscall" and you break those.

I'd expect the odd other gorilla waiting to mug you as well and the ones
nobody has thought of will be the worst 8)

The number of co-routines and stacks can be dealt with two ways - you use
small stacks allocated when you create a fibril, or you grab a page, use
separate IRQ stacks and either fail creation with -ENOBUFS etc which
drops work on user space, or block (for which cases ??) which also means
an overhead on co-routine exits. That can be tunable, for embedded easily
tuned right down.
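
A rough sketch of that second option - every name here apart from the
allocator calls is invented, it is only meant to show the failure mode
pushing work back to user space:

	#include <linux/sched.h>
	#include <linux/slab.h>
	#include <linux/err.h>

	static struct fibril *fibril_create(void)
	{
		struct fibril *f;

		if (current->nr_fibrils >= max_fibrils_per_task)
			return ERR_PTR(-ENOBUFS);	/* user space retries or falls back */

		f = kzalloc(sizeof(*f), GFP_KERNEL);
		if (!f)
			return ERR_PTR(-ENOMEM);

		f->ti = alloc_thread_info(current);	/* the one-page stack */
		if (!f->ti) {
			kfree(f);
			return ERR_PTR(-ENOBUFS);
		}
		return f;
	}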

Traditional co-routines have clear notions of being able to create a
co-routine, stack them and fire up specific ones. In part this is done
because many things expressed in this way know what to fire up next. It's
also a very clean way to express a driver problem with a lot of state.

Essentially a co-routine is simply making "%esp" roughly the same as
the C++ world's "self". 

You get some other funny things from co-routines which are very powerful,
very dangerous, or plain insane depending upon your view of life. One big
one is the ability for real men (and women) to do stuff like this,
because you don't need to keep the context attached to the same task.

	send_reset_command(dev);
	wait_for_irq_event(dev->irq);
	/* co-routine continues in IRQ context here */
	clean_up_reset_command(dev);
	exit_irq_event();
	/* co-routine continues out of IRQ context here */
	send_identify_command(dev);

Notice we just dealt with all the IRQ stack problems the moment an IRQ is
a co-routine transfer 8)

Ditto with timers, although for the kernel that might not be smart as we
have a lot of timers.

Less insanely you can create a context, start doing stuff in it and then
pass it to someone else, local variables, state and all. This one is
actually rather useful for avoiding a lot of the 'D' state crap in the
kernel.

For example we have driver code that sleeps uninterruptibly because it's
too hard to undo the mess and get out of the current state if it is
interrupted. In the world of sending other people co-routines you just do
this

	coroutine_set(MUST_COMPLETE);

and in exit

	foreach(coroutine)
		if(coroutine->flags & MUST_COMPLETE)
			inherit_coroutine(init, coroutine);

and obviously you don't pass any over that will then not do the right
thing before accessing user space (well unless implementing
'read_for_someone_else()' or other strange syscalls - like ptrace...)

Other questions really relate to the scheduling - Zach, do you intend
schedule_fibrils() to be a call that code would make, or just from schedule()?


Linus will now tell me I'm out of my tree...


Alan (who used to use Co-routines in real languages on 36bit
computers with 9bit bytes before learning C)

^ permalink raw reply	[flat|nested] 92+ messages in thread

* Re: [PATCH 2 of 4] Introduce i386 fibril scheduling
  2007-02-02 10:49       ` Ingo Molnar
@ 2007-02-02 15:56         ` Linus Torvalds
  2007-02-02 19:59           ` Alan
  2007-02-02 22:21           ` Ingo Molnar
  0 siblings, 2 replies; 92+ messages in thread
From: Linus Torvalds @ 2007-02-02 15:56 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Zach Brown, linux-kernel, linux-aio, Suparna Bhattacharya,
	Benjamin LaHaise



On Fri, 2 Feb 2007, Ingo Molnar wrote:
> 
> My only suggestion was to have a couple of transparent kernel threads 
> (not fibrils) attached to a user context that does asynchronous 
> syscalls! Those kernel threads would be 'switched in' if the current 
> user-space thread blocks - so instead of having to 'create' any of them 
> - the fast path would be to /switch/ them to under the current 
> user-space, so that user-space processing can continue under that other 
> thread!

But in that case, you really do end up with "fibrils" anyway. 

Because those fibrils are what would be the state for the blocked system 
calls when they aren't scheduled.

We may have a few hundred thousand system calls a second (maybe that's not 
actually reasonable, but it should be what we *aim* for), and 99% of them 
will hopefully hit the cache and never need any separate IO, but even if 
it's just 1%, we're talking about thousands of threads.

I do _not_ think that it's reasonable to have thousands of threads' state 
around just "in case". Especially if all those threadlets are then 
involved in signals etc - something that they are totally uninterested in. 

I think it's a lot more reasonable to have just the kernel stack page for 
"this was where I was when I blocked". IOW, a fibril-like thing. You need 
some data structure to set up the state *before* you start doing any 
threads at all, because hopefully the operation will be totally 
synchronous, and no secondary thread is ever really needed!

What I like about fibrils is that they should be able to handle the cached 
case well: the case where no "real" scheduling (just the fibril stack 
switches) takes place.

Now, most traditional DB loads would tend to use AIO only when they "know" 
that real IO will take place (the AIO call itself will probably be 
O_DIRECT most of the time). So I suspect that a lot of those users will 
never really have the cached case, but one of my hopes is to be able to do 
exactly the things that we have *not* done well: asynchronous file opens 
and pathname lookups, which is very common in a file server.

If done *really* right, a perfectly normal app could do things like 
asynchronous stat() calls to fill in the readdir results. In other words, 
what *I* would like to see is the ability to have something *really* 
simple like "ls" use this, without it actually being a performance hit 
for the common case where everything is cached.

Have you done "ls -l" on a big uncached directory where the inodes 
are all over the disk lately? You can hear the disk whirr. THAT is the 
kind of "normal user" thing I'd like to be able to fix, and the db case is 
actually secondary. The DB case is much much more limited (ok, so somebody 
pointed out that they want slightly more than just read/write, but 
still.. We're talking "special code".)

> [ finally, i think you totally ignored my main argument, state machines.

I ignored your argument, because it's not really relevant. The fact that 
networking (and TCP in particular) has state machines is because it is a 
packetized environment. Nothing else is. Think pathname lookup etc. They 
are all *fundamentally* environments with a call stack.

So the state machine argument is totally bogus - it results in a 
programming model that simply doesn't match the *normal* setup. You want 
the kernel programming model to appear "linear" even when it isn't, 
because it's too damn hard to think nonlinearly.

Yes, we could do pathname lookup with that kind of insane setup too. But 
it would be HORRID!

			Linus

^ permalink raw reply	[flat|nested] 92+ messages in thread

* Re: [PATCH 2 of 4] Introduce i386 fibril scheduling
  2007-02-01 13:19       ` Christoph Hellwig
  2007-02-01 13:52         ` Ingo Molnar
@ 2007-02-02 13:23         ` Andi Kleen
  1 sibling, 0 replies; 92+ messages in thread
From: Andi Kleen @ 2007-02-02 13:23 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Ingo Molnar, Zach Brown, linux-kernel, linux-aio,
	Suparna Bhattacharya, Benjamin LaHaise, Linus Torvalds

Christoph Hellwig <hch@infradead.org> writes:
> 
> I tend to agree.  Note that one thing we should be doing one day (not
> only if we want to use it for aio) is to make kernel threads more
> lightweight.  There is a lot of baggage we keep around in task_struct
> and co that only makes sense for threads that have a user space part and
> isn't or shouldn't be needed for a purely kernel-resident thread.

I suspect you will get a lot of this for free from the current namespace
efforts.

-Andi

^ permalink raw reply	[flat|nested] 92+ messages in thread

* Re: [PATCH 2 of 4] Introduce i386 fibril scheduling
  2007-02-01 13:02     ` Ingo Molnar
  2007-02-01 13:19       ` Christoph Hellwig
  2007-02-01 21:52       ` Zach Brown
@ 2007-02-02 13:22       ` Andi Kleen
  2 siblings, 0 replies; 92+ messages in thread
From: Andi Kleen @ 2007-02-02 13:22 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Zach Brown, linux-kernel, linux-aio, Suparna Bhattacharya,
	Benjamin LaHaise, Linus Torvalds

Ingo Molnar <mingo@elte.hu> writes:

> and for one of the most important IO 
> disciplines, networking, that is reality already.

Not 100% -- a few things in TCP/IP at least are blocking still.
Mostly relatively obscure things though.

Also the sockets model is currently incompatible with direct zero-copy RX/TX, 
which needs fixing.

-Andi

^ permalink raw reply	[flat|nested] 92+ messages in thread

* Re: [PATCH 2 of 4] Introduce i386 fibril scheduling
  2007-02-01 20:07     ` Linus Torvalds
@ 2007-02-02 10:49       ` Ingo Molnar
  2007-02-02 15:56         ` Linus Torvalds
  0 siblings, 1 reply; 92+ messages in thread
From: Ingo Molnar @ 2007-02-02 10:49 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Zach Brown, linux-kernel, linux-aio, Suparna Bhattacharya,
	Benjamin LaHaise


* Linus Torvalds <torvalds@linux-foundation.org> wrote:

> So stop blathering about scheduling costs, RT kernels and interrupts. 
> Interrupts generally happen a few thousand times a second. This is 
> something you want to do a *million* times a second, without any IO 
> happening at all except for when it has to.

we might be talking past each other.

i never suggested every aio op should create/destroy a kernel thread!

My only suggestion was to have a couple of transparent kernel threads 
(not fibrils) attached to a user context that does asynchronous 
syscalls! Those kernel threads would be 'switched in' if the current 
user-space thread blocks - so instead of having to 'create' any of them 
- the fast path would be to /switch/ them to under the current 
user-space, so that user-space processing can continue under that other 
thread!

That means that in the 'switch kernel context' fastpath it simply needs 
to copy the blocked threads' user-space ptregs (~64 bytes) to its own 
kernel stack, and then it can do a return-from-syscall without 
user-space noticing the switch! Never would we really see the cost of 
kernel thread creation. We would never see that cost in the fully cached 
case (no other thread is needed then), nor would we see it in the 
blocking-IO case, due to pooling. (there are some other details related 
to things like the FPU context, but you get the idea.)
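
In sketch form, that hand-off is basically a copy of the saved user-space
register frame - switch_user_context() below is purely illustrative, it is
not anything from the actual patches:

	#include <linux/sched.h>

	/*
	 * Illustration only: a pool thread inherits the blocked submitter's
	 * saved user-space registers, so it can do the return-from-syscall
	 * to user-space in the submitter's place.
	 */
	static void switch_user_context(struct task_struct *blocked,
					struct task_struct *pool_thread)
	{
		/* the saved user-space frame sits at the top of the kernel stack */
		*task_pt_regs(pool_thread) = *task_pt_regs(blocked);

		/* FPU state, TLS descriptors etc. would need similar care */
	}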

Let me quote Zach's reply to my suggestions:

| It'd certainly be doable to throw together a credible attempt to 
| service "asys" system call submission with full-on kernel threads. 
| That seems like reasonable due diligence to me.  If full-on threads 
| are almost as cheap, great. If fibrils are so much cheaper that they 
| seem to warrant investing in, great.

that's all i wanted to see being considered!

Please ignore my points about scheduling costs - i only talked about 
them at length because the only fundamental difference between kernel 
threads and fibrils is their /scheduling/ properties. /Not/ the 
setup/teardown costs - those are not relevant /precisely/ because they 
can be pooled and because they happen relatively rarely, compared to the 
cached case. The 'switch to the blocked thread's ptregs' operation also 
involves a context-switch under this design. That's why i was talking 
about scheduling so much: the /only/ true difference between fibrils and 
kernel threads is their /scheduling/.

I believe this is the point where your argument fails:

> - setup/teardown costs. Both memory and CPU. This is where the current
>   threads simply don't work. The setup cost of doing a clone/exit is 
>   actually much higher than the cost of doing the whole operation, 
>   most of the time.

you are comparing apples to oranges - i never said we should 
create/destroy a kernel thread for every async op. That would be insane!

what we need to support asynchronous system-calls is the ability to pick 
up an /already created/ kernel thread from a pool of per-task kernel 
threads and to switch it to under the current user-space and return to 
the user-space stack with that new kernel thread running. (The other, 
blocked kernel thread stays blocked and is returned into the pool of 
'pending' AIO kernel threads.) And this only needs to happen in the 
'cachemiss' case anyway. In the 'cached' case no other kernel thread 
would be involved at all, the current one just falls straight through 
the system-call.

my argument is that the whole notion of cutting this at the kernel stack 
and thread info level and making fibrils in essence a separate 
scheduling entitity is wrong, wrong, wrong. Why not use plain kernel 
threads for this?

[ finally, i think you totally ignored my main argument, state machines.
  The networking stack is a full and very nice state machine. It's
  kicked from user-space, and zillions of small contexts (sockets) are
  living on without any of the originating tasks having to be involved.
  So i'm still holding to the fundamental notion that within the kernel
  this form of AIO is a nice but /secondary/ mechanism. If a subsystem
  is able to pull it off, it can implement asynchrony via a state
  machine - and it will outperform any thread based AIO. Or not. We'll
  see. For something like the VFS i doubt we'll see (and i doubt we
  /want/ to see) a 'native' state-machine implementation.

  this is btw. quite close to the Tux model of doing asynchronous block
  IO and asynchronous VFS events such as asynchronous open(). Tux uses a
  pool of kernel threads to pass blocking work to, while not holding up
  the 'main' thread. But the main Tux speedup comes from having a native
  state machine for all the networking IO. ]

	Ingo

^ permalink raw reply	[flat|nested] 92+ messages in thread

* Re: [PATCH 2 of 4] Introduce i386 fibril scheduling
  2007-02-01 22:23         ` Benjamin LaHaise
@ 2007-02-01 22:37           ` Zach Brown
  0 siblings, 0 replies; 92+ messages in thread
From: Zach Brown @ 2007-02-01 22:37 UTC (permalink / raw)
  To: Benjamin LaHaise
  Cc: Ingo Molnar, linux-kernel, linux-aio, Suparna Bhattacharya,
	Linus Torvalds

> Priorities cannot be shared, as they have to adapt to the per-request
> priority when we get down to the nitty gritty of POSIX AIO, as  
> otherwise
> realtime issues like keepalive transmits will be handled incorrectly.

Well, maybe not *blind* sharing.  But something more than the  
disconnect threads currently have with current->ioprio.

Today an existing kernel thread would most certainly ignore a  
sys_ioprio_set() in the submitter and then handle an aio syscall with  
an old current->ioprio.

Something more smart than that is all I'm on about.

- z

^ permalink raw reply	[flat|nested] 92+ messages in thread

* Re: [PATCH 2 of 4] Introduce i386 fibril scheduling
  2007-02-01 21:52       ` Zach Brown
@ 2007-02-01 22:23         ` Benjamin LaHaise
  2007-02-01 22:37           ` Zach Brown
  0 siblings, 1 reply; 92+ messages in thread
From: Benjamin LaHaise @ 2007-02-01 22:23 UTC (permalink / raw)
  To: Zach Brown
  Cc: Ingo Molnar, linux-kernel, linux-aio, Suparna Bhattacharya,
	Linus Torvalds

On Thu, Feb 01, 2007 at 01:52:13PM -0800, Zach Brown wrote:
> >let me clarify this: i very much like your AIO patchset in general, in
> >the sense that it 'completes' the AIO implementation: finally  
> >everything
> >can be done via it, greatly increasing its utility and hopefully its
> >penetration. This is the most important step, by far.
> 
> We violently agree on this :).

There is also the old kernel_thread based method that should probably be 
compared, especially if pre-created threads are thrown into the mix.  Also, 
since the old days, a lot of thread scaling issues have been fixed that 
could even make userland threads more viable.

> Would your strategy be to update the syscall implementations to share  
> data in task_struct so that there isn't as significant a change in  
> behaviour?  (sharing current->ioprio, instead of just inheriting it,  
> for example.).  We'd be betting that there would be few of these and  
> that they'd be pretty reasonable to share?

Priorities cannot be shared, as they have to adapt to the per-request 
priority when we get down to the nitty gritty of POSIX AIO, as otherwise 
realtime issues like keepalive transmits will be handled incorrectly.

		-ben
-- 
"Time is of no importance, Mr. President, only life is important."
Don't Email: <dont@kvack.org>.

^ permalink raw reply	[flat|nested] 92+ messages in thread

* Re: [PATCH 2 of 4] Introduce i386 fibril scheduling
  2007-02-01 13:02     ` Ingo Molnar
  2007-02-01 13:19       ` Christoph Hellwig
@ 2007-02-01 21:52       ` Zach Brown
  2007-02-01 22:23         ` Benjamin LaHaise
  2007-02-02 13:22       ` Andi Kleen
  2 siblings, 1 reply; 92+ messages in thread
From: Zach Brown @ 2007-02-01 21:52 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: linux-kernel, linux-aio, Suparna Bhattacharya, Benjamin LaHaise,
	Linus Torvalds

> let me clarify this: i very much like your AIO patchset in general, in
> the sense that it 'completes' the AIO implementation: finally  
> everything
> can be done via it, greatly increasing its utility and hopefully its
> penetration. This is the most important step, by far.

We violently agree on this :).

> what i dont really like is /the particular/ concept above - the
> introduction of 'fibrils' as a hard distinction of kernel threads.  
> They
> are /almost/ kernel threads, but still by being different they create
> a lot of duplication and miss out on a good deal of features that  
> kernel
> threads have naturally.

I might quibble with some of the details, but I understand your  
fundamental concern.  I do.  I don't get up each morning *thrilled*  
by the idea of having to update lockdep, sysrq-t, etc, to understand  
these fibril things :).  The current fibril switch isn't nearly as  
clever as the lock-free task scheduling switch.  It'd be nice if we  
didn't have to do that work to optimize the hell out of it, sure.

> It kind of hurts to say this because i'm usually quite concept-happy -
> one can easily get addicted to the introduction of new core kernel
> concepts :-)

:)

> so my suggestions center around the notion of extending kernel threads
> to support the features you find important in fibrils:
>
>> would it be hard to redo your AIO patches based on a pool of plain
>> simple kernel threads?

It'd certainly be doable to throw together a credible attempt to  
service "asys" system call submission with full-on kernel threads.   
That seems like reasonable due diligence to me.  If full-on threads  
are almost as cheap, great.  If fibrils are so much cheaper that they  
seem to warrant investing in, great.

I am concerned about the change in behaviour if we fall back to full  
kernel threads, though.  I really, really, want aio syscalls to  
behave just like sync ones.

Would your strategy be to update the syscall implementations to share  
data in task_struct so that there isn't as significant a change in  
behaviour?  (sharing current->ioprio, instead of just inheriting it,  
for example.).  We'd be betting that there would be few of these and  
that they'd be pretty reasonable to share?

- z

^ permalink raw reply	[flat|nested] 92+ messages in thread

* Re: [PATCH 2 of 4] Introduce i386 fibril scheduling
  2007-02-01  8:36   ` Ingo Molnar
  2007-02-01 13:02     ` Ingo Molnar
@ 2007-02-01 20:07     ` Linus Torvalds
  2007-02-02 10:49       ` Ingo Molnar
  1 sibling, 1 reply; 92+ messages in thread
From: Linus Torvalds @ 2007-02-01 20:07 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Zach Brown, linux-kernel, linux-aio, Suparna Bhattacharya,
	Benjamin LaHaise



On Thu, 1 Feb 2007, Ingo Molnar wrote:
> 
> there's almost no scheduling cost from being able to arbitrarily 
> schedule a kernel thread - but there are /huge/ benefits in it.

That's a singularly *stupid* argument.

Of course scheduling is fast. That's the whole *point* of fibrils. They 
still schedule. Nobody claimed anything else. 

Bringing up RT kernels and scheduling latency is idiotic. It's like saying 
"we should do this because the sky is blue". Sure, that's true, but what 
the *hell* does Rayleigh scattering have to do with anything?

The cost has _never_ been scheduling. That was never the point. Why do you 
even bring it up? Only to make an argument that makes no sense?

The cost of AIO is

 - maintenance. It's a separate code-path, and it's one that simply doesn't 
   fit into anything else AT ALL. It works (mostly) for simple things, i.e. 
   reads and writes, but even there, it's really adding a lot of crud that 
   we could do without.

 - setup and teardown costs: both in CPU and in memory. These are the big 
   costs. It's especially true since a lot of AIO actually ends up cached. 
   The user program just wants the data - 99% of the time it's likely to 
   be there, and the whole point of AIO is to get at it cheaply, but not 
   block if it's not there.

So your scheduling arguments are inane. They totally miss the point. They 
have nothing to do with *anything*.

Ingo: everybody *agrees* that scheduling is cheap. Scheduling isn't the 
issue. Scheduling isn't even needed in the perfect path where the AIO 
didn't need to do any real IO (and that _is_ the path we actually would 
like to optimize most).

So instead of talking about totally irrelevant things, please keep your 
eyes on the ball.

So I claim that the ball is here:

 - cached data (and that is *especially* true of some of the more 
   interesting things we can do with a more generic AIO thing: path 
   lookup, inode filling (stat/fstat) etc usually has hit-rates in the 99% 
   range, but missing even just 1% of the time can be deadly, if the miss 
   costs you a hundred msec of not doing anything else!)

   Do the math. A "stat()" system call generally takes on the order of a 
   couple of microseconds. But if it misses even just 1% of the time (and 
   takes 100 msec when it does that, because there is other IO also 
   competing for the disk arm), ON AVERAGE it takes 1ms 
   (0.99 * ~2 usec + 0.01 * 100 msec ~= 1 msec).

   So what you should aim for is improving that number. The cached case 
   should hopefully still be in the microseconds, and the uncached case 
   should be nonblocking for the caller.

 - setup/teardown costs. Both memory and CPU. This is where the current 
   threads simply don't work. The setup cost of doing a clone/exit is 
   actually much higher than the cost of doing the whole operation, most 
   of the time. Remember: caches still work.

 - maintenance. Clearly AIO will always have some special code, but if we 
   can move the special code *away* from filesystems and networking and 
   all the thousands of device drivers, and into core kernel code, we've 
   done something good. And if we can extend it from just pure read/write 
   into just about *anything*, then people will be happy.

So stop blathering about scheduling costs, RT kernels and interrupts. 
Interrupts generally happen a few thousand times a second. This is 
something you want to do a *million* times a second, without any IO 
happening at all except for when it has to.

			Linus

^ permalink raw reply	[flat|nested] 92+ messages in thread

* Re: [PATCH 2 of 4] Introduce i386 fibril scheduling
  2007-02-01 17:13           ` Mark Lord
@ 2007-02-01 18:02             ` Ingo Molnar
  0 siblings, 0 replies; 92+ messages in thread
From: Ingo Molnar @ 2007-02-01 18:02 UTC (permalink / raw)
  To: Mark Lord
  Cc: Christoph Hellwig, Zach Brown, linux-kernel, linux-aio,
	Suparna Bhattacharya, Benjamin LaHaise, Linus Torvalds


* Mark Lord <lkml@rtr.ca> wrote:

> >also, we context-switch kernel threads in 350 nsecs on current 
> >hardware and the -rt kernel is certainly happy with that and runs all 
> >hardirqs
> 
> Ingo, how relevant is that "350 nsecs on current hardware" claim?
> 
> I don't mean that in a bad way, but my own experience suggests that 
> most people doing real hard RT (or tight soft RT) are not doing it on 
> x86 architectures.  But rather on lowly 1GHz (or less) ARM based 
> processors and the like.

it's not relevant to those embedded boards, but it's relevant to the AIO 
discussion, which centers around performance.

> For RT issues, those are the platforms I care more about, as those are 
> the ones that get embedded into real-time devices.

yeah. Nevertheless if you want to use -rt on your desktop (under Fedora 
4/5/6) you can track an rpmized+distroized full kernel package quite 
easily, via 3 easy commands:

   cd /etc/yum.repos.d
   wget http://people.redhat.com/~mingo/realtime-preempt/rt.repo

   yum install kernel-rt.x86_64   # on x86_64
   yum install kernel-rt          # on i686

which is closely tracking latest upstream -git. (for example, the 
current kernel-rt-2.6.20-rc7.1.rt3.0109.i686.rpm is based on 
2.6.20-rc7-git1, so if you want to run a kernel rpm that has all of 
Linus' latest commits from yesterday, this might be for you.)

it's rumored to be a quite smooth kernel ;-) So in this sense, because 
this also runs on all my testboxes by default, it matters on modern 
hardware too, at least to me. Today's commodity hardware is tomorrow's 
embedded hardware. If a kernel is good on today's colorful desktop 
hardware then it will be perfect for tomorrow's embedded hardware.

	Ingo

^ permalink raw reply	[flat|nested] 92+ messages in thread

* Re: [PATCH 2 of 4] Introduce i386 fibril scheduling
  2007-02-01 13:52         ` Ingo Molnar
@ 2007-02-01 17:13           ` Mark Lord
  2007-02-01 18:02             ` Ingo Molnar
  0 siblings, 1 reply; 92+ messages in thread
From: Mark Lord @ 2007-02-01 17:13 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Christoph Hellwig, Zach Brown, linux-kernel, linux-aio,
	Suparna Bhattacharya, Benjamin LaHaise, Linus Torvalds

Ingo Molnar wrote:
>
> also, we context-switch kernel threads in 350 nsecs on current hardware 
> and the -rt kernel is certainly happy with that and runs all hardirqs 

Ingo, how relevant is that "350 nsecs on current hardware" claim?

I don't mean that in a bad way, but my own experience suggests that
most people doing real hard RT (or tight soft RT) are not doing it
on x86 architectures.  But rather on lowly 1GHz (or less) ARM based
processors and the like.

For RT issues, those are the platforms I care more about,
as those are the ones that get embedded into real-time devices.

??

Cheers

^ permalink raw reply	[flat|nested] 92+ messages in thread

* Re: [PATCH 2 of 4] Introduce i386 fibril scheduling
  2007-02-01 13:19       ` Christoph Hellwig
@ 2007-02-01 13:52         ` Ingo Molnar
  2007-02-01 17:13           ` Mark Lord
  2007-02-02 13:23         ` Andi Kleen
  1 sibling, 1 reply; 92+ messages in thread
From: Ingo Molnar @ 2007-02-01 13:52 UTC (permalink / raw)
  To: Christoph Hellwig, Zach Brown, linux-kernel, linux-aio,
	Suparna Bhattacharya, Benjamin LaHaise, Linus Torvalds


* Christoph Hellwig <hch@infradead.org> wrote:

> I tend to agree.  Note that one thing we should be doing one day (not 
> only if we want to use it for aio) is to make kernel threads more 
> lightweight.  There is a lot of baggage we keep around in task_struct 
> and co that only makes sense for threads that have a user space part 
> and isn't or shouldn't be needed for a purely kernel-resident thread.

yeah. I'm totally open to such efforts. I'd also be most happy if this 
was primarily driven via the KAIO effort: i.e. to implement it via 
kernel threads and then to benchmark the hell out of it. I volunteer to 
trim whatever fat is left in kernel thread handling.

and if people agree with me that 'native' state-machine driven KAIO is 
what we ultimately want to achieve (it is certainly the best performing 
implementation) then i dont see the point in fibrils as an interim 
mechanism anyway. Lets just hide AIO complexities from userspace via 
kernel threads, and optimize this via two methods: by making kernel 
threads faster, and by simultaneously and gradually converting as much 
KAIO code as possible to a native state machine - which would not need 
any kind of kernel thread help anyway.

(plus as i mentioned previously, co-scheduling kernel threads with 
related user space threads on the same CPU might be something useful too 
- not just for KAIO, and we could add that too.)

also, we context-switch kernel threads in 350 nsecs on current hardware 
and the -rt kernel is certainly happy with that and runs all hardirqs 
and softirqs in separate kernel thread contexts. There's not /that/ much 
fat left to cut off - and if there's something more to optimize there 
then there are a good number of projects interested in that, not just 
the KAIO effort :)

	Ingo

^ permalink raw reply	[flat|nested] 92+ messages in thread

* Re: [PATCH 2 of 4] Introduce i386 fibril scheduling
  2007-02-01 13:02     ` Ingo Molnar
@ 2007-02-01 13:19       ` Christoph Hellwig
  2007-02-01 13:52         ` Ingo Molnar
  2007-02-02 13:23         ` Andi Kleen
  2007-02-01 21:52       ` Zach Brown
  2007-02-02 13:22       ` Andi Kleen
  2 siblings, 2 replies; 92+ messages in thread
From: Christoph Hellwig @ 2007-02-01 13:19 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Zach Brown, linux-kernel, linux-aio, Suparna Bhattacharya,
	Benjamin LaHaise, Linus Torvalds

[-- Attachment #1: Type: text/plain; charset=unknown-8bit, Size: 1080 bytes --]

On Thu, Feb 01, 2007 at 02:02:34PM +0100, Ingo Molnar wrote:
> what i dont really like is /the particular/ concept above - the 
> introduction of 'fibrils' as a hard distinction of kernel threads. They 
> are /almost/ kernel threads, but still by being different they create 
> a lot of duplication and miss out on a good deal of features that kernel 
> threads have naturally.
> 
> It kind of hurts to say this because i'm usually quite concept-happy - 
> one can easily get addicted to the introduction of new core kernel 
> concepts :-) But i really, really think we dont want to do fibrils but 
> we want to do kernel threads, and i havent really seen a discussion 
> about why they shouldnt be done via kernel threads.

I tend to agree.  Note that one thing we should be doing one day (not
only if we want to use it for aio) is to make kernel threads more
lightweight.  There is a lot of baggage we keep around in task_struct
and co that only makes sense for threads that have a user space part and
isn't or shouldn't be needed for a purely kernel-resident thread.

^ permalink raw reply	[flat|nested] 92+ messages in thread

* Re: [PATCH 2 of 4] Introduce i386 fibril scheduling
  2007-02-01  8:36   ` Ingo Molnar
@ 2007-02-01 13:02     ` Ingo Molnar
  2007-02-01 13:19       ` Christoph Hellwig
                         ` (2 more replies)
  2007-02-01 20:07     ` Linus Torvalds
  1 sibling, 3 replies; 92+ messages in thread
From: Ingo Molnar @ 2007-02-01 13:02 UTC (permalink / raw)
  To: Zach Brown
  Cc: linux-kernel, linux-aio, Suparna Bhattacharya, Benjamin LaHaise,
	Linus Torvalds


* Ingo Molnar <mingo@elte.hu> wrote:

> * Zach Brown <zach.brown@oracle.com> wrote:
> 
> > This patch introduces the notion of a 'fibril'.  It's meant to be a 
> > lighter kernel thread. [...]
> 
> as per my other email, i dont really like this concept. This is the 
> killer:

let me clarify this: i very much like your AIO patchset in general, in 
the sense that it 'completes' the AIO implementation: finally everything 
can be done via it, greatly increasing its utility and hopefully its 
penetration. This is the most important step, by far.

what i dont really like is /the particular/ concept above - the 
introduction of 'fibrils' as a hard distinction of kernel threads. They 
are /almost/ kernel threads, but still by being different they create 
a lot of duplication and miss out on a good deal of features that kernel 
threads have naturally.

It kind of hurts to say this because i'm usually quite concept-happy - 
one can easily get addicted to the introduction of new core kernel 
concepts :-) But i really, really think we dont want to do fibrils but 
we want to do kernel threads, and i havent really seen a discussion 
about why they shouldnt be done via kernel threads.

Nor have i seen a discussion that whatever threading concept we use for 
AIO within the kernel, it is really a fallback thing, not the primary 
goal of "native" KAIO design. The primary goal of KAIO design is to 
arrive at a state machine - and for one of the most important IO 
disciplines, networking, that is reality already. (For filesystem events 
i doubt we will ever be able to build an IO state machine - but there 
are lots of crazy folks out there so it's not fundamentally impossible, 
just very, very hard.)

so my suggestions center around the notion of extending kernel threads 
to support the features you find important in fibrils:

> would it be hard to redo your AIO patches based on a pool of plain 
> simple kernel threads?
> 
> We could even extend the scheduling properties of kernel threads so 
> that they could also be 'companion threads' of any given user-space 
> task. (i.e. they'd always schedule on the same CPU as that user-space 
> task)
> 
> I bet most of the real benefit would come from co-scheduling them on 
> the same CPU. But this should be a performance property, not a basic 
> design property. (And i also think that having a limited per-CPU pool 
> of AIO threads works better than having a per-user-thread pool - but 
> again this is a detail that can be easily changed, not a fundamental 
> design property.)

but i'm willing to be convinced of the opposite as well, as always. (I'm 
real good at quickly changing my mind, especially when i'm embarrassingly 
wrong about something. So please fire away and dont hold back.)

	Ingo

^ permalink raw reply	[flat|nested] 92+ messages in thread

* Re: [PATCH 2 of 4] Introduce i386 fibril scheduling
  2007-01-30 20:39 ` [PATCH 2 of 4] Introduce i386 fibril scheduling Zach Brown
@ 2007-02-01  8:36   ` Ingo Molnar
  2007-02-01 13:02     ` Ingo Molnar
  2007-02-01 20:07     ` Linus Torvalds
  2007-02-04  5:12   ` Davide Libenzi
  1 sibling, 2 replies; 92+ messages in thread
From: Ingo Molnar @ 2007-02-01  8:36 UTC (permalink / raw)
  To: Zach Brown
  Cc: linux-kernel, linux-aio, Suparna Bhattacharya, Benjamin LaHaise,
	Linus Torvalds


* Zach Brown <zach.brown@oracle.com> wrote:

> This patch introduces the notion of a 'fibril'.  It's meant to be a 
> lighter kernel thread. [...]

as per my other email, i dont really like this concept. This is the 
killer:

> [...]  There can be multiple of them in the process of executing for a 
> given task_struct, but only one can ever be actively running at a 
> time. [...]

there's almost no scheduling cost from being able to arbitrarily 
schedule a kernel thread - but there are /huge/ benefits in it.

would it be hard to redo your AIO patches based on a pool of plain 
simple kernel threads?

We could even extend the scheduling properties of kernel threads so that 
they could also be 'companion threads' of any given user-space task. 
(i.e. they'd always schedule on the same CPU as that user-space task)

I bet most of the real benefit would come from co-scheduling them on the 
same CPU. But this should be a performance property, not a basic design 
property. (And i also think that having a limited per-CPU pool of AIO 
threads works better than having a per-user-thread pool - but again this 
is a detail that can be easily changed, not a fundamental design 
property.)

	Ingo

^ permalink raw reply	[flat|nested] 92+ messages in thread

* [PATCH 2 of 4] Introduce i386 fibril scheduling
  2007-01-30 20:39 [PATCH 0 of 4] Generic AIO by scheduling stacks Zach Brown
@ 2007-01-30 20:39 ` Zach Brown
  2007-02-01  8:36   ` Ingo Molnar
  2007-02-04  5:12   ` Davide Libenzi
  0 siblings, 2 replies; 92+ messages in thread
From: Zach Brown @ 2007-01-30 20:39 UTC (permalink / raw)
  To: linux-kernel
  Cc: linux-aio, Suparna Bhattacharya, Benjamin LaHaise, Linus Torvalds

This patch introduces the notion of a 'fibril'.  It's meant to be a lighter
kernel thread.  There can be multiple of them in the process of executing for a
given task_struct, but only one can ever be actively running at a time.  Think
of it as a stack and some metadata for scheduling them inside the task_struct.

This implementation is wildly architecture-specific but isn't put in the right
places.  Since these are not code paths that I have extensive experience with,
I focused more on getting it going and being representative of the concept than on
making it right on the first try.  I'm actively interested in feedback from
people who know more about the places this touches.

The fibril struct itself is left stand-alone for clarity.  There is a 1:1
relationship between fibrils and struct thread_info, though, so it might make
more sense to embed the two somehow.

The use of list_head for the run queue is simplistic.  As long as we're not
removing specific fibrils from the list, which seems unlikely, we could be more
clever.  Maybe no more clever than a singly-linked list, though.

Fibril management is under the runqueue lock because that ends up working well
for the wake-up path as well.  In the current patch, though, it makes for some
pretty sloppy code for unlocking the runqueue lock (and re-enabling interrupts
and pre-emption) on the other side of the switch.

The actual mechanics of switching from one stack to another at the end of
schedule_fibril() makes me nervous.  I'm not convinced that blindly copying the
contents of thread_info from the previous to the next stack is safe, even if
done with interrupts disabled.  (NMIs?)  The juggling of current->thread_info
might be racy, etc.

diff -r 26e278468209 -r df7bc026d50e arch/i386/kernel/process.c
--- a/arch/i386/kernel/process.c	Mon Jan 29 15:36:13 2007 -0800
+++ b/arch/i386/kernel/process.c	Mon Jan 29 15:36:16 2007 -0800
@@ -698,6 +698,28 @@ struct task_struct fastcall * __switch_t
 	return prev_p;
 }
 
+/*
+ * We've just switched the stack and instruction pointer to point to a new
+ * fibril.  We were called from schedule() -> schedule_fibril() with the
+ * runqueue lock held _irq and with preemption disabled.
+ *
+ * We let finish_fibril_switch() unwind the state that was built up by
+ * our callers.  We do that here so that we don't need to ask fibrils to
+ * first execute something analogous to schedule_tail(). Maybe that's
+ * wrong.
+ *
+ * We'd also have to reacquire the kernel lock here.  For now we know the
+ * BUG_ON(lock_depth) prevents us from having to worry about it.
+ */
+void fastcall __switch_to_fibril(struct thread_info *ti)
+{
+	finish_fibril_switch();
+
+	/* free the ti if schedule_fibril() told us that it's done */
+	if (ti->status & TS_FREE_AFTER_SWITCH)
+		free_thread_info(ti);
+}
+
 asmlinkage int sys_fork(struct pt_regs regs)
 {
 	return do_fork(SIGCHLD, regs.esp, &regs, 0, NULL, NULL);
diff -r 26e278468209 -r df7bc026d50e include/asm-i386/system.h
--- a/include/asm-i386/system.h	Mon Jan 29 15:36:13 2007 -0800
+++ b/include/asm-i386/system.h	Mon Jan 29 15:36:16 2007 -0800
@@ -31,6 +31,31 @@ extern struct task_struct * FASTCALL(__s
 		      "=a" (last),"=S" (esi),"=D" (edi)			\
 		     :"m" (next->thread.esp),"m" (next->thread.eip),	\
 		      "2" (prev), "d" (next));				\
+} while (0)
+
+struct thread_info;
+void fastcall __switch_to_fibril(struct thread_info *ti);
+
+/*
+ * This is called with the run queue lock held _irq and with preemption
+ * disabled.  __switch_to_fibril drops those.
+ */
+#define switch_to_fibril(prev, next, ti) do {				\
+	unsigned long esi,edi;						\
+	asm volatile("pushfl\n\t"		/* Save flags */	\
+		     "pushl %%ebp\n\t"					\
+		     "movl %%esp,%0\n\t"	/* save ESP */		\
+		     "movl %4,%%esp\n\t"	/* restore ESP */	\
+		     "movl $1f,%1\n\t"		/* save EIP */		\
+		     "pushl %5\n\t"		/* restore EIP */	\
+		     "jmp __switch_to_fibril\n"				\
+		     "1:\t"						\
+		     "popl %%ebp\n\t"					\
+		     "popfl"						\
+		     :"=m" (prev->esp),"=m" (prev->eip),		\
+		      "=S" (esi),"=D" (edi)				\
+		     :"m" (next->esp),"m" (next->eip),			\
+		      "d" (prev), "a" (ti));				\
 } while (0)
 
 #define _set_base(addr,base) do { unsigned long __pr; \
diff -r 26e278468209 -r df7bc026d50e include/asm-i386/thread_info.h
--- a/include/asm-i386/thread_info.h	Mon Jan 29 15:36:13 2007 -0800
+++ b/include/asm-i386/thread_info.h	Mon Jan 29 15:36:16 2007 -0800
@@ -91,6 +91,12 @@ static inline struct thread_info *curren
 static inline struct thread_info *current_thread_info(void)
 {
 	return (struct thread_info *)(current_stack_pointer & ~(THREAD_SIZE - 1));
+}
+
+/* XXX perhaps should be integrated with task_pt_regs(task) */
+static inline struct pt_regs *thread_info_pt_regs(struct thread_info *info)
+{
+	return (struct pt_regs *)(KSTK_TOP(info)-8) - 1;
 }
 
 /* thread information allocation */
@@ -169,6 +175,7 @@ static inline struct thread_info *curren
  */
 #define TS_USEDFPU		0x0001	/* FPU was used by this task this quantum (SMP) */
 #define TS_POLLING		0x0002	/* True if in idle loop and not sleeping */
+#define TS_FREE_AFTER_SWITCH	0x0004	/* free ti in __switch_to_fibril() */
 
 #define tsk_is_polling(t) ((t)->thread_info->status & TS_POLLING)
 
diff -r 26e278468209 -r df7bc026d50e include/linux/init_task.h
--- a/include/linux/init_task.h	Mon Jan 29 15:36:13 2007 -0800
+++ b/include/linux/init_task.h	Mon Jan 29 15:36:16 2007 -0800
@@ -111,6 +111,8 @@ extern struct group_info init_groups;
 	.cpus_allowed	= CPU_MASK_ALL,					\
 	.mm		= NULL,						\
 	.active_mm	= &init_mm,					\
+	.fibril		= NULL,						\
+	.runnable_fibrils = LIST_HEAD_INIT(tsk.runnable_fibrils),	\
 	.run_list	= LIST_HEAD_INIT(tsk.run_list),			\
 	.ioprio		= 0,						\
 	.time_slice	= HZ,						\
diff -r 26e278468209 -r df7bc026d50e include/linux/sched.h
--- a/include/linux/sched.h	Mon Jan 29 15:36:13 2007 -0800
+++ b/include/linux/sched.h	Mon Jan 29 15:36:16 2007 -0800
@@ -812,6 +812,38 @@ enum sleep_type {
 
 struct prio_array;
 
+/*
+ * A 'fibril' is a very small fiber.  It's used here to mean a small thread.
+ *
+ * (Choosing a weird new name avoided yet more overloading of 'task', 'call',
+ * 'thread', 'stack', 'fib{er,re}', etc).
+ *
+ * This structure is used by the scheduler to track multiple executing stacks
+ * inside a task_struct.
+ *
+ * Only one fibril executes for a given task_struct at a time.  When it
+ * blocks, however, another fibril has the chance to execute while it sleeps.
+ * This means that call chains executing in fibrils can see concurrent
+ * current-> accesses at blocking points.  "per_call_chain()" members are
+ * switched along with the fibril, so they remain local.  Preemption *will not*
+ * trigger a fibril switch.
+ *
+ * XXX
+ *  - arch specific
+ */
+struct fibril {
+	struct list_head		run_list;
+	/* -1 unrunnable, 0 runnable, >0 stopped */
+	long				state;
+	unsigned long			eip;
+	unsigned long			esp;
+	struct thread_info		*ti;
+	struct per_call_chain_storage	per_call;
+};
+
+void sched_new_runnable_fibril(struct fibril *fibril);
+void finish_fibril_switch(void);
+
 struct task_struct {
 	volatile long state;	/* -1 unrunnable, 0 runnable, >0 stopped */
 	struct thread_info *thread_info;
@@ -857,6 +889,20 @@ struct task_struct {
 	struct list_head ptrace_list;
 
 	struct mm_struct *mm, *active_mm;
+
+	/*
+	 * The scheduler uses this to determine if the current call is a
+	 * stand-alone task or a fibril.  If it's a fibril then wake-ups
+	 * will target the fibril and a schedule() might result in swapping
+	 * in another runnable fibril.  So to start executing fibrils at all
+	 * one allocates a fibril to represent the running task and then
+	 * puts initialized runnable fibrils in the run list.
+	 *
+	 * The state members of the fibril and runnable_fibrils list are
+	 * managed under the task's run queue lock.
+	 */
+	struct fibril *fibril;
+	struct list_head runnable_fibrils;
 
 /* task state */
 	struct linux_binfmt *binfmt;
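
To make the new fields a little more concrete, here is a hypothetical sketch of how a caller could put a second fibril on a task's runnable list.  example_new_fibril() is made up and not part of the series; only sched_new_runnable_fibril() and the struct fields above are, and the eip/esp setup is waved at rather than done properly (the real code has to build a pt_regs frame and a usable starting stack).

static struct fibril *example_new_fibril(unsigned long start_eip)
{
	struct fibril *f = kzalloc(sizeof(*f), GFP_KERNEL);

	if (!f)
		return NULL;

	INIT_LIST_HEAD(&f->run_list);
	f->ti = alloc_thread_info(current);	/* a fresh stack for the fibril */
	if (!f->ti) {
		kfree(f);
		return NULL;
	}
	f->eip = start_eip;			/* where it begins executing */
	f->esp = (unsigned long)f->ti + THREAD_SIZE - sizeof(long);
	f->per_call = current->per_call;	/* inherit per-call-chain state */

	/* queue it; schedule() switches to it at the next opportunity */
	sched_new_runnable_fibril(f);
	return f;
}
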
diff -r 26e278468209 -r df7bc026d50e kernel/exit.c
--- a/kernel/exit.c	Mon Jan 29 15:36:13 2007 -0800
+++ b/kernel/exit.c	Mon Jan 29 15:36:16 2007 -0800
@@ -854,6 +854,13 @@ fastcall NORET_TYPE void do_exit(long co
 {
 	struct task_struct *tsk = current;
 	int group_dead;
+
+	/* 
+	 * XXX this is just a debug helper; this should be waiting for all
+	 * fibrils to return.  Possibly after sending them lots of -KILL
+	 * signals?
+	 */
+	BUG_ON(!list_empty(&current->runnable_fibrils));
 
 	profile_task_exit(tsk);
 
diff -r 26e278468209 -r df7bc026d50e kernel/fork.c
--- a/kernel/fork.c	Mon Jan 29 15:36:13 2007 -0800
+++ b/kernel/fork.c	Mon Jan 29 15:36:16 2007 -0800
@@ -1179,6 +1179,9 @@ static struct task_struct *copy_process(
 
 	/* for sys_ioprio_set(IOPRIO_WHO_PGRP) */
 	p->ioprio = current->ioprio;
+
+	p->fibril = NULL;
+	INIT_LIST_HEAD(&p->runnable_fibrils);
 
 	/*
 	 * The task hasn't been attached yet, so its cpus_allowed mask will
diff -r 26e278468209 -r df7bc026d50e kernel/sched.c
--- a/kernel/sched.c	Mon Jan 29 15:36:13 2007 -0800
+++ b/kernel/sched.c	Mon Jan 29 15:36:16 2007 -0800
@@ -3407,6 +3407,111 @@ static inline int interactive_sleep(enum
 }
 
 /*
+ * This unwinds the state that was built up by schedule -> schedule_fibril().
+ * The arch-specific switch_to_fibril() path calls here once the new fibril
+ * is executing.
+ */
+void finish_fibril_switch(void)
+{
+	spin_unlock_irq(&this_rq()->lock);
+	preempt_enable_no_resched();
+}
+
+/*
+ * Add a new fibril to the runnable list.  It'll be switched to next time
+ * the caller comes through schedule().
+ */
+void sched_new_runnable_fibril(struct fibril *fibril)
+{
+	struct task_struct *tsk = current;
+	unsigned long flags;
+	struct rq *rq = task_rq_lock(tsk, &flags);
+
+	fibril->state = TASK_RUNNING;
+	BUG_ON(!list_empty(&fibril->run_list));
+	list_add_tail(&fibril->run_list, &tsk->runnable_fibrils);
+
+	task_rq_unlock(rq, &flags);
+}
+
+/*
+ * This is called from schedule() when we're not being preempted and there is a
+ * fibril waiting in current->runnable_fibrils.
+ *
+ * This is called under the run queue lock to serialize fibril->state and the
+ * runnable_fibrils list with wake-up.  We drop it before switching and the
+ * return path takes that into account.
+ *
+ * We always switch so that a caller can specifically make a single pass
+ * through the runnable fibrils by marking itself _RUNNING and calling
+ * schedule().
+ */
+void schedule_fibril(struct task_struct *tsk)
+{
+	struct thread_info *ti = task_thread_info(tsk);
+	struct fibril *prev;
+	struct fibril *next;
+	struct fibril dummy;
+
+	/*
+	 * XXX We don't deal with the kernel lock yet.  It'd need to be audited
+	 * and lock_depth moved under per_call_chain().
+	 */
+	BUG_ON(tsk->lock_depth >= 0);
+
+	next = list_entry(current->runnable_fibrils.next, struct fibril,
+			 run_list);
+	list_del_init(&next->run_list);
+	BUG_ON(next->state != TASK_RUNNING);
+
+	prev = tsk->fibril;
+	if (prev) {
+		prev->state = tsk->state;
+		prev->per_call = current->per_call;
+		/*
+		 * This catches the case where the caller wants to make a pass
+		 * through runnable fibrils by marking itself _RUNNING and
+		 * calling schedule().  A fibril should not be able to be on
+		 * both tsk->fibril and the runnable_list.
+		 */
+		if (prev->state == TASK_RUNNING) {
+			BUG_ON(!list_empty(&prev->run_list));
+			list_add_tail(&prev->run_list,
+				      &current->runnable_fibrils);
+		}
+	} else {
+		/*
+		 * To free a fibril the calling path can free the structure
+		 * itself, set current->fibril to NULL, and call schedule().
+		 * That causes us to tell __switch_to_fibril() to free the ti
+		 * associated with the fibril once we've switched away from it.
+		 * The dummy is just used to give switch_to_fibril() something
+		 * to save state in to.
+		 */
+		prev = &dummy;
+	}
+
+	/* 
+	 * XXX The idea is to copy all but the actual call stack.  Obviously
+	 * this is wildly arch-specific and belongs abstracted out.
+	 */
+	*next->ti = *ti;
+	*thread_info_pt_regs(next->ti) = *thread_info_pt_regs(ti);
+
+	current->thread_info = next->ti;
+	current->thread.esp0 = (unsigned long)(thread_info_pt_regs(next->ti) + 1);
+	current->fibril = next;
+	current->state = next->state;
+	current->per_call = next->per_call;
+
+	if (prev == &dummy)
+		ti->status |= TS_FREE_AFTER_SWITCH;
+
+	/* __switch_to_fibril() drops the runqueue lock and enables preempt */
+	switch_to_fibril(prev, next, ti);
+}
+
+/*
  * schedule() is the main scheduler function.
  */
 asmlinkage void __sched schedule(void)
@@ -3468,6 +3573,22 @@ need_resched_nonpreemptible:
 	run_time /= (CURRENT_BONUS(prev) ? : 1);
 
 	spin_lock_irq(&rq->lock);
+
+	/* always switch to a runnable fibril if we aren't being preempted */
+	if (unlikely(!(preempt_count() & PREEMPT_ACTIVE) &&
+		     !list_empty(&prev->runnable_fibrils))) {
+		schedule_fibril(prev);
+		/* 
+		 * finish_fibril_switch() drops the rq lock and enables
+		 * preemption, but the popfl disables interrupts again.  Watch
+		 * me learn how context switch locking works before your very
+		 * eyes!  XXX This will need to be fixed up by throwing
+		 * together something like the prepare_lock_switch() path the
+		 * scheduler does.  Guidance appreciated!
+		 */
+		local_irq_enable();
+		return;
+	}
 
 	switch_count = &prev->nivcsw;
 	if (prev->state && !(preempt_count() & PREEMPT_ACTIVE)) {
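
The schedule_fibril() comment above spells out the intended idiom for giving queued fibrils a turn: stay _RUNNING and call schedule().  Because the hook above always switches when runnable_fibrils is non-empty (and we are not being preempted), the caller is put back on the tail of the list and resumes after the other fibrils have each had a chance to run or block.  A minimal sketch of that idiom, with a made-up name:

static void example_run_queued_fibrils(void)
{
	/* we stay runnable, so schedule_fibril() re-queues us and
	 * picks the next fibril off current->runnable_fibrils */
	__set_current_state(TASK_RUNNING);
	schedule();
}
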

^ permalink raw reply	[flat|nested] 92+ messages in thread

end of thread, other threads:[~2007-02-07  9:37 UTC | newest]

Thread overview: 92+ messages
2007-02-06 13:43 [PATCH 2 of 4] Introduce i386 fibril scheduling Al Boldi
  -- strict thread matches above, loose matches on Subject: below --
2007-02-03 14:05 linux
2007-01-30 20:39 [PATCH 0 of 4] Generic AIO by scheduling stacks Zach Brown
2007-01-30 20:39 ` [PATCH 2 of 4] Introduce i386 fibril scheduling Zach Brown
2007-02-01  8:36   ` Ingo Molnar
2007-02-01 13:02     ` Ingo Molnar
2007-02-01 13:19       ` Christoph Hellwig
2007-02-01 13:52         ` Ingo Molnar
2007-02-01 17:13           ` Mark Lord
2007-02-01 18:02             ` Ingo Molnar
2007-02-02 13:23         ` Andi Kleen
2007-02-01 21:52       ` Zach Brown
2007-02-01 22:23         ` Benjamin LaHaise
2007-02-01 22:37           ` Zach Brown
2007-02-02 13:22       ` Andi Kleen
2007-02-01 20:07     ` Linus Torvalds
2007-02-02 10:49       ` Ingo Molnar
2007-02-02 15:56         ` Linus Torvalds
2007-02-02 19:59           ` Alan
2007-02-02 20:14             ` Linus Torvalds
2007-02-02 20:58               ` Davide Libenzi
2007-02-02 21:09                 ` Linus Torvalds
2007-02-02 21:30               ` Alan
2007-02-02 21:30                 ` Linus Torvalds
2007-02-02 22:42                   ` Ingo Molnar
2007-02-02 23:01                     ` Linus Torvalds
2007-02-02 23:17                       ` Linus Torvalds
2007-02-03  0:04                         ` Alan
2007-02-03  0:23                         ` bert hubert
2007-02-02 22:48                   ` Alan
2007-02-05 16:44             ` Zach Brown
2007-02-02 22:21           ` Ingo Molnar
2007-02-02 22:49             ` Linus Torvalds
2007-02-02 23:55               ` Ingo Molnar
2007-02-03  0:56                 ` Linus Torvalds
2007-02-03  7:15                   ` Suparna Bhattacharya
2007-02-03  8:23                   ` Ingo Molnar
2007-02-03  9:25                     ` Matt Mackall
2007-02-03 10:03                       ` Ingo Molnar
2007-02-05 17:44                     ` Zach Brown
2007-02-05 19:26                       ` Davide Libenzi
2007-02-05 19:41                         ` Zach Brown
2007-02-05 20:10                           ` Davide Libenzi
2007-02-05 20:21                             ` Zach Brown
2007-02-05 20:42                               ` Linus Torvalds
2007-02-05 20:39                             ` Linus Torvalds
2007-02-05 21:09                               ` Davide Libenzi
2007-02-05 21:31                                 ` Kent Overstreet
2007-02-06 20:25                                   ` Davide Libenzi
2007-02-06 20:46                                   ` Linus Torvalds
2007-02-06 21:16                                     ` David Miller
2007-02-06 21:28                                       ` Linus Torvalds
2007-02-06 21:31                                         ` David Miller
2007-02-06 21:46                                           ` Eric Dumazet
2007-02-06 21:50                                           ` Linus Torvalds
2007-02-06 22:28                                             ` Zach Brown
2007-02-06 22:45                                     ` Kent Overstreet
2007-02-06 23:04                                       ` Linus Torvalds
2007-02-07  1:22                                         ` Kent Overstreet
2007-02-06 23:23                                       ` Davide Libenzi
2007-02-06 23:39                                         ` Joel Becker
2007-02-06 23:56                                           ` Davide Libenzi
2007-02-07  0:06                                             ` Joel Becker
2007-02-07  0:23                                               ` Davide Libenzi
2007-02-07  0:44                                                 ` Joel Becker
2007-02-07  1:15                                                   ` Davide Libenzi
2007-02-07  1:24                                                     ` Kent Overstreet
2007-02-07  1:30                                                     ` Joel Becker
2007-02-07  6:16                                                   ` Michael K. Edwards
2007-02-07  9:17                                                     ` Michael K. Edwards
2007-02-07  9:37                                                       ` Michael K. Edwards
2007-02-06  0:32                                 ` Davide Libenzi
2007-02-05 21:21                               ` Zach Brown
2007-02-02 23:37             ` Davide Libenzi
2007-02-03  0:02               ` Davide Libenzi
2007-02-05 17:12               ` Zach Brown
2007-02-05 18:24                 ` Davide Libenzi
2007-02-05 21:44                   ` David Miller
2007-02-06  0:15                     ` Davide Libenzi
2007-02-05 21:36               ` bert hubert
2007-02-05 21:57                 ` Linus Torvalds
2007-02-05 22:07                   ` bert hubert
2007-02-05 22:15                     ` Zach Brown
2007-02-05 22:34                   ` Davide Libenzi
2007-02-06  0:27                   ` Scot McKinley
2007-02-06  0:48                     ` David Miller
2007-02-06  0:48                     ` Joel Becker
2007-02-05 17:02             ` Zach Brown
2007-02-05 18:52               ` Davide Libenzi
2007-02-05 19:20                 ` Zach Brown
2007-02-05 19:38                   ` Davide Libenzi
2007-02-04  5:12   ` Davide Libenzi
2007-02-05 17:54     ` Zach Brown
