From: Ingo Molnar <mingo@elte.hu>
To: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Zach Brown <zach.brown@oracle.com>,
linux-kernel@vger.kernel.org, linux-aio@kvack.org,
Suparna Bhattacharya <suparna@in.ibm.com>,
Benjamin LaHaise <bcrl@kvack.org>
Subject: Re: [PATCH 2 of 4] Introduce i386 fibril scheduling
Date: Sat, 3 Feb 2007 09:23:08 +0100
Message-ID: <20070203082308.GA6748@elte.hu>
In-Reply-To: <Pine.LNX.4.64.0702021636410.15057@woody.linux-foundation.org>
* Linus Torvalds <torvalds@linux-foundation.org> wrote:
> On Sat, 3 Feb 2007, Ingo Molnar wrote:
> >
> > Well, in my picture, 'only if you block' is a pure thread
> > utilization decision: bounce a piece of work to another thread if
> > this thread cannot complete it. (if the kernel is lucky enough that
> > the user context told it "it's fine to do that".)
>
> Sure, you can do it that way too. But at that point, your argument
> that we shouldn't do it with fibrils is wrong: you'd still need
> basically the exact same setup that Zach does in his fibril stuff, and
> the exact same hook in the scheduler, testing the exact same value
> ("do we have a pending queue of work").
did i ever lose a single word of complaint about those bits? Those are
not an issue to me. They can be applied to kernel threads just as much.
As i babbled in the very first email about this topic:
| 1) improve our basic #1 design gradually. If something is a
| bottleneck, if the scheduler has grown too fat, cut some slack. If
| micro-threads or fibrils offer anything nice for our basic thread
| model: integrate it into the kernel.
i should have said explicitly that to flip user-space from one kernel
thread to another one (upon blocking or per request) is a nice thing and
we should integrate that into the kernel's thread model.
But really, being a scheduler guy i was much more concerned about the
duplication and problems caused by the fibril concept itself - which
duplication and complexity makes up 80% of Zach's submitted patchset.
For example this bit:
[PATCH 3 of 4] Teach paths to wake a specific void * target
would totally go away if we used kernel threads for this. In the fibril
approach this is where the mess starts. Either a 'normal' wakeup has to
wake up all fibrils, or we have to make damn sure that a wakeup that in
reality goes to a fibril is never woken via wake_up/wake_up_process.
( Furthermore, i tried to include user-space micro-threads in the
argument as well, which Evgeniy Polyakov raised not so long ago related
to the kevent patchset. All these micro-thread things are of a similar
genre. )
i totally agree that the API /should/ be the main focus - but i didnt
pick the topic and most of the patchset's current size is due to the IMO
avoidable fibril concept.
regarding the API, i dont really agree with the current form and design
of Zach's interface.
fundamentally, the basic entity of this thing should be a /system call/,
not the artificial fibril thing:
+struct asys_call {
+	struct asys_result *result;
+	struct fibril fibril;
+};
i.e. the basic entity should be something that represents a system call,
with its up to 6 arguments, the later return code, state, flags and two
list entries:
struct async_syscall {
	unsigned long nr;
	unsigned long args[6];
	long err;
	unsigned long state;
	unsigned long flags;
	struct list_head list;
	struct list_head wait_list;
	unsigned long __pad[2];
};
(64 bytes on 32-bit, 128 bytes on 64-bit)
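The size figures can be checked with a small user-space sketch. It assumes a Linux-style ABI where a pointer and an unsigned long are both one machine word (ILP32 or LP64); the struct list_head below is a two-pointer stand-in for the kernel type, and async_syscall_words() is a hypothetical helper that exists only for this illustration:

```c
#include <assert.h>
#include <stddef.h>

/* User-space stand-in for the kernel's struct list_head (two pointers). */
struct list_head {
	struct list_head *next, *prev;
};

/* Layout as proposed in the mail, copied field by field. */
struct async_syscall {
	unsigned long nr;		/* syscall number */
	unsigned long args[6];		/* up to 6 arguments */
	long err;			/* return code, filled in later */
	unsigned long state;
	unsigned long flags;
	struct list_head list;		/* 2 words */
	struct list_head wait_list;	/* 2 words */
	unsigned long __pad[2];		/* pads the entry to 16 words */
};

/*
 * 1 + 6 + 1 + 1 + 1 + 2 + 2 + 2 = 16 machine words, i.e. 64 bytes with
 * 4-byte words and 128 bytes with 8-byte words. This assertion only
 * holds under the one-word-pointer, one-word-long assumption above.
 */
_Static_assert(sizeof(struct async_syscall) == 16 * sizeof(unsigned long),
	       "async_syscall should be exactly 16 machine words");

/* Hypothetical helper: entry size expressed in machine words. */
static size_t async_syscall_words(void)
{
	return sizeof(struct async_syscall) / sizeof(unsigned long);
}
```

A constant, power-of-two entry size is what makes indexing into a fixed ring buffer of such entries cheap.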
furthermore, i think this API should be fundamentally vectored and
fundamentally async, and hence could solve another issue as well:
submitting many little pieces of work of different IO domains in one go.
[ detail: no traditional signals should be used at all (Zach's
stuff doesnt use them, and correctly so), except when the async syscall
that is performed itself generates a signal. ]
The normal and most optimal workflow should be a user-space ring-buffer
of these constant-size struct async_syscall entries:
struct async_syscall ringbuffer[1024];
LIST_HEAD(submitted);
LIST_HEAD(pending);
LIST_HEAD(completed);
the 3 list heads are known to both the kernel and user-space, and are
actively managed by both. The kernel drives the execution of the async
system calls based on the 'submitted' list head (until it empties it)
and moves them over to the 'pending' list. User-space can complete async
syscalls based on the 'completed' list. (but a syscall can optionally be
marked as 'autocomplete' as well via the 'flags' field, in that case
it's not moved to the 'completed' list but simply removed from the
'pending' list. This can be useful for system calls that have some
implicit notification effect.)
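The movement of entries between the three lists can be sketched in plain user-space C. This is a single-threaded simulation under stated assumptions, not the proposed kernel interface: the list helpers are minimal re-implementations of the kernel's list primitives, and ASYNC_AUTOCOMPLETE, drain_submitted() and finish_syscall() are hypothetical names invented for the illustration. A real shared kernel/user-space list would also need synchronization, which is left out here.

```c
#include <assert.h>
#include <stddef.h>

struct list_head { struct list_head *next, *prev; };

struct async_syscall {
	unsigned long nr;
	unsigned long args[6];
	long err;
	unsigned long state;
	unsigned long flags;
	struct list_head list;
	struct list_head wait_list;
	unsigned long __pad[2];
};

#define ASYNC_AUTOCOMPLETE 0x1UL	/* hypothetical flag bit */

static void list_init(struct list_head *h) { h->next = h->prev = h; }
static int list_empty(const struct list_head *h) { return h->next == h; }

/* Unlink an entry and leave it pointing at itself. */
static void list_del(struct list_head *e)
{
	e->prev->next = e->next;
	e->next->prev = e->prev;
	e->next = e->prev = e;
}

static void list_add_tail(struct list_head *e, struct list_head *h)
{
	e->prev = h->prev;
	e->next = h;
	h->prev->next = e;
	h->prev = e;
}

static void list_move_tail(struct list_head *e, struct list_head *h)
{
	list_del(e);
	list_add_tail(e, h);
}

/* "Kernel" side: start every submitted entry by moving it to 'pending'. */
static unsigned int drain_submitted(struct list_head *submitted,
				    struct list_head *pending)
{
	unsigned int n = 0;

	while (!list_empty(submitted)) {
		list_move_tail(submitted->next, pending);
		n++;
	}
	return n;
}

/*
 * "Kernel" side: a pending syscall has finished. Autocomplete entries
 * are just dropped from 'pending'; all others go to 'completed'.
 */
static void finish_syscall(struct async_syscall *call, long err,
			   struct list_head *completed)
{
	call->err = err;
	if (call->flags & ASYNC_AUTOCOMPLETE)
		list_del(&call->list);
	else
		list_move_tail(&call->list, completed);
}
```

User-space would queue an entry with list_add_tail(&call->list, &submitted); after the drain and finish steps, only entries without the autocomplete flag show up on 'completed'.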
( Note: optionally, a helper kernel-thread, when it finishes processing
a syscall, could also asynchronously check the 'submitted' list and
pick up new work. That would allow the submission of new syscalls
without any entry into the kernel. So for example on an SMT system,
this could in essence result in one CPU running in pure user-space,
submitting async syscalls via the ringbuffer, while another CPU would
in essence be running pure kernel-space, executing those entries. )
another crucial bit is the waiting on pending work. But because every
pending syscall entity is either already completed or has a real kernel
thread associated with it, that bit is mostly trivial: user-space can
wait on 'any' pending syscall to complete, or it could wait for a
specific list of syscalls to complete (using the ->wait_list). It could
also wait on 'a minimum number of N syscalls to complete' - to create
batching of execution. And of course it can periodically check the
'completed' list head if it has a constant and highly parallel flow of
workload - that way the 'waiting' does not actually have to happen most
of the time.
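The 'periodically check the completed list' case, combined with a minimum batch size of N, might look roughly like the following user-space sketch. It is non-blocking by design: where a real implementation would sleep until N syscalls have completed, this version simply returns 0. The names count_completed() and reap_completed() are hypothetical, the list code is a minimal stand-in for the kernel's, and no locking against a concurrent kernel-side writer is shown.

```c
#include <assert.h>
#include <stddef.h>

struct list_head { struct list_head *next, *prev; };

struct async_syscall {
	unsigned long nr;
	unsigned long args[6];
	long err;
	unsigned long state;
	unsigned long flags;
	struct list_head list;
	struct list_head wait_list;
	unsigned long __pad[2];
};

/* Map a 'list' member pointer back to its containing entry. */
#define call_of(ptr) \
	((struct async_syscall *)((char *)(ptr) - \
				  offsetof(struct async_syscall, list)))

static void list_add_tail(struct list_head *e, struct list_head *h)
{
	e->prev = h->prev;
	e->next = h;
	h->prev->next = e;
	h->prev = e;
}

/* Walk the 'completed' list and count its entries. */
static unsigned int count_completed(const struct list_head *completed)
{
	const struct list_head *pos;
	unsigned int n = 0;

	for (pos = completed->next; pos != completed; pos = pos->next)
		n++;
	return n;
}

/*
 * Reap up to max_nr completed entries into out[], oldest first, but only
 * if at least min_nr are available. Returns the number reaped, or 0 when
 * the batch threshold has not been reached yet.
 */
static unsigned int reap_completed(struct list_head *completed,
				   unsigned int min_nr,
				   struct async_syscall **out,
				   unsigned int max_nr)
{
	unsigned int n = 0;

	if (count_completed(completed) < min_nr)
		return 0;	/* a blocking variant would sleep here */

	while (n < max_nr && completed->next != completed) {
		struct list_head *e = completed->next;

		/* Unlink the oldest completed entry. */
		e->prev->next = e->next;
		e->next->prev = e->prev;
		e->next = e->prev = e;
		out[n++] = call_of(e);
	}
	return n;
}
```

With min_nr set to 1 this degenerates into waiting on 'any' syscall; a larger min_nr gives the batching behaviour, at the cost of added latency for the first completions.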
Looks like we can hit many birds with this single stone: AIO, vectored
syscalls, fine-grained system-call parallelism. Hm?
Ingo