From: Davide Libenzi <davidel@xmailserver.org>
To: Zach Brown <zach.brown@oracle.com>
Cc: Linux Kernel Mailing List <linux-kernel@vger.kernel.org>,
linux-aio@kvack.org, Suparna Bhattacharya <suparna@in.ibm.com>,
Benjamin LaHaise <bcrl@kvack.org>,
Linus Torvalds <torvalds@linux-foundation.org>
Subject: Re: [PATCH 0 of 4] Generic AIO by scheduling stacks
Date: Sat, 3 Feb 2007 21:13:00 -0800 (PST)
Message-ID: <Pine.LNX.4.64.0702031539370.9054@alien.or.mcafeemobile.com>
In-Reply-To: <patchbomb.1170193181@tetsuo.zabbo.net>
On Tue, 30 Jan 2007, Zach Brown wrote:
> This very rough patch series introduces a different way to provide AIO support
> for system calls.
Zab, great stuff!
I've found a little time to take a look at the patches and throw some
comments at you.
Keep in mind though, that the last time I seriously looked at this stuff,
sched.c was like 2K lines of code, and now it's 7K. Enough said ;)
> Right now to provide AIO support for a system call you have to express your
> interface in the iocb argument struct for sys_io_submit(), teach fs/aio.c to
> translate this into some call path in the kernel that passes in an iocb, and
> then update your code path implementation with either completion-based (EIOCBQUEUED)
> or retry-based (EIOCBRETRY) AIO with the iocb.
>
> This patch series changes this by moving the complexity into generic code such
> that a system call handler would provide AIO support in exactly the same way
> that it supports a synchronous call. It does this by letting a task have
> multiple stacks executing system calls in the kernel. Stacks are switched in
> schedule() as they block and are made runnable.
As I said in another email, I think this is a *great* idea. It allows for
generic async execution, while leaving the core kernel AIO-unaware. Of
course, to be usable, a lot of details need to be polished, and
performance evaluated.
> We start in sys_asys_submit(). It allocates a fibril for this executing
> submission syscall and hangs it off the task_struct. This lets this submission
> fibril be scheduled along with the later asys system calls themselves.
Why do you need this "special" fibril for the submission task?
> The specific switching mechanics of this implementation rely on the notion of
> tracking a stack as a full thread_info pointer. To make the switch we transfer
> the non-stack bits of the thread_info from the old fibril's ti to the new
> fibril's ti. We update the book keeping in the task_struct to
> consider the new thread_info as the current thread_info for the task. Like so:
>
> *next->ti = *ti;
> *thread_info_pt_regs(next->ti) = *thread_info_pt_regs(ti);
>
> current->thread_info = next->ti;
> current->thread.esp0 = (unsigned long)(thread_info_pt_regs(next->ti) + 1);
> current->fibril = next;
> current->state = next->state;
> current->per_call = next->per_call;
>
> Yeah, messy. I'm interested in aggressive feedback on how to do this sanely.
> Especially from the perspective of worrying about all the archs.
Maybe an arch-specific copy_thread_info()? Or, since there's a 1:1
relationship, just merge the two structures.
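To make that concrete, here's a rough userspace sketch of what an arch-specific copy_thread_info() could look like. The field names (task, flags, cpu, stack_base) are made up for illustration, not the real i386 thread_info layout:

```c
#include <assert.h>

/* Illustrative stand-in for thread_info: bookkeeping fields that should
 * follow the fibril, plus a stack area that must stay put. */
struct mock_thread_info {
	void *task;		/* non-stack bookkeeping: follows the fibril */
	unsigned long flags;
	int cpu;
	char stack_base[64];	/* stack contents: stays with its own fibril */
};

/* Copy only the non-stack bookkeeping bits, leaving the destination's
 * stack area untouched -- unlike the patch's "*next->ti = *ti", which
 * copies everything and then fixes things up afterwards. */
static void copy_thread_info(struct mock_thread_info *dst,
			     const struct mock_thread_info *src)
{
	dst->task  = src->task;
	dst->flags = src->flags;
	dst->cpu   = src->cpu;
	/* stack_base deliberately not copied */
}
```

An arch would provide its own version of this, so the generic switch code never has to know which thread_info fields are stack-related.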
> Did everyone catch that "per_call" thing there? That's to switch members of
> task_struct which are local to a specific call. link_count, journal_info, that
> sort of thing. More on that as we talk about the costs later.
Yes ;) Brutally copying the structure over does not look good IMO. Better
to keep a pointer and swap them. A clone_per_call() and free_per_call()
might be needed.
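A minimal sketch of that pointer-based scheme, with hypothetical clone_per_call()/free_per_call() helpers (the struct members here are just examples of the link_count/journal_info sort of thing):

```c
#include <stdlib.h>
#include <string.h>

/* Per-syscall-chain state that today lives directly in task_struct. */
struct per_call_state {
	int link_count;
	void *journal_info;
};

/* Allocate a fibril's per-call state, optionally inheriting from an
 * existing one; returns NULL on allocation failure. */
struct per_call_state *clone_per_call(const struct per_call_state *src)
{
	struct per_call_state *pc = malloc(sizeof(*pc));
	if (!pc)
		return NULL;
	if (src)
		memcpy(pc, src, sizeof(*pc));
	else
		memset(pc, 0, sizeof(*pc));
	return pc;
}

void free_per_call(struct per_call_state *pc)
{
	free(pc);
}
```

Switching fibrils then becomes a single pointer assignment (current->per_call = next->per_call;) instead of a structure copy.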
> Eventually the timer fires and the hrtimer code path wakes the fibril:
>
> - if (task)
> - wake_up_process(task);
> + if (wake_target)
> + wake_up_target(wake_target);
>
> We've doctored try_to_wake_up() to be able to tell if its argument is a
> task_struct or one of these fibril targets. In the fibril case it calls
> try_to_wake_up_fibril(). It notices that the target fibril does need to be
> woken and sets it TASK_RUNNING. It notices that it's current in the task so
> it puts the fibril on the task's fibril run queue and wakes the task. There's
> grossness here. It needs the task to come through schedule() again so that it
> can find the new runnable fibril instead of continuing to execute its current
> fibril. To this end, wake-up marks the task's current ti TIF_NEED_RESCHED.
Fine IMO. Better keep scheduling code localized inside schedule().
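One plausible way (an assumption on my side, not necessarily what the patch does) for try_to_wake_up() to tell a task_struct from a fibril target through a single void * argument is to tag the low bit of the aligned pointer:

```c
#include <stdint.h>

/* Hypothetical sketch: task_struct and fibril pointers are both at
 * least 2-byte aligned, so the low bit is free to act as a type tag. */
#define FIBRIL_TAG 0x1UL

static inline void *make_fibril_target(void *fibril)
{
	return (void *)((uintptr_t)fibril | FIBRIL_TAG);
}

static inline int target_is_fibril(void *target)
{
	return ((uintptr_t)target & FIBRIL_TAG) != 0;
}

static inline void *target_to_fibril(void *target)
{
	/* Strip the tag to recover the original fibril pointer. */
	return (void *)((uintptr_t)target & ~FIBRIL_TAG);
}
```

With something like this, wake_up_target() can branch to try_to_wake_up_fibril() without changing the signature of every wake-up path.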
> - We get AIO support for all syscalls. Every single one. (Well, please, no
> asys sys_exit() :)). Buffered IO, sendfile, recvmsg, poll, epoll, hardware
> crypto ioctls, open, mmap, getdents, the entire splice API, etc.
Eeek, ... poll, epoll :)
That might solve the async() <-> POSIX bridge the other way around: the
collector would become the async() events fetcher, rather than the
reverse. It will work just fine ...
> - We wouldn't multiply testing and maintenance burden with separate AIO paths.
> No is_sync_kiocb() testing and divergence between returning or calling
> aio_complete(). No auditing to make sure that EIOCBRETRY is only returned
> after any significant references of current->. No worries about completion
> racing from the submission return path and some aio_complete() being called
> from another context. In this scheme if your sync syscall path isn't broken,
> your AIO path stands a great chance of working.
This is the *big win* of the whole thing IMO.
> - AIO syscalls which *don't* block see very little overhead. They'll allocate
> stacks and juggle the run queue locks a little, but they'll execute in turn on
> the submitting (cache-hot, presumably) processor. There's room to optimize
> this path, too, of course.
Stack allocation can be optimized/cached, as someone else already said.
> - The 800lb elephant in the room. It uses a full stack per blocked operation.
> I believe this is a reasonable price to pay for the flexibility of having *any*
> call pending. It rules out some loads which would want to keep *millions* of
> operations pending, but I humbly submit that a load rarely takes that number of
> concurrent ops to saturate a resource. (think of it this way: we've gotten
> this far by having to burn a full *task* to have *any* syscall pending.) While
> not optimal, it opens the door to a lot of functionality without having to
> rewrite the kernel as a giant non-blocking state machine.
This should not be a huge problem IMO. High latency operations like
network sockets can be handled with standard I/O events interfaces like
poll/epoll, so I do not expect to have a huge number of fibrils around.
The number of fibrils will be proportional to the number of active
connections, not to the total number of connections.
> It should be noted that my very first try was to copy the used part of stacks
> in to and out of one full allocated stack. This uses less memory per blocking
> operation at the cpu cost of copying the used regions. And it's a terrible
> idea, for at least two reasons. First, to actually get the memory overhead
> savings you allocate at stack switch time. If that allocation can't be
> satisfied you are in *trouble* because you might not be able to switch over to
> a fibril that is trying to free up memory. Deadlock city. Second, it means
> that *you can't reference on-stack data in the wake-up path*. This is a
> nightmare. Even our trivial sys_nanosleep() example would have had to take its
> hrtimer_sleeper off the stack and allocate it. Never mind, you know, basically
> every single user of <linux/wait.h>. My current thinking is that it's just
> not worth it.
Agreed. Most definitely not worth it, for the reasons above.
> - We would now have some measure of task_struct concurrency. Read that twice,
> it's scary. As two fibrils execute and block in turn they'll each be
> referencing current->. It means that we need to audit task_struct to make sure
> that paths can handle racing as it's scheduled away. The current implementation
> *does not* let preemption trigger a fibril switch. So one only has to worry
> about racing with voluntary scheduling of the fibril paths. This can mean
> moving some task_struct members under an accessor that hides them in a struct
> in task_struct so they're switched along with the fibril. I think this is a
> manageable burden.
That seems the correct policy in any case.
> - Signals. I have no idea what behaviour we want. Help? My first guess is
> that we'll want signal state to be shared by fibrils by keeping it in the
> task_struct. If we want something like individual cancellation, we'll augment
> signal_pending() with some per-fibril test which will cause it to return
> from TASK_INTERRUPTIBLE (the only reasonable way to implement generic
> cancellation, I'll argue) as it would have if a signal was pending.
Fibril should IMO use current thread signal policies. I think a signal
should hit (wake) any TASK_INTERRUPTIBLE fibril, if the current thread
policies mandate that. I'd keep a list_head of currently scheduled-out
TASK_INTERRUPTIBLE fibrils, and I'd make them runnable when a signal is
delivered to the thread (wake_target bit #1 set to mean wake-all-interruptible-fibrils?).
The other thing is signal_pending(). The sigpending flag test is not going
to work as is (cleared at the first do_signal). Setting a bit in each
fibril would mean walking the whole TASK_INTERRUPTIBLE fibril list. Maybe
a sequential signal counter in task_struct, matched by one in the fibril.
A signal would increment the task_struct counter, and a fibril
schedule-out would save the task_struct counter to the fibril. The
signal_pending() for a fibril is a compare of the two. Or something
similar.
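A toy model of that counter scheme, with made-up names, just to show the compare:

```c
/* Sketch of the sequence-counter idea: the task keeps a counter bumped
 * on every signal delivery, each fibril snapshots it when it schedules
 * out, and signal_pending() for a fibril is a simple compare. */
struct mock_task { unsigned long sig_seq; };
struct mock_fibril { unsigned long sig_seq_seen; };

static void deliver_signal(struct mock_task *t)
{
	t->sig_seq++;			/* one increment per delivered signal */
}

static void fibril_schedule_out(struct mock_task *t, struct mock_fibril *f)
{
	f->sig_seq_seen = t->sig_seq;	/* snapshot at schedule-out */
}

static int fibril_signal_pending(const struct mock_task *t,
				 const struct mock_fibril *f)
{
	return t->sig_seq != f->sig_seq_seen;
}
```

This avoids both the shared sigpending flag (cleared at the first do_signal) and any walk of the TASK_INTERRUPTIBLE fibril list on delivery.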
In general, I think it'd make sense to have a fibril-based implementation
and a kthread-based one, and compare the messiness :) of the two with
respect to complexity and performance.
- Davide