linux-kernel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
From: Davide Libenzi <davidel@xmailserver.org>
To: Zach Brown <zach.brown@oracle.com>
Cc: Linux Kernel Mailing List <linux-kernel@vger.kernel.org>,
	linux-aio@kvack.org, Suparna Bhattacharya <suparna@in.ibm.com>,
	Benjamin LaHaise <bcrl@kvack.org>,
	Linus Torvalds <torvalds@linux-foundation.org>
Subject: Re: [PATCH 0 of 4] Generic AIO by scheduling stacks
Date: Sat, 3 Feb 2007 21:13:00 -0800 (PST)	[thread overview]
Message-ID: <Pine.LNX.4.64.0702031539370.9054@alien.or.mcafeemobile.com> (raw)
In-Reply-To: <patchbomb.1170193181@tetsuo.zabbo.net>

On Tue, 30 Jan 2007, Zach Brown wrote:

> This very rough patch series introduces a different way to provide AIO support
> for system calls.

Zab, great stuff!
I've found a little time to take a look at the patches and throw some 
comments at you.
Keep in mind though, that the last time I seriously looked at this stuff, 
sched.c was like 2K lines of code, and now it's 7K. Enough said ;)




> Right now to provide AIO support for a system call you have to express your
> interface in the iocb argument struct for sys_io_submit(), teach fs/aio.c to
> translate this into some call path in the kernel that passes in an iocb, and
> then update your code path implement with either completion-based (EIOCBQUEUED)
> or retry-based (EIOCBRETRY) AIO with the iocb.
> 
> This patch series changes this by moving the complexity into generic code such
> that a system call handler would provide AIO support in exactly the same way
> that it supports a synchronous call.  It does this by letting a task have
> multiple stacks executing system calls in the kernel.  Stacks are switched in
> schedule() as they block and are made runnable.

As I said in another email, I think this is a *great* idea. It allows for 
generic async execution, while leaving the core kernel AIO unware. Of 
course, ot be usable, a lot of details needs to be polished, and 
performance evaluated.




> We start in sys_asys_submit().  It allocates a fibril for this executing
> submission syscall and hangs it off the task_struct.  This lets this submission
> fibril be scheduled along with the later asys system calls themselves.

Why do you need this "special" fibril for the submission task?



> The specific switching mechanics of this implementation rely on the notion of
> tracking a stack as a full thread_info pointer.  To make the switch we transfer
> the non-stack bits of the thread_info from the old fibril's ti to the new
> fibril's ti.  We update the book keeping in the task_struct to 
> consider the new thread_info as the current thread_info for the task.  Like so:
> 
>         *next->ti = *ti;
>         *thread_info_pt_regs(next->ti) = *thread_info_pt_regs(ti);
> 
>         current->thread_info = next->ti;
>         current->thread.esp0 = (unsigned long)(thread_info_pt_regs(next->ti) + 1);
>         current->fibril = next;
>         current->state = next->state;
>         current->per_call = next->per_call;
> 
> Yeah, messy.  I'm interested in aggressive feedback on how to do this sanely.
> Especially from the perspective of worrying about all the archs.

Maybe an arch-specific copy_thread_info()? Or, since there's a 1:1 
relationship, just merging them.



> Did everyone catch that "per_call" thing there?  That's to switch members of
> task_struct which are local to a specific call.  link_count, journal_info, that
> sort of thing.  More on that as we talk about the costs later.

Yes ;) Brutally copying the structure over does not look good IMO. Better 
keep a pointer and swapping them. A clone_per_call() and free_per_call() 
might be needed.




> Eventually the timer fires and the hrtimer code path wakes the fibril:
> 
> -       if (task)
> -               wake_up_process(task);
> +       if (wake_target)
> +               wake_up_target(wake_target);
> 
> We've doctored try_to_wake_up() to be able to tell if its argument is a
> task_struct or one of these fibril targets.  In the fibril case it calls
> try_to_wake_up_fibril().  It notices that the target fibril does need to be
> woken and sets it TASK_RUNNING.  It notices that it it's current in the task so
> it puts the fibril on the task's fibril run queue and wakes the task.  There's
> grossness here.  It needs the task to come through schedule() again so that it
> can find the new runnable fibril instead of continuing to execute its current
> fibril.  To this end, wake-up marks the task's current ti TIF_NEED_RESCHED.

Fine IMO. Better keep scheduling code localized inside schedule().



> - With get AIO support for all syscalls.  Every single one.  (Well, please, no
> asys sys_exit() :)).  Buffered IO, sendfile, recvmsg, poll, epoll, hardware
> crypto ioctls, open, mmap, getdents, the entire splice API, etc.

Eeek, ... poll, epoll :)
That might solve the async() <-> POSIX bridge in the other way around. The 
collector will become the async() events fetcher, instead of the other way 
around. Will work just fine ...



> - We wouldn't multiply testing and maintenance burden with separate AIO paths.
> No is_sync_kiocb() testing and divergence between returning or calling
> aio_complete().  No auditing to make sure that EIOCBRETRY only being returned
> after any significant references of current->.  No worries about completion
> racing from the submission return path and some aio_complete() being called
> from another context.  In this scheme if your sync syscall path isn't broken,
> your AIO path stands a great chance of working.

This is the *big win* of the whole thing IMO.



> - AIO syscalls which *don't* block see very little overhead.  They'll allocate
> stacks and juggle the run queue locks a little, but they'll execute in turn on
> the submitting (cache-hot, presumably) processor.  There's room to optimize
> this path, too, of course.

Stack allocation can be optimized/cached, as someone else already said.



> - The 800lb elephant in the room.  It uses a full stack per blocked operation.
> I believe this is a reasonable price to pay for the flexibility of having *any*
> call pending.  It rules out some loads which would want to keep *millions* of
> operations pending, but I humbly submit that a load rarely takes that number of
> concurrent ops to saturate a resource.  (think of it this way: we've gotten
> this far by having to burn a full *task* to have *any* syscall pending.)  While
> not optimal, it opens to door to a lot of functionality without having to
> rewrite the kernel as a giant non-blocking state machine.

This should not be a huge problem IMO. High latency operations like 
network sockets can be handled with standard I/O events interfaces like 
poll/epoll, so I do not expect to have a huge number of fibrils around. 
The number of fibrils will be proportional to the number of active 
connections, not to the total number of connections.



> It should be noted that my very first try was to copy the used part of stacks
> in to and out of one full allocated stack.  This uses less memory per blocking
> operation at the cpu cost of copying the used regions.  And it's a terrible
> idea, for at least two reasons.  First, to actually get the memory overhead
> savings you allocate at stack switch time.  If that allocation can't be
> satisfied you are in *trouble* because you might not be able to switch over to
> a fibril that is trying to free up memory.  Deadlock city.  Second, it means
> that *you can't reference on-stack data in the wake-up path*.  This is a
> nightmare.  Even our trivial sys_nanosleep() example would have had to take its
> hrtimer_sleeper off the stack and allocate it.  Never mind, you know, basically
> every single user of <linux/wait.h>.   My current thinking is that it's just
> not worth it.

Agreed. Most definitely not worth it, for the reasons above.



> - We would now have some measure of task_struct concurrency.  Read that twice,
> it's scary.  As two fibrils execute and block in turn they'll each be
> referencing current->.  It means that we need to audit task_struct to make sure
> that paths can handle racing as its scheduled away.  The current implementation
> *does not* let preemption trigger a fibril switch.  So one only has to worry
> about racing with voluntary scheduling of the fibril paths.  This can mean
> moving some task_struct members under an accessor that hides them in a struct
> in task_struct so they're switched along with the fibril.  I think this is a
> manageable burden.

That seems the correct policy in any case.



> - Signals.  I have no idea what behaviour we want.  Help?  My first guess is
> that we'll want signal state to be shared by fibrils by keeping it in the
> task_struct.  If we want something like individual cancellation,  we'll augment
> signal_pending() with some some per-fibril test which will cause it to return
> from TASK_INTERRUPTIBLE (the only reasonable way to implement generic
> cancellation, I'll argue) as it would have if a signal was pending.

Fibril should IMO use current thread signal policies. I think a signal 
should hit (wake) any TASK_INTERRUPTIBLE fibril, if the current thread 
policies mandate that. I'd keep a list_head of currently scheduled-out 
TASK_INTERRUPTIBLE fibrils, and I'd make them runnable when a signal is 
delivered to the thread (wake_target bit #1 set to mean wake-all-interruptable-fibrils?).
The other thing is signal_pending(). The sigpending flag test is not going 
to work as is (cleared at the first do_signal). Setting a bit in each 
fibril would mean walking the whole TASK_INTERRUPTIBLE fibril list. Maybe 
a sequential signal counter in task_struct, matched by one in the fibril. 
A signal would increment the task_struct counter, and a fibril 
schedule-out would save the task_struct counter to the fibril. The 
signal_pending() for a fibril is a compare of the two. Or something 
similar.
In general, I think it'd make sense to have a fibril-based implemenation 
and a kthread-based one, and compare the messyness :) of the two related 
to cons/performnces.



- Davide


  parent reply	other threads:[~2007-02-04  5:13 UTC|newest]

Thread overview: 151+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2007-01-30 20:39 [PATCH 0 of 4] Generic AIO by scheduling stacks Zach Brown
2007-01-30 20:39 ` [PATCH 1 of 4] Introduce per_call_chain() Zach Brown
2007-01-30 20:39 ` [PATCH 2 of 4] Introduce i386 fibril scheduling Zach Brown
2007-02-01  8:36   ` Ingo Molnar
2007-02-01 13:02     ` Ingo Molnar
2007-02-01 13:19       ` Christoph Hellwig
2007-02-01 13:52         ` Ingo Molnar
2007-02-01 17:13           ` Mark Lord
2007-02-01 18:02             ` Ingo Molnar
2007-02-02 13:23         ` Andi Kleen
2007-02-01 21:52       ` Zach Brown
2007-02-01 22:23         ` Benjamin LaHaise
2007-02-01 22:37           ` Zach Brown
2007-02-02 13:22       ` Andi Kleen
2007-02-01 20:07     ` Linus Torvalds
2007-02-02 10:49       ` Ingo Molnar
2007-02-02 15:56         ` Linus Torvalds
2007-02-02 19:59           ` Alan
2007-02-02 20:14             ` Linus Torvalds
2007-02-02 20:58               ` Davide Libenzi
2007-02-02 21:09                 ` Linus Torvalds
2007-02-02 21:30               ` Alan
2007-02-02 21:30                 ` Linus Torvalds
2007-02-02 22:42                   ` Ingo Molnar
2007-02-02 23:01                     ` Linus Torvalds
2007-02-02 23:17                       ` Linus Torvalds
2007-02-03  0:04                         ` Alan
2007-02-03  0:23                         ` bert hubert
2007-02-02 22:48                   ` Alan
2007-02-05 16:44             ` Zach Brown
2007-02-02 22:21           ` Ingo Molnar
2007-02-02 22:49             ` Linus Torvalds
2007-02-02 23:55               ` Ingo Molnar
2007-02-03  0:56                 ` Linus Torvalds
2007-02-03  7:15                   ` Suparna Bhattacharya
2007-02-03  8:23                   ` Ingo Molnar
2007-02-03  9:25                     ` Matt Mackall
2007-02-03 10:03                       ` Ingo Molnar
2007-02-05 17:44                     ` Zach Brown
2007-02-05 19:26                       ` Davide Libenzi
2007-02-05 19:41                         ` Zach Brown
2007-02-05 20:10                           ` Davide Libenzi
2007-02-05 20:21                             ` Zach Brown
2007-02-05 20:42                               ` Linus Torvalds
2007-02-05 20:39                             ` Linus Torvalds
2007-02-05 21:09                               ` Davide Libenzi
2007-02-05 21:31                                 ` Kent Overstreet
2007-02-06 20:25                                   ` Davide Libenzi
2007-02-06 20:46                                   ` Linus Torvalds
2007-02-06 21:16                                     ` David Miller
2007-02-06 21:28                                       ` Linus Torvalds
2007-02-06 21:31                                         ` David Miller
2007-02-06 21:46                                           ` Eric Dumazet
2007-02-06 21:50                                           ` Linus Torvalds
2007-02-06 22:28                                             ` Zach Brown
2007-02-06 22:45                                     ` Kent Overstreet
2007-02-06 23:04                                       ` Linus Torvalds
2007-02-07  1:22                                         ` Kent Overstreet
2007-02-06 23:23                                       ` Davide Libenzi
2007-02-06 23:39                                         ` Joel Becker
2007-02-06 23:56                                           ` Davide Libenzi
2007-02-07  0:06                                             ` Joel Becker
2007-02-07  0:23                                               ` Davide Libenzi
2007-02-07  0:44                                                 ` Joel Becker
2007-02-07  1:15                                                   ` Davide Libenzi
2007-02-07  1:24                                                     ` Kent Overstreet
2007-02-07  1:30                                                     ` Joel Becker
2007-02-07  6:16                                                   ` Michael K. Edwards
2007-02-07  9:17                                                     ` Michael K. Edwards
2007-02-07  9:37                                                       ` Michael K. Edwards
2007-02-06  0:32                                 ` Davide Libenzi
2007-02-05 21:21                               ` Zach Brown
2007-02-02 23:37             ` Davide Libenzi
2007-02-03  0:02               ` Davide Libenzi
2007-02-05 17:12               ` Zach Brown
2007-02-05 18:24                 ` Davide Libenzi
2007-02-05 21:44                   ` David Miller
2007-02-06  0:15                     ` Davide Libenzi
2007-02-05 21:36               ` bert hubert
2007-02-05 21:57                 ` Linus Torvalds
2007-02-05 22:07                   ` bert hubert
2007-02-05 22:15                     ` Zach Brown
2007-02-05 22:34                   ` Davide Libenzi
2007-02-06  0:27                   ` Scot McKinley
2007-02-06  0:48                     ` David Miller
2007-02-06  0:48                     ` Joel Becker
2007-02-05 17:02             ` Zach Brown
2007-02-05 18:52               ` Davide Libenzi
2007-02-05 19:20                 ` Zach Brown
2007-02-05 19:38                   ` Davide Libenzi
2007-02-04  5:12   ` Davide Libenzi
2007-02-05 17:54     ` Zach Brown
2007-01-30 20:39 ` [PATCH 3 of 4] Teach paths to wake a specific void * target instead of a whole task_struct Zach Brown
2007-01-30 20:39 ` [PATCH 4 of 4] Introduce aio system call submission and completion system calls Zach Brown
2007-01-31  8:58   ` Andi Kleen
2007-01-31 17:15     ` Zach Brown
2007-01-31 17:21       ` Andi Kleen
2007-01-31 19:23         ` Zach Brown
2007-02-01 11:13           ` Suparna Bhattacharya
2007-02-01 19:50             ` Trond Myklebust
2007-02-02  7:19               ` Suparna Bhattacharya
2007-02-02  7:45                 ` Andi Kleen
2007-02-01 22:18             ` Zach Brown
2007-02-02  3:35               ` Suparna Bhattacharya
2007-02-01 20:26   ` bert hubert
2007-02-01 21:29     ` Zach Brown
2007-02-02  7:12       ` bert hubert
2007-02-04  5:12   ` Davide Libenzi
2007-01-30 21:58 ` [PATCH 0 of 4] Generic AIO by scheduling stacks Linus Torvalds
2007-01-30 22:23   ` Linus Torvalds
2007-01-30 22:53     ` Zach Brown
2007-01-30 22:40   ` Zach Brown
2007-01-30 22:53     ` Linus Torvalds
2007-01-30 23:45       ` Zach Brown
2007-01-31  2:07         ` Benjamin Herrenschmidt
2007-01-31  2:04 ` Benjamin Herrenschmidt
2007-01-31  2:46   ` Linus Torvalds
2007-01-31  3:02     ` Linus Torvalds
2007-01-31 10:50       ` Xavier Bestel
2007-01-31 19:28         ` Zach Brown
2007-01-31 17:59       ` Zach Brown
2007-01-31  5:16     ` Benjamin Herrenschmidt
2007-01-31  5:36     ` Nick Piggin
2007-01-31  5:51       ` Nick Piggin
2007-01-31  6:06       ` Linus Torvalds
2007-01-31  8:43         ` Ingo Molnar
2007-01-31 20:13         ` Joel Becker
2007-01-31 18:20       ` Zach Brown
2007-01-31 17:47     ` Zach Brown
2007-01-31 17:38   ` Zach Brown
2007-01-31 17:51     ` Benjamin LaHaise
2007-01-31 19:25       ` Zach Brown
2007-01-31 20:05         ` Benjamin LaHaise
2007-01-31 20:41           ` Zach Brown
2007-02-04  5:13 ` Davide Libenzi [this message]
2007-02-04 20:00   ` Davide Libenzi
2007-02-09 22:33 ` Linus Torvalds
2007-02-09 23:11   ` Davide Libenzi
2007-02-09 23:35     ` Linus Torvalds
2007-02-10 18:45       ` Davide Libenzi
2007-02-10 19:01         ` Linus Torvalds
2007-02-10 19:35           ` Linus Torvalds
2007-02-10 20:59           ` Davide Libenzi
2007-02-10  0:04   ` Eric Dumazet
2007-02-10  0:12     ` Linus Torvalds
2007-02-10  0:34       ` Alan
2007-02-10 10:47   ` bert hubert
2007-02-10 18:19     ` Davide Libenzi
2007-02-11  0:56   ` David Miller
2007-02-11  2:49     ` Linus Torvalds
2007-02-14 16:42       ` James Antill

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=Pine.LNX.4.64.0702031539370.9054@alien.or.mcafeemobile.com \
    --to=davidel@xmailserver.org \
    --cc=bcrl@kvack.org \
    --cc=linux-aio@kvack.org \
    --cc=linux-kernel@vger.kernel.org \
    --cc=suparna@in.ibm.com \
    --cc=torvalds@linux-foundation.org \
    --cc=zach.brown@oracle.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).