All of lore.kernel.org
 help / color / mirror / Atom feed
From: Davide Libenzi <davidel@xmailserver.org>
To: Zach Brown <zach.brown@oracle.com>
Cc: Linux Kernel Mailing List <linux-kernel@vger.kernel.org>,
	linux-aio@kvack.org, Suparna Bhattacharya <suparna@in.ibm.com>,
	Benjamin LaHaise <bcrl@kvack.org>,
	Linus Torvalds <torvalds@linux-foundation.org>
Subject: Re: [PATCH 0 of 4] Generic AIO by scheduling stacks
Date: Sat, 3 Feb 2007 21:13:00 -0800 (PST)	[thread overview]
Message-ID: <Pine.LNX.4.64.0702031539370.9054@alien.or.mcafeemobile.com> (raw)
In-Reply-To: <patchbomb.1170193181@tetsuo.zabbo.net>

On Tue, 30 Jan 2007, Zach Brown wrote:

> This very rough patch series introduces a different way to provide AIO support
> for system calls.

Zab, great stuff!
I've found a little time to take a look at the patches and throw some 
comments at you.
Keep in mind though, that the last time I seriously looked at this stuff, 
sched.c was like 2K lines of code, and now it's 7K. Enough said ;)




> Right now to provide AIO support for a system call you have to express your
> interface in the iocb argument struct for sys_io_submit(), teach fs/aio.c to
> translate this into some call path in the kernel that passes in an iocb, and
> then update your code path implement with either completion-based (EIOCBQUEUED)
> or retry-based (EIOCBRETRY) AIO with the iocb.
> 
> This patch series changes this by moving the complexity into generic code such
> that a system call handler would provide AIO support in exactly the same way
> that it supports a synchronous call.  It does this by letting a task have
> multiple stacks executing system calls in the kernel.  Stacks are switched in
> schedule() as they block and are made runnable.

As I said in another email, I think this is a *great* idea. It allows for 
generic async execution, while leaving the core kernel AIO unware. Of 
course, ot be usable, a lot of details needs to be polished, and 
performance evaluated.




> We start in sys_asys_submit().  It allocates a fibril for this executing
> submission syscall and hangs it off the task_struct.  This lets this submission
> fibril be scheduled along with the later asys system calls themselves.

Why do you need this "special" fibril for the submission task?



> The specific switching mechanics of this implementation rely on the notion of
> tracking a stack as a full thread_info pointer.  To make the switch we transfer
> the non-stack bits of the thread_info from the old fibril's ti to the new
> fibril's ti.  We update the book keeping in the task_struct to 
> consider the new thread_info as the current thread_info for the task.  Like so:
> 
>         *next->ti = *ti;
>         *thread_info_pt_regs(next->ti) = *thread_info_pt_regs(ti);
> 
>         current->thread_info = next->ti;
>         current->thread.esp0 = (unsigned long)(thread_info_pt_regs(next->ti) + 1);
>         current->fibril = next;
>         current->state = next->state;
>         current->per_call = next->per_call;
> 
> Yeah, messy.  I'm interested in aggressive feedback on how to do this sanely.
> Especially from the perspective of worrying about all the archs.

Maybe an arch-specific copy_thread_info()? Or, since there's a 1:1 
relationship, just merging them.



> Did everyone catch that "per_call" thing there?  That's to switch members of
> task_struct which are local to a specific call.  link_count, journal_info, that
> sort of thing.  More on that as we talk about the costs later.

Yes ;) Brutally copying the structure over does not look good IMO. Better 
keep a pointer and swapping them. A clone_per_call() and free_per_call() 
might be needed.




> Eventually the timer fires and the hrtimer code path wakes the fibril:
> 
> -       if (task)
> -               wake_up_process(task);
> +       if (wake_target)
> +               wake_up_target(wake_target);
> 
> We've doctored try_to_wake_up() to be able to tell if its argument is a
> task_struct or one of these fibril targets.  In the fibril case it calls
> try_to_wake_up_fibril().  It notices that the target fibril does need to be
> woken and sets it TASK_RUNNING.  It notices that it it's current in the task so
> it puts the fibril on the task's fibril run queue and wakes the task.  There's
> grossness here.  It needs the task to come through schedule() again so that it
> can find the new runnable fibril instead of continuing to execute its current
> fibril.  To this end, wake-up marks the task's current ti TIF_NEED_RESCHED.

Fine IMO. Better keep scheduling code localized inside schedule().



> - With get AIO support for all syscalls.  Every single one.  (Well, please, no
> asys sys_exit() :)).  Buffered IO, sendfile, recvmsg, poll, epoll, hardware
> crypto ioctls, open, mmap, getdents, the entire splice API, etc.

Eeek, ... poll, epoll :)
That might solve the async() <-> POSIX bridge in the other way around. The 
collector will become the async() events fetcher, instead of the other way 
around. Will work just fine ...



> - We wouldn't multiply testing and maintenance burden with separate AIO paths.
> No is_sync_kiocb() testing and divergence between returning or calling
> aio_complete().  No auditing to make sure that EIOCBRETRY only being returned
> after any significant references of current->.  No worries about completion
> racing from the submission return path and some aio_complete() being called
> from another context.  In this scheme if your sync syscall path isn't broken,
> your AIO path stands a great chance of working.

This is the *big win* of the whole thing IMO.



> - AIO syscalls which *don't* block see very little overhead.  They'll allocate
> stacks and juggle the run queue locks a little, but they'll execute in turn on
> the submitting (cache-hot, presumably) processor.  There's room to optimize
> this path, too, of course.

Stack allocation can be optimized/cached, as someone else already said.



> - The 800lb elephant in the room.  It uses a full stack per blocked operation.
> I believe this is a reasonable price to pay for the flexibility of having *any*
> call pending.  It rules out some loads which would want to keep *millions* of
> operations pending, but I humbly submit that a load rarely takes that number of
> concurrent ops to saturate a resource.  (think of it this way: we've gotten
> this far by having to burn a full *task* to have *any* syscall pending.)  While
> not optimal, it opens to door to a lot of functionality without having to
> rewrite the kernel as a giant non-blocking state machine.

This should not be a huge problem IMO. High latency operations like 
network sockets can be handled with standard I/O events interfaces like 
poll/epoll, so I do not expect to have a huge number of fibrils around. 
The number of fibrils will be proportional to the number of active 
connections, not to the total number of connections.



> It should be noted that my very first try was to copy the used part of stacks
> in to and out of one full allocated stack.  This uses less memory per blocking
> operation at the cpu cost of copying the used regions.  And it's a terrible
> idea, for at least two reasons.  First, to actually get the memory overhead
> savings you allocate at stack switch time.  If that allocation can't be
> satisfied you are in *trouble* because you might not be able to switch over to
> a fibril that is trying to free up memory.  Deadlock city.  Second, it means
> that *you can't reference on-stack data in the wake-up path*.  This is a
> nightmare.  Even our trivial sys_nanosleep() example would have had to take its
> hrtimer_sleeper off the stack and allocate it.  Never mind, you know, basically
> every single user of <linux/wait.h>.   My current thinking is that it's just
> not worth it.

Agreed. Most definitely not worth it, for the reasons above.



> - We would now have some measure of task_struct concurrency.  Read that twice,
> it's scary.  As two fibrils execute and block in turn they'll each be
> referencing current->.  It means that we need to audit task_struct to make sure
> that paths can handle racing as its scheduled away.  The current implementation
> *does not* let preemption trigger a fibril switch.  So one only has to worry
> about racing with voluntary scheduling of the fibril paths.  This can mean
> moving some task_struct members under an accessor that hides them in a struct
> in task_struct so they're switched along with the fibril.  I think this is a
> manageable burden.

That seems the correct policy in any case.



> - Signals.  I have no idea what behaviour we want.  Help?  My first guess is
> that we'll want signal state to be shared by fibrils by keeping it in the
> task_struct.  If we want something like individual cancellation,  we'll augment
> signal_pending() with some some per-fibril test which will cause it to return
> from TASK_INTERRUPTIBLE (the only reasonable way to implement generic
> cancellation, I'll argue) as it would have if a signal was pending.

Fibril should IMO use current thread signal policies. I think a signal 
should hit (wake) any TASK_INTERRUPTIBLE fibril, if the current thread 
policies mandate that. I'd keep a list_head of currently scheduled-out 
TASK_INTERRUPTIBLE fibrils, and I'd make them runnable when a signal is 
delivered to the thread (wake_target bit #1 set to mean wake-all-interruptable-fibrils?).
The other thing is signal_pending(). The sigpending flag test is not going 
to work as is (cleared at the first do_signal). Setting a bit in each 
fibril would mean walking the whole TASK_INTERRUPTIBLE fibril list. Maybe 
a sequential signal counter in task_struct, matched by one in the fibril. 
A signal would increment the task_struct counter, and a fibril 
schedule-out would save the task_struct counter to the fibril. The 
signal_pending() for a fibril is a compare of the two. Or something 
similar.
In general, I think it'd make sense to have a fibril-based implemenation 
and a kthread-based one, and compare the messyness :) of the two related 
to cons/performnces.



- Davide


  parent reply	other threads:[~2007-02-04  5:13 UTC|newest]

Thread overview: 151+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2007-01-30 20:39 [PATCH 0 of 4] Generic AIO by scheduling stacks Zach Brown
2007-01-30 20:39 ` [PATCH 1 of 4] Introduce per_call_chain() Zach Brown
2007-01-30 20:39 ` [PATCH 2 of 4] Introduce i386 fibril scheduling Zach Brown
2007-02-01  8:36   ` Ingo Molnar
2007-02-01 13:02     ` Ingo Molnar
2007-02-01 13:19       ` Christoph Hellwig
2007-02-01 13:52         ` Ingo Molnar
2007-02-01 17:13           ` Mark Lord
2007-02-01 18:02             ` Ingo Molnar
2007-02-02 13:23         ` Andi Kleen
2007-02-01 21:52       ` Zach Brown
2007-02-01 22:23         ` Benjamin LaHaise
2007-02-01 22:37           ` Zach Brown
2007-02-02 13:22       ` Andi Kleen
2007-02-01 20:07     ` Linus Torvalds
2007-02-02 10:49       ` Ingo Molnar
2007-02-02 15:56         ` Linus Torvalds
2007-02-02 19:59           ` Alan
2007-02-02 20:14             ` Linus Torvalds
2007-02-02 20:58               ` Davide Libenzi
2007-02-02 21:09                 ` Linus Torvalds
2007-02-02 21:30               ` Alan
2007-02-02 21:30                 ` Linus Torvalds
2007-02-02 22:42                   ` Ingo Molnar
2007-02-02 23:01                     ` Linus Torvalds
2007-02-02 23:17                       ` Linus Torvalds
2007-02-03  0:04                         ` Alan
2007-02-03  0:23                         ` bert hubert
2007-02-02 22:48                   ` Alan
2007-02-05 16:44             ` Zach Brown
2007-02-02 22:21           ` Ingo Molnar
2007-02-02 22:49             ` Linus Torvalds
2007-02-02 23:55               ` Ingo Molnar
2007-02-03  0:56                 ` Linus Torvalds
2007-02-03  7:15                   ` Suparna Bhattacharya
2007-02-03  8:23                   ` Ingo Molnar
2007-02-03  9:25                     ` Matt Mackall
2007-02-03 10:03                       ` Ingo Molnar
2007-02-05 17:44                     ` Zach Brown
2007-02-05 19:26                       ` Davide Libenzi
2007-02-05 19:41                         ` Zach Brown
2007-02-05 20:10                           ` Davide Libenzi
2007-02-05 20:21                             ` Zach Brown
2007-02-05 20:42                               ` Linus Torvalds
2007-02-05 20:39                             ` Linus Torvalds
2007-02-05 21:09                               ` Davide Libenzi
2007-02-05 21:31                                 ` Kent Overstreet
2007-02-06 20:25                                   ` Davide Libenzi
2007-02-06 20:46                                   ` Linus Torvalds
2007-02-06 21:16                                     ` David Miller
2007-02-06 21:28                                       ` Linus Torvalds
2007-02-06 21:31                                         ` David Miller
2007-02-06 21:46                                           ` Eric Dumazet
2007-02-06 21:50                                           ` Linus Torvalds
2007-02-06 22:28                                             ` Zach Brown
2007-02-06 22:45                                     ` Kent Overstreet
2007-02-06 23:04                                       ` Linus Torvalds
2007-02-07  1:22                                         ` Kent Overstreet
2007-02-06 23:23                                       ` Davide Libenzi
2007-02-06 23:39                                         ` Joel Becker
2007-02-06 23:56                                           ` Davide Libenzi
2007-02-07  0:06                                             ` Joel Becker
2007-02-07  0:23                                               ` Davide Libenzi
2007-02-07  0:44                                                 ` Joel Becker
2007-02-07  1:15                                                   ` Davide Libenzi
2007-02-07  1:24                                                     ` Kent Overstreet
2007-02-07  1:30                                                     ` Joel Becker
2007-02-07  6:16                                                   ` Michael K. Edwards
2007-02-07  9:17                                                     ` Michael K. Edwards
2007-02-07  9:37                                                       ` Michael K. Edwards
2007-02-06  0:32                                 ` Davide Libenzi
2007-02-05 21:21                               ` Zach Brown
2007-02-02 23:37             ` Davide Libenzi
2007-02-03  0:02               ` Davide Libenzi
2007-02-05 17:12               ` Zach Brown
2007-02-05 18:24                 ` Davide Libenzi
2007-02-05 21:44                   ` David Miller
2007-02-06  0:15                     ` Davide Libenzi
2007-02-05 21:36               ` bert hubert
2007-02-05 21:57                 ` Linus Torvalds
2007-02-05 22:07                   ` bert hubert
2007-02-05 22:15                     ` Zach Brown
2007-02-05 22:34                   ` Davide Libenzi
2007-02-06  0:27                   ` Scot McKinley
2007-02-06  0:48                     ` David Miller
2007-02-06  0:48                     ` Joel Becker
2007-02-05 17:02             ` Zach Brown
2007-02-05 18:52               ` Davide Libenzi
2007-02-05 19:20                 ` Zach Brown
2007-02-05 19:38                   ` Davide Libenzi
2007-02-04  5:12   ` Davide Libenzi
2007-02-05 17:54     ` Zach Brown
2007-01-30 20:39 ` [PATCH 3 of 4] Teach paths to wake a specific void * target instead of a whole task_struct Zach Brown
2007-01-30 20:39 ` [PATCH 4 of 4] Introduce aio system call submission and completion system calls Zach Brown
2007-01-31  8:58   ` Andi Kleen
2007-01-31 17:15     ` Zach Brown
2007-01-31 17:21       ` Andi Kleen
2007-01-31 19:23         ` Zach Brown
2007-02-01 11:13           ` Suparna Bhattacharya
2007-02-01 19:50             ` Trond Myklebust
2007-02-02  7:19               ` Suparna Bhattacharya
2007-02-02  7:45                 ` Andi Kleen
2007-02-01 22:18             ` Zach Brown
2007-02-02  3:35               ` Suparna Bhattacharya
2007-02-01 20:26   ` bert hubert
2007-02-01 21:29     ` Zach Brown
2007-02-02  7:12       ` bert hubert
2007-02-04  5:12   ` Davide Libenzi
2007-01-30 21:58 ` [PATCH 0 of 4] Generic AIO by scheduling stacks Linus Torvalds
2007-01-30 22:23   ` Linus Torvalds
2007-01-30 22:53     ` Zach Brown
2007-01-30 22:40   ` Zach Brown
2007-01-30 22:53     ` Linus Torvalds
2007-01-30 23:45       ` Zach Brown
2007-01-31  2:07         ` Benjamin Herrenschmidt
2007-01-31  2:04 ` Benjamin Herrenschmidt
2007-01-31  2:46   ` Linus Torvalds
2007-01-31  3:02     ` Linus Torvalds
2007-01-31 10:50       ` Xavier Bestel
2007-01-31 19:28         ` Zach Brown
2007-01-31 17:59       ` Zach Brown
2007-01-31  5:16     ` Benjamin Herrenschmidt
2007-01-31  5:36     ` Nick Piggin
2007-01-31  5:51       ` Nick Piggin
2007-01-31  6:06       ` Linus Torvalds
2007-01-31  8:43         ` Ingo Molnar
2007-01-31 20:13         ` Joel Becker
2007-01-31 18:20       ` Zach Brown
2007-01-31 17:47     ` Zach Brown
2007-01-31 17:38   ` Zach Brown
2007-01-31 17:51     ` Benjamin LaHaise
2007-01-31 19:25       ` Zach Brown
2007-01-31 20:05         ` Benjamin LaHaise
2007-01-31 20:41           ` Zach Brown
2007-02-04  5:13 ` Davide Libenzi [this message]
2007-02-04 20:00   ` Davide Libenzi
2007-02-09 22:33 ` Linus Torvalds
2007-02-09 23:11   ` Davide Libenzi
2007-02-09 23:35     ` Linus Torvalds
2007-02-10 18:45       ` Davide Libenzi
2007-02-10 19:01         ` Linus Torvalds
2007-02-10 19:35           ` Linus Torvalds
2007-02-10 20:59           ` Davide Libenzi
2007-02-10  0:04   ` Eric Dumazet
2007-02-10  0:12     ` Linus Torvalds
2007-02-10  0:34       ` Alan
2007-02-10 10:47   ` bert hubert
2007-02-10 18:19     ` Davide Libenzi
2007-02-11  0:56   ` David Miller
2007-02-11  2:49     ` Linus Torvalds
2007-02-14 16:42       ` James Antill

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=Pine.LNX.4.64.0702031539370.9054@alien.or.mcafeemobile.com \
    --to=davidel@xmailserver.org \
    --cc=bcrl@kvack.org \
    --cc=linux-aio@kvack.org \
    --cc=linux-kernel@vger.kernel.org \
    --cc=suparna@in.ibm.com \
    --cc=torvalds@linux-foundation.org \
    --cc=zach.brown@oracle.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.