* Approaches to making io_submit not block
@ 2011-08-29 17:33 Daniel Ehrenberg
  2011-08-30  5:32 ` Christoph Hellwig
                   ` (2 more replies)
  0 siblings, 3 replies; 41+ messages in thread
From: Daniel Ehrenberg @ 2011-08-29 17:33 UTC (permalink / raw)
  To: linux-kernel

Hi,

The Linux AIO interface (io_submit, io_getevents, etc.) is useful in
allowing multiple requests to flow through the I/O stack without
requiring a userspace or even kernel thread per pending request. This
is really great for maxing out high-performance devices like SSDs.
However, it seems incomplete to me because io_submit sometimes blocks
for a couple of filesystem-related reasons. I'm wondering if this could
be fixed, or if there is an inherent need for this sort of blocking.
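
(For concreteness, here is roughly how a caller drives this interface,
as an illustrative, untested sketch; the file name and sizes are
arbitrary:)

#define _GNU_SOURCE		/* for O_DIRECT */
#include <libaio.h>
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>

int main(void)
{
	io_context_t ctx = 0;
	struct iocb cb, *cbs[1] = { &cb };
	struct io_event ev;
	void *buf;
	int fd = open("data.bin", O_RDONLY | O_DIRECT);

	if (fd < 0 || io_setup(32, &ctx) < 0)
		return 1;
	if (posix_memalign(&buf, 512, 4096))	/* O_DIRECT wants aligned I/O */
		return 1;

	io_prep_pread(&cb, fd, buf, 4096, 0);
	io_submit(ctx, 1, cbs);			/* the call that can block today */
	io_getevents(ctx, 1, 1, &ev, NULL);	/* reap the completion */

	printf("res=%ld\n", (long)ev.res);
	io_destroy(ctx);
	return 0;
}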

- Blocking due to reading metadata.
Proposed solution:
Add a per-ioctx work queue to do metadata reads. It will be triggered
from the dio code: if in async mode, then get_block will be called
with an additional flag, meaning something like O_NONBLOCK on sockets.
File systems' get_block functions can implement this flag and return
-EAGAIN if a read from the underlying device would be necessary. (If
we're worried that EAGAIN might be used for other purposes in the
future, we could make a new errno for this purpose.) From a quick
glance at the code, it looks like this would not be too difficult to
add to ext4 for extent-based files, and support in other file systems
could be added gradually. If -EAGAIN is returned, then the struct dio
will be put on the work queue together with a description of what kind
of processing it was doing. The work queue only serves the metadata
request, and the rest of the request is served on the existing path.
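
A sketch of what the filesystem side might look like (hypothetical
code: the _nonblock entry point and the cache-probe helper are invented
for illustration, not taken from ext4):

/*
 * Hypothetical sketch of the proposal above.  ext4_extent_cached() is
 * an invented helper; only ext4_get_block() is a real function.
 */
static int ext4_get_block_nonblock(struct inode *inode, sector_t iblock,
				   struct buffer_head *bh_result, int create)
{
	/*
	 * If mapping this block would require reading extent metadata
	 * from disk, refuse instead of blocking; the dio core then
	 * punts the request to the per-ioctx work queue and retries
	 * the lookup from there.
	 */
	if (!ext4_extent_cached(inode, iblock))
		return -EAGAIN;

	return ext4_get_block(inode, iblock, bh_result, create);
}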

- Blocking for appends and writes to file holes due to the need for a
metadata write after the data write
Proposed solution:
Maintain a work queue for all appends and writes to file holes, which
executes the current code.
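
In kernel terms, something like the following sketch (all names
invented; the real queueing would live in the dio code):

/*
 * Sketch only: defer an extending write to a per-ioctx work queue that
 * runs the existing (blocking) path, so io_submit itself can return.
 * dio_run_blocking_path() is an invented stand-in for today's code.
 */
struct dio_deferred {
	struct work_struct work;
	struct dio *dio;		/* the prepared request */
};

static void dio_deferred_fn(struct work_struct *work)
{
	struct dio_deferred *dd = container_of(work, struct dio_deferred, work);

	dio_run_blocking_path(dd->dio);
	kfree(dd);
}

static int dio_defer(struct workqueue_struct *wq, struct dio *dio)
{
	struct dio_deferred *dd = kmalloc(sizeof(*dd), GFP_KERNEL);

	if (!dd)
		return -ENOMEM;
	dd->dio = dio;
	INIT_WORK(&dd->work, dio_deferred_fn);
	queue_work(wq, &dd->work);	/* wq would hang off the ioctx */
	return 0;
}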

Has anything like this been discussed or implemented? What I'm talking
about isn't optimal in terms of parallelism; it just matches the
parallelism of the current approach (with the minor caveat that
multiple threads on the same core calling io_submit on the same ioctx
don't get to run their metadata/append I/O requests concurrently), but
it allows the io_submit system call to return to userspace much faster.
I've read about other proposals for general asynchronous syscalls, but
this would be lighter weight because it would not require a kernel task
per I/O request.

Thanks,
Dan


* Re: Approaches to making io_submit not block
  2011-08-29 17:33 Approaches to making io_submit not block Daniel Ehrenberg
@ 2011-08-30  5:32 ` Christoph Hellwig
  2011-08-30 21:51   ` Daniel Ehrenberg
  2011-08-30  7:02 ` Andi Kleen
       [not found] ` <CAAK6Zt0Sh1GdEOb-tNf2FGXJs=e1Jbcqew13R_GdTqrv6vW97w@mail.gmail.com>
  2 siblings, 1 reply; 41+ messages in thread
From: Christoph Hellwig @ 2011-08-30  5:32 UTC (permalink / raw)
  To: Daniel Ehrenberg; +Cc: linux-kernel

On Mon, Aug 29, 2011 at 10:33:24AM -0700, Daniel Ehrenberg wrote:
> - Blocking due to reading metadata.
> Proposed solution:
> Add a per-ioctx work queue to do metadata reads. It will be triggered
> from the dio code: if in async mode, then get_block will be called
> with an additional flag, meaning something like O_NONBLOCK on sockets.
> File systems' get_block functions can implement this flag and return
> -EAGAIN if a read from the underlying device would be necessary. (If
> we're worried that EAGAIN might be used for other purposes in the
> future, we could make a new errno for this purpose.) From a quick
> glance at the code, it looks like this would not be too difficult to
> add to ext4 for extent-based files, and support in other file systems
> could be added gradually. If -EAGAIN is returned, then the struct dio
> will be put on the work queue together with a description of what kind
> of processing it was doing. The work queue only serves the metadata
> request, and the rest of the request is served on the existing path.

Let filesystems handle this.  I've actually prototyped it in XFS,
based on some pending work from Dave, but at this point it's still butt
ugly.

> - Blocking for appends and writes to file holes due to the need for a
> metadata write after the data write
> Proposed solution:
> Maintain a work queue for all appends and writes to file holes, which
> executes the current code.

No way.  I've fixed this for XFS, and it's trivial without the need to
queue them up.  The only thing preventing appending writes from working
is a flag to tell the dio layer to just do them, just like it already
works for holes (and more QA).



* Re: Approaches to making io_submit not block
  2011-08-29 17:33 Approaches to making io_submit not block Daniel Ehrenberg
  2011-08-30  5:32 ` Christoph Hellwig
@ 2011-08-30  7:02 ` Andi Kleen
       [not found] ` <CAAK6Zt0Sh1GdEOb-tNf2FGXJs=e1Jbcqew13R_GdTqrv6vW97w@mail.gmail.com>
  2 siblings, 0 replies; 41+ messages in thread
From: Andi Kleen @ 2011-08-30  7:02 UTC (permalink / raw)
  To: Daniel Ehrenberg; +Cc: linux-kernel

Daniel Ehrenberg <dehrenberg@google.com> writes:
>
> Has anything like this been discussed or implemented?

There was a lot of discussion and some patches on "retry based AIO"
a few years ago. Didn't really go anywhere, but there are still
assorted leftovers in the code.

Then there was the "syslets" approach, but that is also stillborn.

Probably needs to be revisited from scratch. The network
layer is also badly in need of a better aio interface that supports
zero copy.

-Andi

-- 
ak@linux.intel.com -- Speaking for myself only


* Re: Approaches to making io_submit not block
  2011-08-30  5:32 ` Christoph Hellwig
@ 2011-08-30 21:51   ` Daniel Ehrenberg
  2011-08-31  5:26     ` Christoph Hellwig
  0 siblings, 1 reply; 41+ messages in thread
From: Daniel Ehrenberg @ 2011-08-30 21:51 UTC (permalink / raw)
  To: Christoph Hellwig; +Cc: linux-kernel

Thanks for getting back to me.

On Mon, Aug 29, 2011 at 10:32 PM, Christoph Hellwig <hch@infradead.org> wrote:
> On Mon, Aug 29, 2011 at 10:33:24AM -0700, Daniel Ehrenberg wrote:
>> - Blocking due to reading metadata.
>> Proposed solution:
>> Add a per-ioctx work queue to do metadata reads. It will be triggered
>> from the dio code: if in async mode, then get_block will be called
>> with an additional flag, meaning something like O_NONBLOCK on sockets.
>> File systems' get_block functions can implement this flag and return
>> -EAGAIN if a read from the underlying device would be necessary. (If
>> we're worried that EAGAIN might be used for other purposes in the
>> future, we could make a new errno for this purpose.) From a quick
>> glance at the code, it looks like this would not be too difficult to
>> add to ext4 for extent-based files, and support in other file systems
>> could be added gradually. If -EAGAIN is returned, then the struct dio
>> will be put on the work queue together with a description of what kind
>> of processing it was doing. The work queue only serves the metadata
>> request, and the rest of the request is served on the existing path.
>
> Let filesystems handle this.  I've actually prototyped it in XFS,
> based on some pending work from Dave, but at this point it's still butt
> ugly.

Great, would you be willing to let me see the draft code?

Are you sure there wouldn't be any benefit to having the code at the
aio/dio level, in terms of making things easier for file systems and
reducing code duplication?
>
>> - Blocking for appends and writes to file holes due to the need for a
>> metadata write after the data write
>> Proposed solution:
>> Maintain a work queue for all appends and writes to file holes, which
>> executes the current code.
>
> No way.  I've fixed this for XFS, and it's trivial without the need to
> queue them up.  The only thing preventing appending writes from working
> is a flag to tell the dio layer to just do them, just like it already
> works for holes (and more QA).
>
>
Are you saying this is already fixed for XFS? Appends don't block,
only reads to metadata do?

Dan


* Re: Approaches to making io_submit not block
       [not found]     ` <4E5D5817.6040704@kernel.dk>
@ 2011-08-30 22:19       ` Daniel Ehrenberg
  2011-08-30 22:32         ` Jens Axboe
  0 siblings, 1 reply; 41+ messages in thread
From: Daniel Ehrenberg @ 2011-08-30 22:19 UTC (permalink / raw)
  To: Jens Axboe; +Cc: Jeff Moyer, linux-kernel, linux-aio

On Tue, Aug 30, 2011 at 2:37 PM, Jens Axboe <axboe@kernel.dk> wrote:
> On 2011-08-30 15:30, Jeff Moyer wrote:
>> Daniel Ehrenberg <dehrenberg@google.com> writes:
>>
>>> Hi Jens, Jeff,
>>>
>>> I just sent a letter to LKML wondering about changes to io_submit that
>>> I'm thinking of working on. Based on your past contributions to this
>>> area, I'd really like to know what you think of this plan--how well it
>>> matches with the existing design, the potential for inclusion in
>>> upstream Linux, if you see problems.
>>
>> Hi, Dan,
>>
>> Thanks for taking the time to make AIO better!  There is a mailing list
>> for aio discussions: linux-aio@kvack.org, so please CC that in the
>> future (I don't read lkml anymore).
>>
>> Right now I'm a bit inundated, so I can't give this a proper review.
>> I should be able to free up some time in the next two weeks, though.
>>
>> In the mean time, you can google for suparna's retry-based aio patches.
>> Specifically, take a look at how she used prepare_to_wait/finish_wait.
>> If you haven't done any empirical tests to see where io_submit blocks,
>> there is a sample systemtap script for that:
>>   http://sourceware.org/systemtap/examples/io/io_submit.stp
>> Other attempts at non-blocking aio were off the deep end: fibrils and
>> syslets.  Fibrils didn't go anywhere because Ingo didn't like them (for
>> good reason, they essentially introduced another scheduling layer).
>> Syslets didn't go anywhere b/c they were insane (returned to the
>> user-space process with a different PID, among other things!).
>>
>> If you do go forward in the meantime, you can likely use EIOCBRETRY
>> instead of EAGAIN.
>>
>> I hope that helps!
>
FWIW, I updated the buffered AIO retry patches some time after Suparna
dropped them. By the date stamp in my branch, they are now 23 months
old... Anyway, at least it's more recent; you can find them here:
>
> http://git.kernel.dk/?p=linux-block.git;a=shortlog;h=refs/heads/aio-buffered
>
> --
> Jens Axboe
>
>
Thanks! Do you know why the patches weren't merged? I can't find much
discussion about them.

Dan


* Re: Approaches to making io_submit not block
  2011-08-30 22:19       ` Daniel Ehrenberg
@ 2011-08-30 22:32         ` Jens Axboe
  2011-08-30 22:41           ` Andrew Morton
  0 siblings, 1 reply; 41+ messages in thread
From: Jens Axboe @ 2011-08-30 22:32 UTC (permalink / raw)
  To: Daniel Ehrenberg; +Cc: Jeff Moyer, linux-kernel, linux-aio, Andrew Morton

On 2011-08-30 16:19, Daniel Ehrenberg wrote:
> On Tue, Aug 30, 2011 at 2:37 PM, Jens Axboe <axboe@kernel.dk> wrote:
>> On 2011-08-30 15:30, Jeff Moyer wrote:
>>> Daniel Ehrenberg <dehrenberg@google.com> writes:
>>>
>>>> Hi Jens, Jeff,
>>>>
>>>> I just sent a letter to LKML wondering about changes to io_submit that
>>>> I'm thinking of working on. Based on your past contributions to this
>>>> area, I'd really like to know what you think of this plan--how well it
>>>> matches with the existing design, the potential for inclusion in
>>>> upstream Linux, if you see problems.
>>>
>>> Hi, Dan,
>>>
>>> Thanks for taking the time to make AIO better!  There is a mailing list
>>> for aio discussions: linux-aio@kvack.org, so please CC that in the
>>> future (I don't read lkml anymore).
>>>
>>> Right now I'm a bit inundated, so I can't give this a proper review.
>>> I should be able to free up some time in the next two weeks, though.
>>>
>>> In the mean time, you can google for suparna's retry-based aio patches.
>>> Specifically, take a look at how she used prepare_to_wait/finish_wait.
>>> If you haven't done any empirical tests to see where io_submit blocks,
>>> there is a sample systemtap script for that:
>>>   http://sourceware.org/systemtap/examples/io/io_submit.stp
>>> Other attempts at non-blocking aio were off the deep end: fibrils and
>>> syslets.  Fibrils didn't go anywhere because Ingo didn't like them (for
>>> good reason, they essentially introduced another scheduling layer).
>>> Syslets didn't go anywhere b/c they were insane (returned to the
>>> user-space process with a different PID, among other things!).
>>>
>>> If you do go forward in the meantime, you can likely use EIOCBRETRY
>>> instead of EAGAIN.
>>>
>>> I hope that helps!
>>
>> FWIW, I updated the buffered AIO retry patches some time after Suparna
>> dropped them. By the date stamp in my branch, they are now 23 months
>> old... Anyway, at least it's more recent; you can find them here:
>>
>> http://git.kernel.dk/?p=linux-block.git;a=shortlog;h=refs/heads/aio-buffered
>>
>> --
>> Jens Axboe
>>
>>
> Thanks! Do you know why the patches weren't merged? I can't find much
> discussion about them.

Not quite sure, and after working on them and fixing things up, I don't
even think they are that complex or intrusive (which I think otherwise
would've been the main objection). Andrew may know/remember.

-- 
Jens Axboe



* Re: Approaches to making io_submit not block
  2011-08-30 22:32         ` Jens Axboe
@ 2011-08-30 22:41           ` Andrew Morton
  2011-08-30 22:45             ` Daniel Ehrenberg
  0 siblings, 1 reply; 41+ messages in thread
From: Andrew Morton @ 2011-08-30 22:41 UTC (permalink / raw)
  To: Jens Axboe; +Cc: Daniel Ehrenberg, Jeff Moyer, linux-kernel, linux-aio

On Tue, 30 Aug 2011 16:32:08 -0600
Jens Axboe <axboe@kernel.dk> wrote:

> On 2011-08-30 16:19, Daniel Ehrenberg wrote:
> > On Tue, Aug 30, 2011 at 2:37 PM, Jens Axboe <axboe@kernel.dk> wrote:
> >> On 2011-08-30 15:30, Jeff Moyer wrote:
> >>> Daniel Ehrenberg <dehrenberg@google.com> writes:
> >>>
> >>>> Hi Jens, Jeff,
> >>>>
> >>>> I just sent a letter to LKML wondering about changes to io_submit that
> >>>> I'm thinking of working on. Based on your past contributions to this
> >>>> area, I'd really like to know what you think of this plan--how well it
> >>>> matches with the existing design, the potential for inclusion in
> >>>> upstream Linux, if you see problems.
> >>>
> >>> Hi, Dan,
> >>>
> >>> Thanks for taking the time to make AIO better!  There is a mailing list
> >>> for aio discussions: linux-aio@kvack.org, so please CC that in the
> >>> future (I don't read lkml anymore).
> >>>
> >>> Right now I'm a bit inundated, so I can't give this a proper review.
> >>> I should be able to free up some time in the next two weeks, though.
> >>>
> >>> In the mean time, you can google for suparna's retry-based aio patches.
> >>> Specifically, take a look at how she used prepare_to_wait/finish_wait.
> >>> If you haven't done any empirical tests to see where io_submit blocks,
> >>> there is a sample systemtap script for that:
> >>>   http://sourceware.org/systemtap/examples/io/io_submit.stp
> >>> Other attempts at non-blocking aio were off the deep end: fibrils and
> >>> syslets.  Fibrils didn't go anywhere because Ingo didn't like them (for
> >>> good reason, they essentially introduced another scheduling layer).
> >>> Syslets didn't go anywhere b/c they were insane (returned to the
> >>> user-space process with a different PID, among other things!).
> >>>
> >>> If you do go forward in the meantime, you can likely use EIOCBRETRY
> >>> instead of EAGAIN.
> >>>
> >>> I hope that helps!
> >>
> >> FWIW, I updated the buffered AIO retry patches some time after Suparna
> >> dropped them. By the date stamp in my branch, they are now 23 months
> >> old... Anyway, at least it's more recent; you can find them here:
> >>
> >> http://git.kernel.dk/?p=linux-block.git;a=shortlog;h=refs/heads/aio-buffered
> >>
> >> --
> >> Jens Axboe
> >>
> >>
> > Thanks! Do you know why the patches weren't merged? I can't find much
> > discussion about them.
> 
> Not quite sure, and after working on them and fixing thing up, I don't
> even think they are that complex or intrusive (which I think otherwise
> would've been the main objection). Andrew may know/remember.

Boy, that was a long time ago.  I was always unhappy with the patches
because of the amount of additional code/complexity they added.

Then the great syslets/threadlets design session happened and it was
expected that such a facility would make special async handling for AIO
unnecessary.  Then syslets/threadlets didn't happen.



* Re: Approaches to making io_submit not block
  2011-08-30 22:41           ` Andrew Morton
@ 2011-08-30 22:45             ` Daniel Ehrenberg
  2011-08-30 22:54               ` Andrew Morton
  0 siblings, 1 reply; 41+ messages in thread
From: Daniel Ehrenberg @ 2011-08-30 22:45 UTC (permalink / raw)
  To: Andrew Morton; +Cc: Jens Axboe, Jeff Moyer, linux-kernel, linux-aio

On Tue, Aug 30, 2011 at 3:41 PM, Andrew Morton
<akpm@linux-foundation.org> wrote:
> On Tue, 30 Aug 2011 16:32:08 -0600
> Jens Axboe <axboe@kernel.dk> wrote:
>
>> On 2011-08-30 16:19, Daniel Ehrenberg wrote:
>> > On Tue, Aug 30, 2011 at 2:37 PM, Jens Axboe <axboe@kernel.dk> wrote:
>> >> On 2011-08-30 15:30, Jeff Moyer wrote:
>> >>> Daniel Ehrenberg <dehrenberg@google.com> writes:
>> >>>
>> >>>> Hi Jens, Jeff,
>> >>>>
>> >>>> I just sent a letter to LKML wondering about changes to io_submit that
>> >>>> I'm thinking of working on. Based on your past contributions to this
>> >>>> area, I'd really like to know what you think of this plan--how well it
>> >>>> matches with the existing design, the potential for inclusion in
>> >>>> upstream Linux, if you see problems.
>> >>>
>> >>> Hi, Dan,
>> >>>
>> >>> Thanks for taking the time to make AIO better!  There is a mailing list
>> >>> for aio discussions: linux-aio@kvack.org, so please CC that in the
>> >>> future (I don't read lkml anymore).
>> >>>
>> >>> Right now I'm a bit inundated, so I can't give this a proper review.
>> >>> I should be able to free up some time in the next two weeks, though.
>> >>>
>> >>> In the mean time, you can google for suparna's retry-based aio patches.
>> >>> Specifically, take a look at how she used prepare_to_wait/finish_wait.
>> >>> If you haven't done any empirical tests to see where io_submit blocks,
>> >>> there is a sample systemtap script for that:
>> >>>   http://sourceware.org/systemtap/examples/io/io_submit.stp
>> >>> Other attempts at non-blocking aio were off the deep end: fibrils and
>> >>> syslets.  Fibrils didn't go anywhere because Ingo didn't like them (for
>> >>> good reason, they essentially introduced another scheduling layer).
>> >>> Syslets didn't go anywhere b/c they were insane (returned to the
>> >>> user-space process with a different PID, among other things!).
>> >>>
>> >>> If you do go forward in the meantime, you can likely use EIOCBRETRY
>> >>> instead of EAGAIN.
>> >>>
>> >>> I hope that helps!
>> >>
>> >> FWIW, I updated the buffered AIO retry patches some time after Suparna
>> >> dropped them. By the date stamp in my branch, they are now 23 months
>> >> old... Anyway, at least it's more recent; you can find them here:
>> >>
>> >> http://git.kernel.dk/?p=linux-block.git;a=shortlog;h=refs/heads/aio-buffered
>> >>
>> >> --
>> >> Jens Axboe
>> >>
>> >>
>> > Thanks! Do you know why the patches weren't merged? I can't find much
>> > discussion about them.
>>
>> Not quite sure, and after working on them and fixing things up, I don't
>> even think they are that complex or intrusive (which I think otherwise
>> would've been the main objection). Andrew may know/remember.
>
> Boy, that was a long time ago.  I was always unhappy with the patches
> because of the amount of additional code/complexity they added.
>
> Then the great syslets/threadlets design session happened and it was
> expected that such a facility would make special async handling for AIO
> unnecessary.  Then syslets/threadlets didn't happen.

Do you think we could accomplish the goals with less additional
code/complexity? It looks like the latest version of the patch set
wasn't so invasive.

If syslets/threadlets aren't happening, should these patches be
reconsidered for inclusion in the kernel?

Thanks,
Dan


* Re: Approaches to making io_submit not block
  2011-08-30 22:45             ` Daniel Ehrenberg
@ 2011-08-30 22:54               ` Andrew Morton
  2011-08-30 23:03                 ` Jeremy Allison
                                   ` (2 more replies)
  0 siblings, 3 replies; 41+ messages in thread
From: Andrew Morton @ 2011-08-30 22:54 UTC (permalink / raw)
  To: Daniel Ehrenberg; +Cc: Jens Axboe, Jeff Moyer, linux-kernel, linux-aio

On Tue, 30 Aug 2011 15:45:35 -0700
Daniel Ehrenberg <dehrenberg@google.com> wrote:

> >> Not quite sure, and after working on them and fixing things up, I don't
> >> even think they are that complex or intrusive (which I think otherwise
> >> would've been the main objection). Andrew may know/remember.
> >
> > Boy, that was a long time ago.  I was always unhappy with the patches
> > because of the amount of additional code/complexity they added.
> >
> > Then the great syslets/threadlets design session happened and it was
> > expected that such a facility would make special async handling for AIO
> > unnecessary.  Then syslets/threadlets didn't happen.
> 
> Do you think we could accomplish the goals with less additional
> code/complexity? It looks like the latest version of the patch set
> wasn't so invasive.
> 
> If syslets/threadlets aren't happening, should these patches be
> reconsidered for inclusion in the kernel?

I haven't seen any demand at all for the feature in many years.  That
doesn't mean that there _isn't_ any demand - perhaps everyone got
exhausted.

If there is demand then that should be described and circulated, see
how much interest there is in resurrecting the effort.

And, of course, the patches should be dragged out and looked at - it's
been a number of years now.

Also, glibc has userspace for POSIX AIO.  A successful kernel-based
implementation would result in glibc migrating away from its current
implementation.  So we should work with the glibc developers on ensuring
that the migration can happen.



* Re: Approaches to making io_submit not block
  2011-08-30 22:54               ` Andrew Morton
@ 2011-08-30 23:03                 ` Jeremy Allison
  2011-08-30 23:11                   ` Andrew Morton
  2011-08-31  5:34                   ` Christoph Hellwig
  2011-08-31  6:04                 ` guy keren
  2011-08-31 15:45                 ` Gleb Natapov
  2 siblings, 2 replies; 41+ messages in thread
From: Jeremy Allison @ 2011-08-30 23:03 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Daniel Ehrenberg, Jens Axboe, Jeff Moyer, linux-kernel, linux-aio

On Tue, Aug 30, 2011 at 03:54:38PM -0700, Andrew Morton wrote:
> 
> Also, glibc has userspace for POSIX AIO.  A successful kernel-based
> implementation would result in glibc migrating away from its current
> implementation.  So we should work with the glibc developers on ensuring
> that the migration can happen.

Unfortunately the glibc userspace POSIX AIO limits asynchronicity to
one outstanding request per file descriptor. From aio_misc.c in glibc:

  if (runp != NULL
      && runp->aiocbp->aiocb.aio_fildes == aiocbp->aiocb.aio_fildes)
    {
      /* The current file descriptor is worked on.  It makes no sense
         to start another thread since this new thread would fight
         with the running thread for the resources.  But we also cannot
         say that the thread processing this desriptor shall immediately
         after finishing the current job process this request if there
         are other threads in the running queue which have a higher
         priority.  */

      /* Simply enqueue it after the running one according to the
         priority.  */

I have often wondered whether this is actually the case. I created
my own glibc with a patched AIO that removed this restriction
(thus allowing multiple outstanding threads on a single fd). In testing
I saw a dramatic increase in performance (a 2x speedup), but when I
tested it in actual code (Samba smbd) it made client throughput
*worse*. I never got to the bottom of this and so never submitted my
fixes to glibc.

Any ideas whether this is still the case? Or comments on why glibc
insists on only one outstanding request per fd? Is this really
needed for kernel performance?
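
(For reference, the pattern being serialized is nothing exotic, just
two plain POSIX AIO reads against one fd, roughly like this untested
sketch:)

/*
 * Illustrative only: two aio_read()s against the same fd.  With the
 * glibc behaviour quoted above, the second request is queued behind
 * the first instead of getting its own helper thread.  Link with -lrt.
 */
#include <aio.h>
#include <string.h>

static char buf1[65536], buf2[65536];

static int submit_two(int fd)
{
	struct aiocb a, b;
	const struct aiocb *list[2] = { &a, &b };

	memset(&a, 0, sizeof(a));
	memset(&b, 0, sizeof(b));
	a.aio_fildes = fd;  a.aio_buf = buf1;  a.aio_nbytes = sizeof(buf1);
	b.aio_fildes = fd;  b.aio_buf = buf2;  b.aio_nbytes = sizeof(buf2);
	b.aio_offset = sizeof(buf1);

	if (aio_read(&a) || aio_read(&b))
		return -1;
	return aio_suspend(list, 2, NULL);	/* wait for a completion */
}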

Jeremy.


* Re: Approaches to making io_submit not block
  2011-08-30 23:03                 ` Jeremy Allison
@ 2011-08-30 23:11                   ` Andrew Morton
  2011-08-31 11:04                     ` Ulrich Drepper
  2011-08-31  5:34                   ` Christoph Hellwig
  1 sibling, 1 reply; 41+ messages in thread
From: Andrew Morton @ 2011-08-30 23:11 UTC (permalink / raw)
  To: Jeremy Allison
  Cc: Daniel Ehrenberg, Jens Axboe, Jeff Moyer, linux-kernel,
	linux-aio, Ulrich Drepper

On Tue, 30 Aug 2011 16:03:42 -0700
Jeremy Allison <jra@samba.org> wrote:

> On Tue, Aug 30, 2011 at 03:54:38PM -0700, Andrew Morton wrote:
> > 
> > Also, glibc has userspace for POSIX AIO.  A successful kernel-based
> > implementation would result in glibc migrating away from its current
> > implementation.  So we should work with the glibc developers on ensuring
> > that the migration can happen.
> 
> Unfortunately the glibc userspace POSIX AIO limits asynchronicity to
> one outstanding request per file descriptor. From aio_misc.c in glibc:
> 
>   if (runp != NULL
>       && runp->aiocbp->aiocb.aio_fildes == aiocbp->aiocb.aio_fildes)
>     {
>       /* The current file descriptor is worked on.  It makes no sense
>          to start another thread since this new thread would fight
>          with the running thread for the resources.  But we also cannot
>          say that the thread processing this desriptor shall immediately
>          after finishing the current job process this request if there
>          are other threads in the running queue which have a higher
>          priority.  */
> 
>       /* Simply enqueue it after the running one according to the
>          priority.  */
> 
> I have often wondered whether this is actually the case. I created
> my own glibc with a patched AIO that removed this restriction
> (thus allowing multiple outstanding threads on a single fd). In testing
> I saw a dramatic increase in performance (a 2x speedup), but when I
> tested it in actual code (Samba smbd) it made client throughput
> *worse*. I never got to the bottom of this and so never submitted my
> fixes to glibc.
> 
> Any ideas whether this is still the case? Or comments on why glibc
> insists on only one outstanding request per fd? Is this really
> needed for kernel performance?
> 

I don't know.  Uli cc'ed.


* Re: Approaches to making io_submit not block
  2011-08-30 21:51   ` Daniel Ehrenberg
@ 2011-08-31  5:26     ` Christoph Hellwig
  2011-08-31 17:08       ` Andi Kleen
  2011-09-01  3:39       ` Dave Chinner
  0 siblings, 2 replies; 41+ messages in thread
From: Christoph Hellwig @ 2011-08-31  5:26 UTC (permalink / raw)
  To: Daniel Ehrenberg; +Cc: Christoph Hellwig, linux-kernel

On Tue, Aug 30, 2011 at 02:51:01PM -0700, Daniel Ehrenberg wrote:
> > Let filesystems handle this.  I've actually prototyped it in XFS,
> > based on some pending work from Dave, but at this point it's still butt
> > ugly.
> 
> Great, would you be willing to let me see the draft code?
> 
> Are you sure there wouldn't be any benefit to having the code at the
> aio/dio level, in terms of making things easier for file systems and
> reducing code duplication?

I'll get it polished up and send it out for RFC once Dave sends out
the updated allocation workqueue patch.  With this he moves all
allocator calls in XFS into a workqueue.  My direct I/O patch uses that
fact to run the allocator call on that workqueue and lets the existing
aio retry infrastructure retry the direct I/O operation once that
workqueue has finished.
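
Very roughly, the shape of it (heavily simplified, with invented
helpers; this is not the actual patch):

/*
 * Sketch, not the actual patch: xfs_allocation_ready() and
 * xfs_queue_allocation() are invented names.  Punt the allocation to
 * the allocator workqueue and report -EIOCBRETRY so the aio retry
 * machinery re-drives the kiocb once the work item completes.
 */
static int xfs_get_blocks_async(struct inode *inode, sector_t iblock,
				struct buffer_head *bh_result, int create)
{
	if (create && !xfs_allocation_ready(inode, iblock)) {
		xfs_queue_allocation(inode, iblock);
		return -EIOCBRETRY;
	}
	return xfs_get_blocks(inode, iblock, bh_result, create);
}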

> > No way.  I've fixed this for XFS, and it's trivial without the need to
> > queue them up.  The only thing preventing appending writes from working
> > is a flag to tell the dio layer to just do them, just like it already
> > works for holes (and more QA).
> >
> >
> Are you saying this is already fixed for XFS? Appends don't block,
> only reads to metadata do?

Internally to XFS, yes; it's just that the generic direct I/O code
doesn't let us do it yet.  Remember that XFS only updates the on-disk
i_size after I/O completion.



* Re: Approaches to making io_submit not block
  2011-08-30 23:03                 ` Jeremy Allison
  2011-08-30 23:11                   ` Andrew Morton
@ 2011-08-31  5:34                   ` Christoph Hellwig
  1 sibling, 0 replies; 41+ messages in thread
From: Christoph Hellwig @ 2011-08-31  5:34 UTC (permalink / raw)
  To: Jeremy Allison
  Cc: Andrew Morton, Daniel Ehrenberg, Jens Axboe, Jeff Moyer,
	linux-kernel, linux-aio

On Tue, Aug 30, 2011 at 04:03:42PM -0700, Jeremy Allison wrote:
> I have often wondered whether this is actually the case. I created
> my own glibc with a patched AIO that removed this restriction
> (thus allowing multiple outstanding threads on a single fd). In testing
> I saw a dramatic increase in performance (a 2x speedup), but when I
> tested it in actual code (Samba smbd) it made client throughput
> *worse*. I never got to the bottom of this and so never submitted my
> fixes to glibc.
> 
> Any ideas whether this is still the case? Or comments on why glibc
> insists on only one outstanding request per fd? Is this really
> needed for kernel performance?

At least for writes you'll simply have multiple requests blocking on
i_mutex.



* Re: Approaches to making io_submit not block
  2011-08-30 22:54               ` Andrew Morton
  2011-08-30 23:03                 ` Jeremy Allison
@ 2011-08-31  6:04                 ` guy keren
  2011-08-31 23:16                   ` Daniel Ehrenberg
  2011-08-31 15:45                 ` Gleb Natapov
  2 siblings, 1 reply; 41+ messages in thread
From: guy keren @ 2011-08-31  6:04 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Daniel Ehrenberg, Jens Axboe, Jeff Moyer, linux-kernel, linux-aio

On Tue, 2011-08-30 at 15:54 -0700, Andrew Morton wrote:
> On Tue, 30 Aug 2011 15:45:35 -0700
> Daniel Ehrenberg <dehrenberg@google.com> wrote:
> 
> > >> Not quite sure, and after working on them and fixing things up, I don't
> > >> even think they are that complex or intrusive (which I think otherwise
> > >> would've been the main objection). Andrew may know/remember.
> > >
> > > Boy, that was a long time ago.  I was always unhappy with the patches
> > > because of the amount of additional code/complexity they added.
> > >
> > > Then the great syslets/threadlets design session happened and it was
> > > expected that such a facility would make special async handling for AIO
> > > unnecessary.  Then syslets/threadlets didn't happen.
> > 
> > Do you think we could accomplish the goals with less additional
> > code/complexity? It looks like the latest version of the patch set
> > wasn't so invasive.
> > 
> > If syslets/threadlets aren't happening, should these patches be
> > reconsidered for inclusion in the kernel?
> 
> I haven't seen any demand at all for the feature in many years.  That
> doesn't mean that there _isn't_ any demand - perhaps everyone got
> exhausted.

You should consider the emerging enterprise-grade SSD devices - which
can serve several tens of thousands of I/O requests per device
(actually, per controller). These devices could be better utilized by
better interfaces. Furthermore, in our company we had to resort to
using Windows for IOPS benchmarking (using iometer) against storage
systems built from these (and similar) devices, because it manages to
generate higher IOPS than Linux can (I don't remember the exact
numbers, but we are talking on the order of several hundred thousand
IOPS).

It could be that we are currently an esoteric use case - but the
high-end performance market seems to be heading in that direction.

> If there is demand then that should be described and circulated, see
> how much interest there is in resurrecting the effort.
> 
> And, of course, the patches should be dragged out and looked at - it's
> been a number of years now.
> 
> Also, glibc has userspace for POSIX AIO.  A successful kernel-based
> implementation would result in glibc migrating away from its current
> implementation.  So we should work with the glibc developers on ensuring
> that the migration can happen.

glibc's userspace implementation doesn't scale to fast devices. It could
make sense when working with slower disk devices - not when you're
working with solid-state storage devices.

--guy



* Re: Approaches to making io_submit not block
  2011-08-30 23:11                   ` Andrew Morton
@ 2011-08-31 11:04                     ` Ulrich Drepper
  2011-08-31 16:59                       ` Jeremy Allison
  0 siblings, 1 reply; 41+ messages in thread
From: Ulrich Drepper @ 2011-08-31 11:04 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Jeremy Allison, Daniel Ehrenberg, Jens Axboe, Jeff Moyer,
	linux-kernel, linux-aio

On 08/30/2011 07:11 PM, Andrew Morton wrote:
> I don't know.  Uli cc'ed.

glibc has to create the parallelism of the operations through threads.
More threads mean more overhead.  There are insane people out there who
wrote code which pushed the number of helper threads into the hundreds
(at that time a high number, today this would be in the thousands).
Anyway, any scheme which is worth changing the code for would be in the
kernel.  Only the kernel knows which requests can actually be handled
concurrently.




* Re: Approaches to making io_submit not block
  2011-08-30 22:54               ` Andrew Morton
  2011-08-30 23:03                 ` Jeremy Allison
  2011-08-31  6:04                 ` guy keren
@ 2011-08-31 15:45                 ` Gleb Natapov
  2011-08-31 16:02                   ` Avi Kivity
  2 siblings, 1 reply; 41+ messages in thread
From: Gleb Natapov @ 2011-08-31 15:45 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Daniel Ehrenberg, Jens Axboe, Jeff Moyer, linux-kernel, linux-aio

On Tue, Aug 30, 2011 at 03:54:38PM -0700, Andrew Morton wrote:
> On Tue, 30 Aug 2011 15:45:35 -0700
> Daniel Ehrenberg <dehrenberg@google.com> wrote:
> 
> > >> Not quite sure, and after working on them and fixing things up, I don't
> > >> even think they are that complex or intrusive (which I think otherwise
> > >> would've been the main objection). Andrew may know/remember.
> > >
> > > Boy, that was a long time ago.  I was always unhappy with the patches
> > > because of the amount of additional code/complexity they added.
> > >
> > > Then the great syslets/threadlets design session happened and it was
> > > expected that such a facility would make special async handling for AIO
> > > unnecessary.  Then syslets/threadlets didn't happen.
> > 
> > Do you think we could accomplish the goals with less additional
> > code/complexity? It looks like the latest version of the patch set
> > wasn't so invasive.
> > 
> > If syslets/threadlets aren't happening, should these patches be
> > reconsidered for inclusion in the kernel?
> 
> I haven't seen any demand at all for the feature in many years.  That
> doesn't mean that there _isn't_ any demand - perhaps everyone got
> exhausted.
> 
> If there is demand then that should be described and circulated, see
> how much interest there is in resurrecting the effort.
> 
KVM also has similar needs. KVM has an x86 emulator in the kernel which
is, in fact, a state machine that sometimes needs input from userspace
to proceed.  Currently, when userspace input is needed, KVM goes back
to userspace to retrieve the input and then retries the emulation. Some
instructions may require several such iterations. This is somewhat
similar to aio, except that in the KVM case emulation waits for
userspace instead of disk/network HW. The resulting code is complex and
error prone. It would be nice not to have to unwind the stack from the
middle of the emulator just to be able to exit to userspace to retrieve
the value. One idea that came up was to execute the emulator on a
separate kernel stack (within the same task). When the emulator needs a
value from userspace it sleeps while the main stack goes to userspace
to get the value. When the value is available the main stack wakes up
the emulator stack and emulation continues from the place it was
stopped. Cooperative multithreading inside the kernel, if you want.
Below is the patch I prototyped to implement that on x86_64. I made the
KVM x86 emulator use it too. I think AIO can use the same technique:
io_submit will execute the IO on an alternative stack. If it blocks,
the main thread will continue to run. When the IO is completed the IO
stack will resume (the alternative stack has priority over the main
stack).


diff --git a/arch/x86/include/asm/processor.h b/arch/x86/include/asm/processor.h
index 0d1171c..4d85ec8 100644
--- a/arch/x86/include/asm/processor.h
+++ b/arch/x86/include/asm/processor.h
@@ -472,6 +472,14 @@ struct thread_struct {
 	unsigned		io_bitmap_max;
 };
 
+struct arch_stack_struct {
+	unsigned long sp;
+};
+
+struct stack_struct;
+
+extern void arch_alt_stack_setup(struct stack_struct *stack);
+
 static inline unsigned long native_get_debugreg(int regno)
 {
 	unsigned long val = 0;	/* Damn you, gcc! */
diff --git a/arch/x86/include/asm/system.h b/arch/x86/include/asm/system.h
index c2ff2a1..ade6756 100644
--- a/arch/x86/include/asm/system.h
+++ b/arch/x86/include/asm/system.h
@@ -18,12 +18,15 @@
 #endif
 
 struct task_struct; /* one of the stranger aspects of C forward declarations */
+struct stack_struct;
 struct task_struct *__switch_to(struct task_struct *prev,
 				struct task_struct *next);
 struct tss_struct;
 void __switch_to_xtra(struct task_struct *prev_p, struct task_struct *next_p,
 		      struct tss_struct *tss);
 extern void show_regs_common(void);
+extern struct stack_struct* switch_alt_stack(struct stack_struct *,
+				   	     struct stack_struct *);
 
 #ifdef CONFIG_X86_32
 
diff --git a/arch/x86/kernel/process_64.c b/arch/x86/kernel/process_64.c
index f693e44..2d17e0d 100644
--- a/arch/x86/kernel/process_64.c
+++ b/arch/x86/kernel/process_64.c
@@ -658,3 +658,52 @@ unsigned long KSTK_ESP(struct task_struct *task)
 	return (test_tsk_thread_flag(task, TIF_IA32)) ?
 			(task_pt_regs(task)->sp) : ((task)->thread.usersp);
 }
+
+struct stack_struct* switch_alt_stack(struct stack_struct *prev,
+		struct stack_struct *next)
+{
+	struct stack_struct* last;
+
+	next->ti->flags = prev->ti->flags;
+	next->ti->status = prev->ti->status;
+	next->ti->cpu = prev->ti->cpu;
+	next->ti->preempt_count =  prev->ti->preempt_count;
+
+	prev->state = current->state;
+	current->thread.sp = next->arch.sp;
+	current->stack = next->ti;
+	current->state = next->state;
+	current->current_stack = next;
+	percpu_write(kernel_stack,
+			(unsigned long)task_stack_page(current) +
+			THREAD_SIZE - KERNEL_STACK_OFFSET);
+
+	/* ->flags can be updated by other CPUs during the switch */
+	atomic_set_mask(prev->ti->flags, &next->ti->flags);
+
+	/* switch stack */
+	asm volatile("pushq %%rbp\n\t"
+		     "movq %%rsp, %P[sp](%[prev])\n\t"
+		     "movq %P[sp](%[next]),%%rsp\n\t"
+		     "cmpl %[stack_start], %P[stack_state](%[next])\n\t"
+		     "jne 1f\n\t"
+		     "jmp start_alt_stack\n\t"
+		     "1:\n\t"
+		     "popq %%rbp\n\t"
+		     "movq %[prev], %[last]\n\t"
+		     : [last] "=a" (last)
+		     : [prev] "S" (prev), [next] "D" (next),
+		     [sp] "i" (offsetof(struct stack_struct, arch.sp)),
+		     [stack_start] "i" (STACK_START),
+		     [stack_state] "i" (offsetof(struct stack_struct, stack_state)) :
+		     "memory", "cc", "rbx", "rcx", "rdx",
+		     "r8", "r9", "r10", "r11", "r12", "r13", "r14", "r15");
+
+	return last;
+}
+
+void arch_alt_stack_setup(struct stack_struct *stack)
+{
+	stack->arch.sp = ((unsigned long)stack->ti + THREAD_SIZE);
+}
+
diff --git a/include/linux/init_task.h b/include/linux/init_task.h
index d14e058..fe6964b 100644
--- a/include/linux/init_task.h
+++ b/include/linux/init_task.h
@@ -164,6 +164,10 @@ extern struct cred init_cred;
 	RCU_INIT_POINTER(.cred, &init_cred),				\
 	.comm		= "swapper",					\
 	.thread		= INIT_THREAD,					\
+	.main_stack.ti	= &init_thread_info,				\
+	.main_stack.stack_state = STACK_LIVE,				\
+	.current_stack	= &tsk.main_stack,				\
+	.stacks		= LIST_HEAD_INIT(tsk.stacks),			\
 	.fs		= &init_fs,					\
 	.files		= &init_files,					\
 	.signal		= &init_signals,				\
diff --git a/include/linux/sched.h b/include/linux/sched.h
index 4ac2c05..551fefe 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1217,6 +1217,22 @@ enum perf_event_task_context {
 	perf_nr_task_contexts,
 };
 
+struct stack_struct {
+	struct list_head next;
+	void (*fn)(void*);
+	void *arg;
+	volatile long state;
+	struct thread_info *ti;
+	u32 flags;
+	enum {STACK_LIVE, STACK_START, STACK_DEAD} stack_state;
+	struct arch_stack_struct arch;
+};
+
+#define SSF_AUTODELETE		(1<<0)
+#define SSF_RESTORE_STATE	(1<<1)
+#define SSF_START_WAITED	(1<<2)
+#define SSF_FORCE_SWITCH	(1<<3)
+
 struct task_struct {
 	volatile long state;	/* -1 unrunnable, 0 runnable, >0 stopped */
 	void *stack;
@@ -1380,6 +1396,10 @@ struct task_struct {
 #endif
 /* CPU-specific state of this task */
 	struct thread_struct thread;
+	struct list_head stacks;
+	struct stack_struct main_stack;
+	struct stack_struct *current_stack;
+
 /* filesystem information */
 	struct fs_struct *fs;
 /* open file information */
@@ -2714,4 +2734,14 @@ static inline unsigned long rlimit_max(unsigned int limit)
 
 #endif /* __KERNEL__ */
 
+/* alt stacks */
+extern int init_alt_stack(struct stack_struct *s, void (*fn)(void*), void *arg,
+		bool  autodelete);
+extern void deinit_alt_stack(struct stack_struct *stack);
+extern void launch_alt_stack(struct stack_struct *stack);
+extern int run_on_alt_stack(void (*fn)(void*), void *arg);
+extern void exit_alt_stack(void);
+extern void schedule_alt_stack_tail(struct task_struct *p);
+extern void wait_alt_stacks(struct task_struct *tsk);
+extern NORET_TYPE void start_alt_stack(struct stack_struct *stack);
 #endif
diff --git a/kernel/exit.c b/kernel/exit.c
index 2913b35..90ea9eb 100644
--- a/kernel/exit.c
+++ b/kernel/exit.c
@@ -941,6 +941,7 @@ NORET_TYPE void do_exit(long code)
 	exit_irq_thread();
 
 	exit_signals(tsk);  /* sets PF_EXITING */
+	wait_alt_stacks(tsk);
 	/*
 	 * tsk->flags are checked in the futex code to protect against
 	 * an exiting task cleaning up the robust pi futexes.
diff --git a/kernel/fork.c b/kernel/fork.c
index 8e6b6f4..34f28b8 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -274,6 +274,12 @@ static struct task_struct *dup_task_struct(struct task_struct *orig)
 
 	tsk->stack = ti;
 
+	tsk->main_stack.ti = ti;
+	tsk->main_stack.stack_state = STACK_LIVE;
+	tsk->current_stack = &tsk->main_stack;
+	INIT_LIST_HEAD(&tsk->stacks);
+	list_add(&tsk->main_stack.next, &tsk->stacks);
+
 	err = prop_local_init_single(&tsk->dirties);
 	if (err)
 		goto out;
diff --git a/kernel/sched.c b/kernel/sched.c
index ccacdbd..d69579c 100644
--- a/kernel/sched.c
+++ b/kernel/sched.c
@@ -2667,6 +2667,57 @@ static void ttwu_queue(struct task_struct *p, int cpu)
 	raw_spin_unlock(&rq->lock);
 }
 
+static bool wake_up_alt_stacks(struct task_struct *p, unsigned int state)
+{
+	struct list_head *e;
+	bool found = false, unlock = false;
+	struct rq *rq;
+
+	if (p->on_rq) {
+		/*
+		 * rq lock protects against race with walking the stacks
+		 * list in schedule()
+		 */
+		rq = __task_rq_lock(p);
+		if (!p->on_rq)
+			__task_rq_unlock(rq);
+		else
+			unlock = true;
+	}
+
+	list_for_each(e, &p->stacks) {
+		struct stack_struct *ss =
+			list_entry(e, struct stack_struct, next);
+
+		if (p->current_stack == ss || !(ss->state & state))
+			continue;
+
+		ss->state = TASK_RUNNING;
+		found = true;
+	}
+
+	if (p->state == TASK_RUNNING) {
+		if (found && p->current_stack == &p->main_stack) {
+			p->current_stack->flags |= SSF_FORCE_SWITCH;
+			set_tsk_need_resched(p);
+			kick_process(p);
+		}
+		found = false;
+	} else if (!found)
+		found = (p->state & state);
+	else if (!(p->state & state)) {
+		/* need to switch to waked up stack */
+		p->current_stack->flags |= SSF_RESTORE_STATE;
+		p->current_stack->state = p->state;
+		set_tsk_need_resched(p);
+	}
+
+	if (unlock)
+		__task_rq_unlock(rq);
+
+	return found;
+}
+
 /**
  * try_to_wake_up - wake up a thread
  * @p: the thread to be awakened
@@ -2690,7 +2741,8 @@ try_to_wake_up(struct task_struct *p, unsigned int state, int wake_flags)
 
 	smp_wmb();
 	raw_spin_lock_irqsave(&p->pi_lock, flags);
-	if (!(p->state & state))
+
+	if (!wake_up_alt_stacks(p, state))
 		goto out;
 
 	success = 1; /* we're going to change ->state */
@@ -4278,6 +4330,56 @@ pick_next_task(struct rq *rq)
 	BUG(); /* the idle class will always have a runnable task */
 }
 
+static noinline int try_alt_stack(struct task_struct *p)
+{
+	struct list_head *e;
+	bool found = false;
+	struct stack_struct *next = &p->main_stack, *prev;
+
+	p->current_stack->flags &= ~SSF_FORCE_SWITCH;
+
+	list_for_each(e, &p->stacks) {
+		next = list_entry(e, struct stack_struct, next);
+
+		if (p->current_stack == next || next->state)
+			continue;
+
+		found = true;
+		break;
+	}
+
+	/*
+	 * If current task is dead and all other stacks are sleeping
+	 * then switch to the main stack
+	 */
+	if (!found) {
+	       if (p->state == TASK_DEAD)
+		       next = &p->main_stack;
+	       else
+		       return 0;
+	}
+
+	if (next == p->current_stack)
+		return 0;
+
+	prev = switch_alt_stack(p->current_stack, next);
+
+	if (prev->state == TASK_DEAD) {
+		list_del(&prev->next);
+		if (prev->flags & SSF_AUTODELETE) {
+			deinit_alt_stack(prev);
+			kfree(prev);
+		} else
+			prev->stack_state = STACK_DEAD;
+		put_task_struct(p);
+		/* check if main stack is waiting for alt stack in exit */
+		if ((p->flags & PF_EXITING) && list_is_singular(&p->stacks))
+			p->state = TASK_RUNNING;
+	}
+
+	return 1;
+}
+
 /*
  * schedule() is the main scheduler function.
  */
@@ -4303,10 +4405,19 @@ need_resched:
 	raw_spin_lock_irq(&rq->lock);
 
 	switch_count = &prev->nivcsw;
-	if (prev->state && !(preempt_count() & PREEMPT_ACTIVE)) {
+	if ((prev->state || (prev->current_stack->flags & SSF_FORCE_SWITCH))
+			&& !(preempt_count() & PREEMPT_ACTIVE)) {
 		if (unlikely(signal_pending_state(prev->state, prev))) {
 			prev->state = TASK_RUNNING;
-		} else {
+		} else if (!list_is_singular(&prev->stacks)) {
+			if (try_alt_stack(prev)) {
+				cpu = smp_processor_id();
+				rq = cpu_rq(cpu);
+				prev = rq->curr;
+			}
+		}
+		
+		if (prev->state) {
 			deactivate_task(rq, prev, DEQUEUE_SLEEP);
 			prev->on_rq = 0;
 
@@ -4366,8 +4477,13 @@ need_resched:
 	post_schedule(rq);
 
 	preempt_enable_no_resched();
-	if (need_resched())
+	if (need_resched()) {
+		if (current->current_stack->flags & SSF_RESTORE_STATE) {
+			current->current_stack->flags &= ~SSF_RESTORE_STATE;
+			current->state = current->current_stack->state;
+		}
 		goto need_resched;
+	}
 }
 EXPORT_SYMBOL(schedule);
 
@@ -8204,6 +8320,8 @@ void __init sched_init(void)
 		zalloc_cpumask_var(&cpu_isolated_map, GFP_NOWAIT);
 #endif /* SMP */
 
+	list_add(&init_task.main_stack.next, &init_task.stacks);
+
 	scheduler_running = 1;
 }
 
@@ -9358,3 +9476,127 @@ struct cgroup_subsys cpuacct_subsys = {
 };
 #endif	/* CONFIG_CGROUP_CPUACCT */
 
+int init_alt_stack(struct stack_struct *stack, void (*fn)(void*), void *arg,
+		bool autodelete)
+{
+	stack->ti = alloc_thread_info_node(current, numa_node_id());
+
+	if (!stack->ti)
+		return -ENOMEM;
+
+	*(unsigned long *)(stack->ti + 1) = STACK_END_MAGIC;
+
+	stack->fn = fn;
+	stack->arg = arg;
+	stack->stack_state = STACK_DEAD;
+	stack->flags = autodelete ? SSF_AUTODELETE : 0;
+
+	return 0;
+}
+EXPORT_SYMBOL_GPL(init_alt_stack);
+
+void launch_alt_stack(struct stack_struct *stack)
+{
+	unsigned long flags;
+
+	BUG_ON(stack->stack_state != STACK_DEAD);
+
+	*stack->ti = *task_thread_info(current);
+	arch_alt_stack_setup(stack);
+
+	stack->state = TASK_RUNNING;
+	stack->stack_state = STACK_START;
+	/* pi_lock synchronize with ttwu */
+	raw_spin_lock_irqsave(&current->pi_lock, flags);
+	list_add(&stack->next, &current->stacks);
+	raw_spin_unlock_irqrestore(&current->pi_lock, flags);
+	get_task_struct(current);
+	if (current->current_stack == &current->main_stack) {
+		/* force switching to new stack */
+		stack->flags |= SSF_START_WAITED;
+		while (stack->stack_state == STACK_START) {
+			current->state = TASK_UNINTERRUPTIBLE;
+			schedule();
+		}
+	}
+}
+EXPORT_SYMBOL_GPL(launch_alt_stack);
+
+int run_on_alt_stack(void (*fn)(void*), void *arg)
+{
+	int r;
+	struct stack_struct *stack = kmalloc(sizeof(*stack), GFP_KERNEL);
+
+	if (!stack)
+		return -ENOMEM;
+
+	r = init_alt_stack(stack, fn, arg, true);
+
+	if (r)
+		kfree(stack);
+	else
+		launch_alt_stack(stack);
+
+	return r;
+}
+EXPORT_SYMBOL_GPL(run_on_alt_stack);
+
+void deinit_alt_stack(struct stack_struct *stack)
+{
+	free_pages((unsigned long)stack->ti, get_order(THREAD_SIZE));
+}
+EXPORT_SYMBOL_GPL(deinit_alt_stack);
+
+NORET_TYPE void exit_alt_stack(void)
+{
+	if (current->current_stack != &current->main_stack) {
+		current->state = TASK_DEAD;
+		schedule();
+	}
+	BUG();
+	/* Avoid "noreturn function does return".  */
+	for (;;)
+		cpu_relax();    /* For when BUG is null */
+}
+EXPORT_SYMBOL_GPL(exit_alt_stack);
+
+void schedule_alt_stack_tail(struct task_struct *p)
+	__releases(rq->lock)
+{
+	raw_spin_unlock_irq(&this_rq()->lock);
+	preempt_enable();
+}
+
+NORET_TYPE void start_alt_stack(struct stack_struct *stack)
+{
+	stack->stack_state = STACK_LIVE;
+	if (stack->flags & SSF_START_WAITED) {
+		current->main_stack.state = TASK_RUNNING;
+		stack->flags &= ~SSF_START_WAITED;
+	}
+	schedule_alt_stack_tail(current);
+	stack->fn(stack->arg);
+	exit_alt_stack();
+	BUG();
+}
+
+void wait_alt_stacks(struct task_struct *tsk)
+{
+	if (current->current_stack != &current->main_stack) {
+		struct list_head *e;
+		printk(KERN_ALERT"Exit is called on alt stack. Reboot is needed\n");
+		set_current_state(TASK_UNINTERRUPTIBLE);
+		list_for_each(e, &tsk->stacks) {
+			struct stack_struct *ss =
+				list_entry(e, struct stack_struct, next);
+			if (tsk->current_stack != ss)
+				ss->state = TASK_UNINTERRUPTIBLE;
+			schedule();
+		}
+	}
+
+	while(!list_is_singular(&tsk->stacks)) {
+		set_current_state(TASK_UNINTERRUPTIBLE);
+		schedule();
+	}
+}
--
			Gleb.


* Re: Approaches to making io_submit not block
  2011-08-31 15:45                 ` Gleb Natapov
@ 2011-08-31 16:02                   ` Avi Kivity
  0 siblings, 0 replies; 41+ messages in thread
From: Avi Kivity @ 2011-08-31 16:02 UTC (permalink / raw)
  To: Gleb Natapov
  Cc: Andrew Morton, Daniel Ehrenberg, Jens Axboe, Jeff Moyer,
	linux-kernel, linux-aio

On 08/31/2011 06:45 PM, Gleb Natapov wrote:
> KVM also has similar needs. KVM has an x86 emulator in the kernel
> which is, in fact, a state machine that sometimes needs input from
> userspace to proceed.  Currently, when userspace input is needed, KVM
> goes back to userspace to retrieve the input and then retries the
> emulation. Some instructions may require several such iterations. This
> is somewhat similar to aio, except that in the KVM case emulation
> waits for userspace instead of disk/network HW. The resulting code is
> complex and error prone. It would be nice not to have to unwind the
> stack from the middle of the emulator just to be able to exit to
> userspace to retrieve the value. One idea that came up was to execute
> the emulator on a separate kernel stack (within the same task). When
> the emulator needs a value from userspace it sleeps while the main
> stack goes to userspace to get the value. When the value is available
> the main stack wakes up the emulator stack and emulation continues
> from the place it was stopped. Cooperative multithreading inside the
> kernel, if you want. Below is the patch I prototyped to implement that
> on x86_64. I made the KVM x86 emulator use it too. I think AIO can use
> the same technique: io_submit will execute the IO on an alternative
> stack. If it blocks, the main thread will continue to run. When the IO
> is completed the IO stack will resume (the alternative stack has
> priority over the main stack).
>

Note that kvm has a significant interest in linux-aio as well - we see a
significant performance win when we can use it.  From my point of view,
extending linux-aio to be truly asynchronous in all cases is the bigger
win here; the emulator issue is a nice code cleanup, but we could live
without it.

-- 
error compiling committee.c: too many arguments to function



* Re: Approaches to making io_submit not block
  2011-08-31 11:04                     ` Ulrich Drepper
@ 2011-08-31 16:59                       ` Jeremy Allison
  2011-09-01 11:14                         ` Ulrich Drepper
  0 siblings, 1 reply; 41+ messages in thread
From: Jeremy Allison @ 2011-08-31 16:59 UTC (permalink / raw)
  To: Ulrich Drepper
  Cc: Andrew Morton, Jeremy Allison, Daniel Ehrenberg, Jens Axboe,
	Jeff Moyer, linux-kernel, linux-aio

On Wed, Aug 31, 2011 at 07:04:10AM -0400, Ulrich Drepper wrote:
> On 08/30/2011 07:11 PM, Andrew Morton wrote:
> > I don't know.  Uli cc'ed.
> 
> glibc has to create the parallelism of the operations through threads.
> More threads mean more overhead.  There are insane people out there who
> wrote code which pushed the number of helper threads into the hundreds
> (at that time a high number, today this would be in the thousands).
> Anyway, any scheme which is worth changing the code for would be in the
> kernel.  Only the kernel knows which requests can actually be handled
> concurrently.

I get that, but isn't that what the aio_init(const struct aioinit *init)
call is meant to solve?

After all:

struct aioinit
  {
    int aio_threads;            /* Maximal number of threads.  */
    int aio_num;                /* Number of expected simultanious requests. */
    int aio_locks;              /* Not used.  */
    int aio_usedba;             /* Not used.  */
    int aio_debug;              /* Not used.  */
    int aio_numusers;           /* Not used.  */
    int aio_idle_time;          /* Number of seconds before idle thread
                                   terminates.  */
    int aio_reserved;
  };

That would seem to be the existing way to limit this (see the example
below). What I don't understand is why you restrict the pthread aio
implementation to only allow *one* outstanding request per file
descriptor.
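
For example (illustrative values only, using the aioinit fields shown
above):

/* Raise the glibc helper-thread cap up front, before any aio_* calls. */
#define _GNU_SOURCE
#include <aio.h>

static void setup_aio(void)
{
	struct aioinit init = {
		.aio_threads = 64,	/* max helper threads */
		.aio_num = 256,		/* expected simultaneous requests */
		.aio_idle_time = 1,	/* seconds before an idle thread exits */
	};
	aio_init(&init);
}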

Surely it doesn't matter which fd the requests are outstanding on;
isn't it the total number of outstanding threads that really matters?

By limiting this to one outstanding request per fd, glibc prevents any
parallelization of requests on a single file for a file server written
to use its aio interface. Windows SMB2 clients issue multiple
simultaneous asynchronous requests on a single file, and the current
glibc implementation forces them to run serially.

Volker had to create a vfs_aio_fork Samba VFS module that uses
forked processes and shared memory communication to get around
this exact problem. It does improve performance for SMB2 clients,
but means we have many more processes than we'd otherwise need
if we didn't have the restriction.

Do you have benchmark data showing that limiting outstanding requests
to one per fd is a performance win?

Do you want to see my patch to glibc that removes this restriction, so
we can test this on whatever benchmarks you used to decide on the
one-request-per-fd restriction?

Jeremy.


* Re: Approaches to making io_submit not block
  2011-08-31  5:26     ` Christoph Hellwig
@ 2011-08-31 17:08       ` Andi Kleen
  2011-08-31 21:00         ` Daniel Ehrenberg
  2011-09-01  4:18         ` Dave Chinner
  2011-09-01  3:39       ` Dave Chinner
  1 sibling, 2 replies; 41+ messages in thread
From: Andi Kleen @ 2011-08-31 17:08 UTC (permalink / raw)
  To: Christoph Hellwig; +Cc: Daniel Ehrenberg, linux-kernel

Christoph Hellwig <hch@infradead.org> writes:
>
> I'll get it polished up and send it out for RFC once Dave sends out
> the updated allocation workqueue patch.  With this he moves all
> allocator calls in XFS into a workqueue.  My direct I/O patch uses that
> fact to use that workqueue for the allocator call

Is that really a good direction? The problem when you push operations
from multiple threads all into a single resource (per cpu workqueue)
is that the CPU scheduler loses control over that because they
are all mixed up.

So if one guy submits a lot and another very little the "a lot" guy
can overwhelm the queue for the very little guy.

We also have similar problems with the IO schedulers, which also
rely on process context to make fairness decisions. If you remove
the process context they do badly.

-Andi

-- 
ak@linux.intel.com -- Speaking for myself only

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: Approaches to making io_submit not block
  2011-08-31 17:08       ` Andi Kleen
@ 2011-08-31 21:00         ` Daniel Ehrenberg
  2011-08-31 21:15           ` Andi Kleen
  2011-09-01  4:18         ` Dave Chinner
  1 sibling, 1 reply; 41+ messages in thread
From: Daniel Ehrenberg @ 2011-08-31 21:00 UTC (permalink / raw)
  To: Andi Kleen; +Cc: Christoph Hellwig, linux-kernel

On Wed, Aug 31, 2011 at 10:08 AM, Andi Kleen <andi@firstfloor.org> wrote:
> Christoph Hellwig <hch@infradead.org> writes:
>>
>> I'll get it polished up and send it out for RFC once Dave sends out
>> the updated allocation workqueue patch.  With this he moves all
>> allocator calls in XFS into a workqueue.  My direct I/O patch uses that
>> fact to use that workqueue for the allocator call
>
> Is that really a good direction? The problem when you push operations
> from multiple threads all into a single resource (per cpu workqueue)
> is that the CPU scheduler loses control over that because they
> are all mixed up.
>
> So if one guy submits a lot and another very little the "a lot" guy
> can overwhelm the queue for the very little guy.
>
> We also have similar problems with the IO schedulers, which also
> rely on process context to make fairness decisions. If you remove
> the process context they do badly.
>
> -Andi
>
> --
> ak@linux.intel.com -- Speaking for myself only
>

This objection would seem to point to the benefits of doing something
like Suparna's patches, with a wait queue per task, no? This preserves
the current regime where each thread calling io_submit ends up
submitting things in parallel.

Dan

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: Approaches to making io_submit not block
  2011-08-31 21:00         ` Daniel Ehrenberg
@ 2011-08-31 21:15           ` Andi Kleen
  0 siblings, 0 replies; 41+ messages in thread
From: Andi Kleen @ 2011-08-31 21:15 UTC (permalink / raw)
  To: Daniel Ehrenberg; +Cc: Andi Kleen, Christoph Hellwig, linux-kernel

> This objection would seem to point to the benefits of doing something
> like Suparna's patches, with a wait queue per task, no? This preserves
> the current regime where each thread calling io_submit ends up
> submitting things in parallel.

Yep, Suparna's code didn't have that problem.

-Andi

-- 
ak@linux.intel.com -- Speaking for myself only.

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: Approaches to making io_submit not block
  2011-08-31  6:04                 ` guy keren
@ 2011-08-31 23:16                   ` Daniel Ehrenberg
  2011-08-31 23:48                     ` guy keren
  0 siblings, 1 reply; 41+ messages in thread
From: Daniel Ehrenberg @ 2011-08-31 23:16 UTC (permalink / raw)
  To: guy keren; +Cc: Andrew Morton, Jens Axboe, Jeff Moyer, linux-kernel, linux-aio

On Tue, Aug 30, 2011 at 11:04 PM, guy keren <choo@actcom.co.il> wrote:
> On Tue, 2011-08-30 at 15:54 -0700, Andrew Morton wrote:
>> On Tue, 30 Aug 2011 15:45:35 -0700
>> Daniel Ehrenberg <dehrenberg@google.com> wrote:
>>
>> > >> Not quite sure, and after working on them and fixing things up, I don't
>> > >> even think they are that complex or intrusive (which I think otherwise
>> > >> would've been the main objection). Andrew may know/remember.
>> > >
>> > > Boy, that was a long time ago.  I was always unhappy with the patches
>> > > because of the amount of additional code/complexity they added.
>> > >
>> > > Then the great syslets/threadlets design session happened and it was
>> > > expected that such a facility would make special async handling for AIO
>> > > unnecessary.  Then syslets/threadlets didn't happen.
>> >
>> > Do you think we could accomplish the goals with less additional
>> > code/complexity? It looks like the latest version of the patch set
>> > wasn't so invasive.
>> >
>> > If syslets/threadlets aren't happening, should these patches be
>> > reconsidered for inclusion in the kernel?
>>
>> I haven't seen any demand at all for the feature in many years.  That
>> doesn't mean that there _isn't_ any demand - perhaps everyone got
>> exhausted.
>
> you should consider the emerging enterprise-grade SSD devices - which
> can serve several tens of thousands of I/O requests per device (actually
> per controller). These devices could be better utilized by better
> interfaces. Furthermore, in our company we had to resort to using
> Windows for IOPS benchmarking (using iometer) against storage systems
> using these (and similar) devices, because it manages to generate higher
> IOPS than Linux can (I don't remember the exact numbers, but we are
> talking on the order of several hundred thousand IOPS).
>
> It could be that we are currently an esoteric use case - but the
> high-end performance market seems to be stepping in that direction.

I'm interested in SSD performance too. Could you tell me more about
your use case? Were you using a file system or a raw block device? The
patches we're discussing don't have any effect on a raw block device.
Do you have any particular ideas about a new interface? What does
Windows provide that Linux lacks that's relevant here?
>
>> If there is demand then that should be described and circulated, see
>> how much interest there is in resurrecting the effort.
>>
>> And, of course, the patches should be dragged out and looked at - it's
>> been a number of years now.
>>
>> Also, glibc has userspace for POSIX AIO.  A successful kernel-based
>> implementation would result in glibc migrating away from its current
>> implementation.  So we should work with the glibc developers on ensuring
>> that the migration can happen.
>
> glibc's userspace implementation doesn't scale to fast devices. It could
> make sense when working with slower disk devices - not when you're
> working with solid-state storage devices.
>
> --guy
>
>

Dan

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: Approaches to making io_submit not block
  2011-08-31 23:16                   ` Daniel Ehrenberg
@ 2011-08-31 23:48                     ` guy keren
  2011-08-31 23:59                       ` Daniel Ehrenberg
  0 siblings, 1 reply; 41+ messages in thread
From: guy keren @ 2011-08-31 23:48 UTC (permalink / raw)
  To: Daniel Ehrenberg
  Cc: Andrew Morton, Jens Axboe, Jeff Moyer, linux-kernel, linux-aio

On Wed, 2011-08-31 at 16:16 -0700, Daniel Ehrenberg wrote:
> On Tue, Aug 30, 2011 at 11:04 PM, guy keren <choo@actcom.co.il> wrote:
> > On Tue, 2011-08-30 at 15:54 -0700, Andrew Morton wrote:
> >> On Tue, 30 Aug 2011 15:45:35 -0700
> >> Daniel Ehrenberg <dehrenberg@google.com> wrote:
> >>
> >> > >> Not quite sure, and after working on them and fixing things up, I don't
> >> > >> even think they are that complex or intrusive (which I think otherwise
> >> > >> would've been the main objection). Andrew may know/remember.
> >> > >
> >> > > Boy, that was a long time ago.  I was always unhappy with the patches
> >> > > because of the amount of additional code/complexity they added.
> >> > >
> >> > > Then the great syslets/threadlets design session happened and it was
> >> > > expected that such a facility would make special async handling for AIO
> >> > > unnecessary.  Then syslets/threadlets didn't happen.
> >> >
> >> > Do you think we could accomplish the goals with less additional
> >> > code/complexity? It looks like the latest version of the patch set
> >> > wasn't so invasive.
> >> >
> >> > If syslets/threadlets aren't happening, should these patches be
> >> > reconsidered for inclusion in the kernel?
> >>
> >> I haven't seen any demand at all for the feature in many years.  That
> >> doesn't mean that there _isn't_ any demand - perhaps everyone got
> >> exhausted.
> >
> > you should consider the emerging enterprise-grade SSD devices - which
> > can serve several tens of thousands of I/O requests per device (actually
> > per controller). These devices could be better utilized by better
> > interfaces. Furthermore, in our company we had to resort to using
> > Windows for IOPS benchmarking (using iometer) against storage systems
> > using these (and similar) devices, because it manages to generate higher
> > IOPS than Linux can (I don't remember the exact numbers, but we are
> > talking on the order of several hundred thousand IOPS).
> >
> > It could be that we are currently an esoteric use case - but the
> > high-end performance market seems to be stepping in that direction.
> 
> I'm interested in SSD performance too. Could you tell me more about
> your use case? Were you using a file system or a raw block device? The
> patches we're discussing don't have any effect on a raw block device.

Well, the use case I discussed specifically was with raw devices -
not file systems.

For file system info, I'll have to consult the people who were
running benchmarks at our workplace.

> Do you have any particular ideas about a new interface? What does
> Windows provide that Linux lacks that's relevant here?

I don't know exactly what it provides that Linux does not - basically, it
provides a similar asynchronous I/O API (using a mechanism they call
"completion ports") - it just seems that they have a faster
implementation (we compare execution on the same box, with 8Gb/s
Fibre Channel connections, and we are comparing IOPS - not
bandwidth or latency; the storage device is the product that we
manufacture, which is based on DRAM for storage).

I can't tell you which specific part causes the performance
differences - the AIO implementation, the multi-path driver or something
else.

Internally, inside the box, we had problems when attempting to recover
after a disconnection - back when we used iSCSI as our internal
transport. We stopped using it, so this is no longer relevant for us, but
the phenomenon we saw was that at certain times, when we had many (a few
tens of) AIO operations to perform at once, it could take several
seconds just to send them all (I'm not talking about completion). This
was when we used the POSIX API on top of Linux's AIO implementation
(i.e. using librtkaio - not the user-space implementation in
glibc).

> >
> >> If there is demand then that should be described and circulated, see
> >> how much interest there is in resurrecting the effort.
> >>
> >> And, of course, the patches should be dragged out and looked at - it's
> >> been a number of years now.
> >>
> >> Also, glibc has userspace for POSIX AIO.  A successful kernel-based
> >> implementation would result in glibc migrating away from its current
> >> implementation.  So we should work with the glibc developers on ensuring
> >> that the migration can happen.
> >
> > glibc's userspace implementation doesn't scale to fast devices. It could
> > make sense when working with slower disk devices - not when you're
> > working with solid-state storage devices.
> >
> > --guy
> >
> >
> 
> Dan

--guy


^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: Approaches to making io_submit not block
  2011-08-31 23:48                     ` guy keren
@ 2011-08-31 23:59                       ` Daniel Ehrenberg
  0 siblings, 0 replies; 41+ messages in thread
From: Daniel Ehrenberg @ 2011-08-31 23:59 UTC (permalink / raw)
  To: guy keren; +Cc: Andrew Morton, Jens Axboe, Jeff Moyer, linux-kernel, linux-aio

On Wed, Aug 31, 2011 at 4:48 PM, guy keren <choo@actcom.co.il> wrote:
> On Wed, 2011-08-31 at 16:16 -0700, Daniel Ehrenberg wrote:
>> On Tue, Aug 30, 2011 at 11:04 PM, guy keren <choo@actcom.co.il> wrote:
>> > On Tue, 2011-08-30 at 15:54 -0700, Andrew Morton wrote:
>> >> On Tue, 30 Aug 2011 15:45:35 -0700
>> >> Daniel Ehrenberg <dehrenberg@google.com> wrote:
>> >>
>> >> > >> Not quite sure, and after working on them and fixing things up, I don't
>> >> > >> even think they are that complex or intrusive (which I think otherwise
>> >> > >> would've been the main objection). Andrew may know/remember.
>> >> > >
>> >> > > Boy, that was a long time ago.  I was always unhappy with the patches
>> >> > > because of the amount of additional code/complexity they added.
>> >> > >
>> >> > > Then the great syslets/threadlets design session happened and it was
>> >> > > expected that such a facility would make special async handling for AIO
>> >> > > unnecessary.  Then syslets/threadlets didn't happen.
>> >> >
>> >> > Do you think we could accomplish the goals with less additional
>> >> > code/complexity? It looks like the latest version of the patch set
>> >> > wasn't so invasive.
>> >> >
>> >> > If syslets/threadlets aren't happening, should these patches be
>> >> > reconsidered for inclusion in the kernel?
>> >>
>> >> I haven't seen any demand at all for the feature in many years.  That
>> >> doesn't mean that there _isn't_ any demand - perhaps everyone got
>> >> exhausted.
>> >
>> > you should consider the emerging enterprise-grade SSD devices - which
>> > can serve several tens of thousands of I/O requests per device (actually
>> > per controller). These devices could be better utilized by better
>> > interfaces. Furthermore, in our company we had to resort to using
>> > Windows for IOPS benchmarking (using iometer) against storage systems
>> > using these (and similar) devices, because it manages to generate higher
>> > IOPS than Linux can (I don't remember the exact numbers, but we are
>> > talking on the order of several hundred thousand IOPS).
>> >
>> > It could be that we are currently an esoteric use case - but the
>> > high-end performance market seems to be stepping in that direction.
>>
>> I'm interested in SSD performance too. Could you tell me more about
>> your use case? Were you using a file system or a raw block device? The
>> patches we're discussing don't have any effect on a raw block device.
>
> Well, the use case I discussed specifically was with raw devices -
> not file systems.
>
> For file system info, I'll have to consult the people who were
> running benchmarks at our workplace.
>
>> Do you have any particular ideas about a new interface? What does
>> Windows provide that Linux lacks that's relevant here?
>
> I don't know exactly what it provides that Linux does not - basically, it
> provides a similar asynchronous I/O API (using a mechanism they call
> "completion ports") - it just seems that they have a faster
> implementation (we compare execution on the same box, with 8Gb/s
> Fibre Channel connections, and we are comparing IOPS - not
> bandwidth or latency; the storage device is the product that we
> manufacture, which is based on DRAM for storage).
>
> I can't tell you which specific part causes the performance
> differences - the AIO implementation, the multi-path driver or something
> else.
>
> Internally, inside the box, we had problems when attempting to recover
> after a disconnection - back when we used iSCSI as our internal
> transport. We stopped using it, so this is no longer relevant for us, but
> the phenomenon we saw was that at certain times, when we had many (a few
> tens of) AIO operations to perform at once, it could take several
> seconds just to send them all (I'm not talking about completion). This
> was when we used the POSIX API on top of Linux's AIO implementation
> (i.e. using librtkaio - not the user-space implementation in
> glibc).

I'm just as interested in improving the performance of the raw block
device as I am of the file system. Any more details you could give me
about this would be great. You're saying io_submit on a raw block
device blocked for tens of seconds? Did your POSIX AIO implementation
make sure not to overrun the queue length established in io_setup?
Could you provide the test code you used? Do you have function-level
CPU profiles available?
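
(For context, the queue depth in question is the nr_events argument to
io_setup(); a minimal libaio-style sketch, with illustrative values:)

#include <libaio.h>
#include <fcntl.h>
#include <stddef.h>

/* Submit one async read; the submitter must keep no more than the
   nr_events passed to io_setup() requests in flight at once. */
static int submit_one_read(int fd, void *buf, size_t len)
{
    io_context_t ctx = 0;           /* must be zeroed before io_setup() */
    struct iocb cb;
    struct iocb *cbs[1] = { &cb };

    if (io_setup(128, &ctx) < 0)    /* queue depth: 128 requests */
        return -1;
    io_prep_pread(&cb, fd, buf, len, 0);
    return io_submit(ctx, 1, cbs);  /* may still block, as discussed here */
}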
>
>> >
>> >> If there is demand then that should be described and circulated, see
>> >> how much interest there is in resurrecting the effort.
>> >>
>> >> And, of course, the patches should be dragged out and looked at - it's
>> >> been a number of years now.
>> >>
>> >> Also, glibc has userspace for POSIX AIO.  A successful kernel-based
>> >> implementation would result in glibc migrating away from its current
>> >> implementation.  So we should work with the glibc developers on ensuring
>> >> that the migration can happen.
>> >
>> > glibc's userspace implementation doesn't scale to fast devices. It could
>> > make sense when working with slower disk devices - not when you're
>> > working with solid-state storage devices.
>> >
>> > --guy
>> >
>> >
>>
>> Dan
>
> --guy
>
>

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: Approaches to making io_submit not block
  2011-08-31  5:26     ` Christoph Hellwig
  2011-08-31 17:08       ` Andi Kleen
@ 2011-09-01  3:39       ` Dave Chinner
  2011-09-01  4:20         ` Christoph Hellwig
  1 sibling, 1 reply; 41+ messages in thread
From: Dave Chinner @ 2011-09-01  3:39 UTC (permalink / raw)
  To: Christoph Hellwig; +Cc: Daniel Ehrenberg, linux-kernel

On Wed, Aug 31, 2011 at 01:26:27AM -0400, Christoph Hellwig wrote:
> On Tue, Aug 30, 2011 at 02:51:01PM -0700, Daniel Ehrenberg wrote:
> > > Let filesystems handle this.  I've actually prototyped it in XFS,
> > > based on some pending work from Dave but at this point it's still butt
> > > ugly.
> > 
> > Great, would you be willing to let me see the draft code?
> > 
> > Are you sure that there wouldn't be any benefit to having the code be
> > in the aio/dio levels in terms of making it easier for file
> > systems/reducing code duplication?
> 
> I'll get it polished up and send it out for RFC once Dave sends out
> the updated allocation workqueue patch.  With this he moves all
> allocator calls in XFS into a workqueue.  My direct I/O patch uses that
> fact to use that workqueue for the allocator call and lets the existing
> aio retry infrastructure retry the direct I/O operation once that
> workqueue item has finished.

I thought you didn't like that code, Christoph. ;)

I'll resend it soon.

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: Approaches to making io_submit not block
  2011-08-31 17:08       ` Andi Kleen
  2011-08-31 21:00         ` Daniel Ehrenberg
@ 2011-09-01  4:18         ` Dave Chinner
  2011-09-01  4:39           ` Andi Kleen
  1 sibling, 1 reply; 41+ messages in thread
From: Dave Chinner @ 2011-09-01  4:18 UTC (permalink / raw)
  To: Andi Kleen; +Cc: Christoph Hellwig, Daniel Ehrenberg, linux-kernel

On Wed, Aug 31, 2011 at 10:08:50AM -0700, Andi Kleen wrote:
> Christoph Hellwig <hch@infradead.org> writes:
> >
> > I'll get it polished up and send it out for RFC once Dave sends out
> > the updated allocation workqueue patch.  With this he moves all
> > allocator calls in XFS into a workqueue.  My direct I/O patch uses that
> > fact to use that workqueue for the allocator call
> 
> Is that really a good direction? The problem when you push operations
> from multiple threads all into a single resource (per cpu workqueue)
> is that the CPU scheduler loses control over that because they
> are all mixed up.

Allocations are already serialised by a single resource - the AGF
lock - so whether they block on the workqueue queue or on the AGF
lock is irrelevant to scheduling. And a single thread can only have
a single allocation outstanding at a time because the caller has to
block waiting for the allocation to complete before moving on. 

> So if one guy submits a lot and another very little the "a lot" guy
> can overwhelm the queue for the very little guy.

If we get lots of allocations queued on the one per-CPU wq, they
will have all had to come from different contexts. In which case,
FIFO processing of the work queued up is *exactly* the fairness we
want, because that is exactly what doing them from process context
would end up with.

If the allocation work blocks (either on locks or metadata reads),
the workqueue is configured with a large number of concurrent
operations per CPU (I think I set it to the maximum of 512 work items
per CPU), so other pending allocations from the same per-CPU workqueue
can run in the meantime.
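
(A sketch of the kind of workqueue being described - the name, flags
and init function here are illustrative, not the actual XFS patch:)

#include <linux/workqueue.h>
#include <linux/init.h>
#include <linux/errno.h>

static struct workqueue_struct *alloc_wq;

static int __init alloc_wq_init(void)
{
	/* max_active = 512: up to 512 queued allocation works may run
	 * concurrently per CPU, so one work blocking on a lock or a
	 * metadata read does not stall the rest of the queue. */
	alloc_wq = alloc_workqueue("xfs-alloc", WQ_MEM_RECLAIM, 512);
	return alloc_wq ? 0 : -ENOMEM;
}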

> We also have similar problems with the IO schedulers, which also
> rely on process context to make fairness decisions. If you remove
> the process context they do badly.

Which, IMO, is a significant failing of the IO scheduler in question
(CFQ) because it'll perform badly the moment your application or
filesystem uses a multithreaded IO architecture.  Filesystem metadata
is a global resource, not a per-process context resource, so IO
schedulers need to treat it that way.

Indeed, taking the allocation IO out of the process context means
the filesystem operations are not subject to process-context-based
throttling. Such throttling can lead to priority inversion problems
when a low priority process is throttled on a metadata read IO needed
to complete an allocation that a high priority process is waiting on
being completed...

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: Approaches to making io_submit not block
  2011-09-01  3:39       ` Dave Chinner
@ 2011-09-01  4:20         ` Christoph Hellwig
  0 siblings, 0 replies; 41+ messages in thread
From: Christoph Hellwig @ 2011-09-01  4:20 UTC (permalink / raw)
  To: Dave Chinner; +Cc: Christoph Hellwig, Daniel Ehrenberg, linux-kernel

On Thu, Sep 01, 2011 at 01:39:02PM +1000, Dave Chinner wrote:
> > I'll get it polished up and send it out for RFC once Dave sends out
> > the updated allocation workqueue patch.  With this he moves all
> > allocator calls in XFS into a workqueue.  My direct I/O patch uses that
> > fact to use that workqueue for the allocator call and lets the existing
> > aio retry infrastructure retry the direct I/O operation once that
> > workqueue item has finished.
> 
> I thought you didn't like that code, Christoph. ;)

I still don't like it, but for this case it actually was helpful.
You always have to see the positive side of things.


^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: Approaches to making io_submit not block
  2011-09-01  4:18         ` Dave Chinner
@ 2011-09-01  4:39           ` Andi Kleen
  2011-09-01  6:54             ` Dave Chinner
  0 siblings, 1 reply; 41+ messages in thread
From: Andi Kleen @ 2011-09-01  4:39 UTC (permalink / raw)
  To: Dave Chinner
  Cc: Andi Kleen, Christoph Hellwig, Daniel Ehrenberg, linux-kernel

> Allocations are already serialised by a single resource - the AGF
> lock - so whether they block on the workqueue queue or on the AGF
> lock is irrelevant to scheduling. And a single thread can only have

It's not about blocking, but about who gets the work accounted
when it is done.

> If we get lots of allocations queued on the one per-CPU wq, they
> will have all had to come from different contexts. In which case,
> FIFO processing of the work queued up is *exactly* the fairness we
> want, because that is exactly what doing them from process context
> would end up with.

You want the work accounted to the originator so that it can be
slowed down when it does too much (e.g. hits its cgroup or CFQ limits).

Networking learned these lessons a long time ago; it's much
better for overload behavior when as much as possible is done in
process context.

-Andi

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: Approaches to making io_submit not block
  2011-09-01  4:39           ` Andi Kleen
@ 2011-09-01  6:54             ` Dave Chinner
  2011-09-02 13:08               ` Ted Ts'o
  0 siblings, 1 reply; 41+ messages in thread
From: Dave Chinner @ 2011-09-01  6:54 UTC (permalink / raw)
  To: Andi Kleen; +Cc: Christoph Hellwig, Daniel Ehrenberg, linux-kernel

On Thu, Sep 01, 2011 at 06:39:47AM +0200, Andi Kleen wrote:
> > Allocations are already serialised by a single resource - the AGF
> > lock - so whether they block on the workqueue queue or on the AGF
> > lock is irrelevant to scheduling. And a single thread can only have
> 
> It's not about blocking, but about who gets the work accounted
> when it is done.

Don't trim the context away and respond with a completely different
argument that is irrelevant to the original context!  Accounting who
did the work is irrelevant to the discussion context of the fairness
of queuing and dispatching synchronous allocations via a FIFO wq
implementation....

> > If we get lots of allocations queued on the one per-CPU wq, they
> > will have all had to come from different contexts. In which case,
> > FIFO processing of the work queued up is *exactly* the fairness we
> > want, because that is exactly what doing them from process context
> > would end up with.
> 
> You want the work accounted to the originator so that it can be
> slowed down when it does too much (e.g. hits its cgroup or CFQ limits).

So fix the workqueue infrastructure to track it properly. Don't keep
bringing this up as a reason for saying moving work to workqueues is
bad.

> Networking learned these lessons a long time ago; it's much
> better for overload behavior when as much as possible is done in
> process context.

Apples to oranges - there are orders of magnitude of difference in the
number of operations that the different stacks do. Allocation in XFS
when it does not block can still take milliseconds of CPU time; in
comparison, the networking stack is expected to process thousands of
packets in that same time frame.  IOWs, the scale of processing per
item of work is -vastly- different - that's why working in process
context matters a great deal to the networking stack but not to
allocation in XFS.

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: Approaches to making io_submit not block
  2011-08-31 16:59                       ` Jeremy Allison
@ 2011-09-01 11:14                         ` Ulrich Drepper
  2011-09-01 15:58                           ` Jeremy Allison
  0 siblings, 1 reply; 41+ messages in thread
From: Ulrich Drepper @ 2011-09-01 11:14 UTC (permalink / raw)
  To: Jeremy Allison
  Cc: Andrew Morton, Daniel Ehrenberg, Jens Axboe, Jeff Moyer,
	linux-kernel, linux-aio

On 08/31/2011 12:59 PM, Jeremy Allison wrote:
> > I get that, but isn't that what the aio_init(const struct aioinit *init)
> > call is meant to solve?

The problem cannot be solved by something that trivial.  Any thread can
be delayed indefinitely.  If this happens for a file descriptor, chances
are that all threads for the same file descriptor are affected while
there is I/O for all the other file descriptors ready to run.  I don't
say this is anywhere near optimal or even good, but at least it doesn't
amplify problems.  If you know you want more parallelism on the same
file descriptor, dup it.  If you don't want anyone like me implementing
stupid limitations, finally fix the kernel aio interface so that it is
usable.



^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: Approaches to making io_submit not block
  2011-09-01 11:14                         ` Ulrich Drepper
@ 2011-09-01 15:58                           ` Jeremy Allison
  2011-09-01 16:04                             ` Christoph Hellwig
  0 siblings, 1 reply; 41+ messages in thread
From: Jeremy Allison @ 2011-09-01 15:58 UTC (permalink / raw)
  To: Ulrich Drepper
  Cc: Jeremy Allison, Andrew Morton, Daniel Ehrenberg, Jens Axboe,
	Jeff Moyer, linux-kernel, linux-aio

On Thu, Sep 01, 2011 at 07:14:20AM -0400, Ulrich Drepper wrote:
> On 08/31/2011 12:59 PM, Jeremy Allison wrote:
> > I get that, but isn't that what the aio_init(const struct aioinit *init)
> > call is meant to solve?
> 
> The problem cannot be solved by something that trivial.  Any thread can
> be delayed indefinitely.  If this happens for a file descriptor, chances
> are that all threads for the same file descriptor are affected while
> there is I/O for all the other file descriptors ready to run.  I don't
> say this is anywhere near optimal or even good, but at least it doesn't
> amplify problems.  If you know you want more parallelism on the same
> file descriptor, dup it.

Yes, I did consider that, of course. The problem is that leads you to the
nightmare of losing all fcntl locks on the file when any of the
descriptors is closed. Of course we already have internal workarounds
for that - but they're not scalable in this case.  We'd have to dup on
every read/write, and because of the fcntl lock problem we have to keep
all fds around until the final close of the file. Don't tell us to
implement our own locking instead, because (a) we already do in the case
where we don't need locking consistency with NFS, and (b) most vendors
insist on locking consistency with NFS - it's not good if locks taken
over one protocol aren't seen by another.
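
(A sketch of the POSIX behaviour being described - illustrative, not
Samba code:)

#include <fcntl.h>
#include <unistd.h>

/* A process's fcntl locks on a file are dropped as soon as *any*
   descriptor for that file is closed - not just the descriptor the
   lock was taken on. */
static void lock_drop_pitfall(const char *path)
{
    int fd1 = open(path, O_RDWR);
    int fd2 = open(path, O_RDWR);   /* second fd for the same file */
    struct flock fl = {
        .l_type = F_WRLCK,          /* whole-file write lock */
        .l_whence = SEEK_SET,
        .l_start = 0,
        .l_len = 0,
    };

    fcntl(fd1, F_SETLK, &fl);       /* lock taken through fd1 */
    close(fd2);                     /* ...and silently lost here */
    close(fd1);
}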

> If you don't want anyone like me implementing
> stupid limitations finally fix the kernel aio interface so that it is
> usable.

If you know the limitation is stupid, then one wonders why you added it
in the first place :-). Whatever happened to "the application writer
knows best what they are trying to do" and letting us hang ourselves?

I agree the kernel aio interface is unusable from applications, no
arguments about that.

Jeremy.

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: Approaches to making io_submit not block
  2011-09-01 15:58                           ` Jeremy Allison
@ 2011-09-01 16:04                             ` Christoph Hellwig
  2011-09-01 16:15                               ` Jeremy Allison
  0 siblings, 1 reply; 41+ messages in thread
From: Christoph Hellwig @ 2011-09-01 16:04 UTC (permalink / raw)
  To: Jeremy Allison
  Cc: Ulrich Drepper, Andrew Morton, Daniel Ehrenberg, Jens Axboe,
	Jeff Moyer, linux-kernel, linux-aio

On Thu, Sep 01, 2011 at 08:58:45AM -0700, Jeremy Allison wrote:
> Yes, I did consider that, of course. The problem is that leads you to the
> nightmare of losing all fcntl locks on the file when any of the
> descriptors is closed. Of course we already have internal workarounds
> for that - but they're not scalable in this case.  We'd have to dup on
> every read/write, and because of the fcntl lock problem we have to keep
> all fds around until the final close of the file. Don't tell us to
> implement our own locking instead, because (a) we already do in the case
> where we don't need locking consistency with NFS, and (b) most vendors
> insist on locking consistency with NFS - it's not good if locks taken
> over one protocol aren't seen by another.

We could easily give you an fcntl / dup3 flag to only release posix
locks on the final close of a struct file if that helps you.

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: Approaches to making io_submit not block
  2011-09-01 16:04                             ` Christoph Hellwig
@ 2011-09-01 16:15                               ` Jeremy Allison
  2011-09-01 16:23                                 ` Christoph Hellwig
  0 siblings, 1 reply; 41+ messages in thread
From: Jeremy Allison @ 2011-09-01 16:15 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Jeremy Allison, Ulrich Drepper, Andrew Morton, Daniel Ehrenberg,
	Jens Axboe, Jeff Moyer, linux-kernel, linux-aio

On Thu, Sep 01, 2011 at 12:04:36PM -0400, Christoph Hellwig wrote:
> On Thu, Sep 01, 2011 at 08:58:45AM -0700, Jeremy Allison wrote:
> > Yes, I did consider that, of course. The problem is that leads you to the
> > nightmare of losing all fcntl locks on the file when any of the
> > descriptors is closed. Of course we already have internal workarounds
> > for that - but they're not scalable in this case.  We'd have to dup on
> > every read/write, and because of the fcntl lock problem we have to keep
> > all fds around until the final close of the file. Don't tell us to
> > implement our own locking instead, because (a) we already do in the case
> > where we don't need locking consistency with NFS, and (b) most vendors
> > insist on locking consistency with NFS - it's not good if locks taken
> > over one protocol aren't seen by another.
> 
> We could easily give you an fcntl / dup3 flag to only release posix
> locks on the final close of a struct file if that helps you.

That would help us enormously - it'd be Linux only of course but
we could easily add support for that.

Can you propose the design here so we can run it past some of the
Solaris/FreeBSD folks (it'd be nice if we could get broader adoption) ?

Jeremy.

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: Approaches to making io_submit not block
  2011-09-01 16:15                               ` Jeremy Allison
@ 2011-09-01 16:23                                 ` Christoph Hellwig
  2011-09-01 16:31                                   ` Jeremy Allison
  0 siblings, 1 reply; 41+ messages in thread
From: Christoph Hellwig @ 2011-09-01 16:23 UTC (permalink / raw)
  To: Jeremy Allison
  Cc: Christoph Hellwig, Ulrich Drepper, Andrew Morton,
	Daniel Ehrenberg, Jens Axboe, Jeff Moyer, linux-kernel,
	linux-aio

On Thu, Sep 01, 2011 at 09:15:31AM -0700, Jeremy Allison wrote:
> > We could easily give you an fcntl / dup3 flag to only release posix
> > locks on the final close of a struct file if that helps you.
> 
> That would help us enormously - it'd be Linux only of course but
> we could easily add support for that.
> 
> Can you propose the design here so we can run it past some of the
> Solaris/FreeBSD folks (it'd be nice if we could get broader adoption) ?

Not sure there is all that much to discuss.  The idea is to have locks
that behave like Posix locks, but only get released when the last duped
fd for them gets closed.

We'd define a new O_LOCKS_WHATEVER flag for it, which gets set either
using fcntl(..., F_SETFL, ...) or dup3.  All in all that should be less
than 50 lines of code in the kernel.

The alternative would be to design a different lock type, but that would
be a lot more invasive, and not provide any real benefits.
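
(A hypothetical usage sketch - O_LOCKS_WHATEVER is the placeholder name
from this mail, not a real kernel flag:)

#define _GNU_SOURCE
#include <fcntl.h>
#include <unistd.h>

/* With the hypothetical flag set on the open file description, fcntl
   locks would survive until the last duped fd is closed. */
static int open_with_persistent_locks(const char *path)
{
    int fd = open(path, O_RDWR);

    if (fd < 0)
        return -1;
    if (fcntl(fd, F_SETFL, fcntl(fd, F_GETFL) | O_LOCKS_WHATEVER) < 0) {
        close(fd);
        return -1;
    }
    return fd;
}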


^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: Approaches to making io_submit not block
  2011-09-01 16:23                                 ` Christoph Hellwig
@ 2011-09-01 16:31                                   ` Jeremy Allison
  2011-09-01 16:34                                     ` Christoph Hellwig
  2011-09-01 16:34                                     ` Jeremy Allison
  0 siblings, 2 replies; 41+ messages in thread
From: Jeremy Allison @ 2011-09-01 16:31 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Jeremy Allison, Ulrich Drepper, Andrew Morton, Daniel Ehrenberg,
	Jens Axboe, Jeff Moyer, linux-kernel, linux-aio

On Thu, Sep 01, 2011 at 12:23:37PM -0400, Christoph Hellwig wrote:
> On Thu, Sep 01, 2011 at 09:15:31AM -0700, Jeremy Allison wrote:
> > > We could easily give you an fcntl / dup3 flag to only release posix
> > > locks on the final close of a struct file if that helps you.
> > 
> > That would help us enormously - it'd be Linux only of course but
> > we could easily add support for that.
> > 
> > Can you propose the design here so we can run it past some of the
> > Solaris/FreeBSD folks (it'd be nice if we could get broader adoption) ?
> 
> Not sure there is all that much to discuss.  The idea is to have locks
> that behave like Posix locks, but only get released when the last duped
> fd for them gets closed.
> 
> We'd define a new O_LOCKS_WHATEVER flag for it, which gets set either
> using fcntl(..., F_SETFL, ...) or dup3.  All in all that should be less
> than 50 lines of code in the kernel.

Ok, so it'd be set at open() time, say:

O_CLOLOCK_PERSIST

(to match the naming of something like O_CLOEXEC) and be available to set
with F_SETFD via an fcntl and dup3 call?

> The alternative would be to design a different lock type, but that would
> be a lot more invasive, and not provide any real benefits.

No, we don't want that thanks :-).

Jeremy

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: Approaches to making io_submit not block
  2011-09-01 16:31                                   ` Jeremy Allison
@ 2011-09-01 16:34                                     ` Christoph Hellwig
  2011-09-01 16:34                                     ` Jeremy Allison
  1 sibling, 0 replies; 41+ messages in thread
From: Christoph Hellwig @ 2011-09-01 16:34 UTC (permalink / raw)
  To: Jeremy Allison
  Cc: Christoph Hellwig, Ulrich Drepper, Andrew Morton,
	Daniel Ehrenberg, Jens Axboe, Jeff Moyer, linux-kernel,
	linux-aio

On Thu, Sep 01, 2011 at 09:31:07AM -0700, Jeremy Allison wrote:
> > We'd define a new O_LOCKS_WHATEVER flag for it, which gets set either
> > using fcntl(..., F_SETFL, ...) or dup3.  All in all that should be less
> > than 50 lines of code in the kernel.
> 
> Ok, so it'd be set at open() time, say:
> 
> O_CLOLOCK_PERSIST
> 
> (to match the naming of something like O_CLOEXEC) and be available to set
> with F_SETFD via an fcntl and dup3 call?

Yes, that's the idea.


^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: Approaches to making io_submit not block
  2011-09-01 16:31                                   ` Jeremy Allison
  2011-09-01 16:34                                     ` Christoph Hellwig
@ 2011-09-01 16:34                                     ` Jeremy Allison
  2011-09-01 16:45                                       ` Christoph Hellwig
  1 sibling, 1 reply; 41+ messages in thread
From: Jeremy Allison @ 2011-09-01 16:34 UTC (permalink / raw)
  To: Jeremy Allison
  Cc: Christoph Hellwig, Ulrich Drepper, Andrew Morton,
	Daniel Ehrenberg, Jens Axboe, Jeff Moyer, linux-kernel,
	linux-aio

On Thu, Sep 01, 2011 at 09:31:07AM -0700, Jeremy Allison wrote:
> On Thu, Sep 01, 2011 at 12:23:37PM -0400, Christoph Hellwig wrote:
> > On Thu, Sep 01, 2011 at 09:15:31AM -0700, Jeremy Allison wrote:
> > > > We could easily give you an fcntl / dup3 flag to only release posix
> > > > locks on the final close of a struct file if that helps you.
> > > 
> > > That would help us enormously - it'd be Linux only of course but
> > > we could easily add support for that.
> > > 
> > > Can you propose the design here so we can run it past some of the
> > > Solaris/FreeBSD folks (it'd be nice if we could get broader adoption) ?
> > 
> > Not sure there is all that much to discuss.  The idea is to have locks
> > that behave like Posix locks, but only get released when the last duped
> > fd for them gets closed.
> > 
> > We'd define a new O_LOCKS_WHATEVER flag for it, which gets set either
> > using fcntl(..., F_SETFL, ...) or dup3.  All in all that should be less
> > than 50 lines of code in the kernel.
> 
> Ok, so it'd be set at open() time, say:
> 
> O_CLOLOCK_PERSIST
> 
> (to match the naming of something like O_CLOEXEC) and be available to set
> with F_SETFD via an fcntl and dup3 call?

Ah, looking at fcntl - do you want to set/get this via F_SETFD/F_GETFD,
or via F_SETFL/F_GETFL? I.e. is this a file descriptor flag, or a status
flag? I'd guess a file descriptor flag, but I'm not sure of the difference
here...

Jeremy.

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: Approaches to making io_submit not block
  2011-09-01 16:34                                     ` Jeremy Allison
@ 2011-09-01 16:45                                       ` Christoph Hellwig
  2011-09-01 16:57                                         ` Jeremy Allison
  0 siblings, 1 reply; 41+ messages in thread
From: Christoph Hellwig @ 2011-09-01 16:45 UTC (permalink / raw)
  To: Jeremy Allison
  Cc: Christoph Hellwig, Ulrich Drepper, Andrew Morton,
	Daniel Ehrenberg, Jens Axboe, Jeff Moyer, linux-kernel,
	linux-aio

On Thu, Sep 01, 2011 at 09:34:52AM -0700, Jeremy Allison wrote:
> > 
> > (to match the naming of something like O_CLOEXEC) and be available to set
> > with F_SETFD via an fcntl and dup3 call?
> 
> Ah, looking at fcntl - do you want to set/get this via F_SETFD/F_GETFD,
> or via F_SETFL/F_GETFL? I.e. is this a file descriptor flag, or a status
> flag? I'd guess a file descriptor flag, but I'm not sure of the difference
> here...

F_SETFD/F_GETFD operates on an entry in the fd table, i.e. separately
for each fd if you have duped fds.  That's exactly what we do _not_ want here.
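
(An illustration of that difference:)

#include <fcntl.h>
#include <unistd.h>

static void fd_flags_vs_file_flags(int fd)
{
    int dupfd = dup(fd);

    /* Per-descriptor: dupfd does NOT pick up FD_CLOEXEC. */
    fcntl(fd, F_SETFD, FD_CLOEXEC);

    /* Per open file description: dupfd DOES now see O_NONBLOCK. */
    fcntl(fd, F_SETFL, fcntl(fd, F_GETFL) | O_NONBLOCK);

    close(dupfd);
}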


^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: Approaches to making io_submit not block
  2011-09-01 16:45                                       ` Christoph Hellwig
@ 2011-09-01 16:57                                         ` Jeremy Allison
  0 siblings, 0 replies; 41+ messages in thread
From: Jeremy Allison @ 2011-09-01 16:57 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Jeremy Allison, Ulrich Drepper, Andrew Morton, Daniel Ehrenberg,
	Jens Axboe, Jeff Moyer, linux-kernel, linux-aio

On Thu, Sep 01, 2011 at 12:45:54PM -0400, Christoph Hellwig wrote:
> On Thu, Sep 01, 2011 at 09:34:52AM -0700, Jeremy Allison wrote:
> > > 
> > > (to match the naming of something like O_CLOEXEC) and be available to set
> > > with F_SETFD via an fcntl and dup3 call?
> > 
> > Ah, looking at fcntl - do you want to set/get this via F_SETFD/F_GETFD,
> > or via F_SETFL/F_GETFL? I.e. is this a file descriptor flag, or a status
> > flag? I'd guess a file descriptor flag, but I'm not sure of the difference
> > here...
> 
> F_SETFD/F_GETFD operates on an entry in the fd table, i.e. separately
> for each fd if you have duped fds.  That's exactly what we do _not_ want here.

Ah, ok - I understand the difference now. So it's a status flag
get/set with F_SETFL/F_GETFL.

Ok, that works for me. Get me a patch in the kernel and I'll code
up the Samba changes very shortly (actually should be quite
easy :-).

Jeremy.

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: Approaches to making io_submit not block
  2011-09-01  6:54             ` Dave Chinner
@ 2011-09-02 13:08               ` Ted Ts'o
  2011-09-02 13:10                 ` Christoph Hellwig
  0 siblings, 1 reply; 41+ messages in thread
From: Ted Ts'o @ 2011-09-02 13:08 UTC (permalink / raw)
  To: Dave Chinner
  Cc: Andi Kleen, Christoph Hellwig, Daniel Ehrenberg, linux-kernel

On Thu, Sep 01, 2011 at 04:54:15PM +1000, Dave Chinner wrote:
> Apples to oranges - there's orders of magnitude of difference in the
> number of operations that the different stacks do. Allocation in XFS
> when it does not block can still take milliseconds of CPU time; in
> comparison, the networking stack is expected to process thousands of
> packets in that same time frame.  IOWs, the scale of processing per
> item of work is -vastly- different - that's why working in process
> context matters a great deal to the networking stack but not to
> allocation in XFS.

That may be true for hard drives, but PCIe-attached flash can support
millions of IOPS --- i.e., at least hundreds of I/O operations per
millisecond.  Yes, these devices are expensive, but so are the
thousand-disk RAID arrays that some people attach via XFS.  :-)

There are people in the ext4 development community interested in
looking at such devices.  We've made some improvements, but we (and by
this I mean the whole kernel) are a long, long way from supporting
such beasts properly....

							- Ted

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: Approaches to making io_submit not block
  2011-09-02 13:08               ` Ted Ts'o
@ 2011-09-02 13:10                 ` Christoph Hellwig
  0 siblings, 0 replies; 41+ messages in thread
From: Christoph Hellwig @ 2011-09-02 13:10 UTC (permalink / raw)
  To: Ted Ts'o, Dave Chinner, Andi Kleen, Christoph Hellwig,
	Daniel Ehrenberg, linux-kernel

On Fri, Sep 02, 2011 at 09:08:35AM -0400, Ted Ts'o wrote:
> That may be true for hard drives, but PCIe-attached flash can support
> millions of IOPS --- i.e., at least hundreds of I/O operations per
> millisecond.  Yes, these devices are expensive, but so are the
> thousand-disk RAID arrays that some people attach via XFS.  :-)

Reading / writing to the devices is different from allocator calls.

You really do not want to do an allocator call for each I/O to these devices.


^ permalink raw reply	[flat|nested] 41+ messages in thread

end of thread, other threads:[~2011-09-02 13:11 UTC | newest]

Thread overview: 41+ messages
2011-08-29 17:33 Approaches to making io_submit not block Daniel Ehrenberg
2011-08-30  5:32 ` Christoph Hellwig
2011-08-30 21:51   ` Daniel Ehrenberg
2011-08-31  5:26     ` Christoph Hellwig
2011-08-31 17:08       ` Andi Kleen
2011-08-31 21:00         ` Daniel Ehrenberg
2011-08-31 21:15           ` Andi Kleen
2011-09-01  4:18         ` Dave Chinner
2011-09-01  4:39           ` Andi Kleen
2011-09-01  6:54             ` Dave Chinner
2011-09-02 13:08               ` Ted Ts'o
2011-09-02 13:10                 ` Christoph Hellwig
2011-09-01  3:39       ` Dave Chinner
2011-09-01  4:20         ` Christoph Hellwig
2011-08-30  7:02 ` Andi Kleen
     [not found] ` <CAAK6Zt0Sh1GdEOb-tNf2FGXJs=e1Jbcqew13R_GdTqrv6vW97w@mail.gmail.com>
     [not found]   ` <x49k49uk2ox.fsf@segfault.boston.devel.redhat.com>
     [not found]     ` <4E5D5817.6040704@kernel.dk>
2011-08-30 22:19       ` Daniel Ehrenberg
2011-08-30 22:32         ` Jens Axboe
2011-08-30 22:41           ` Andrew Morton
2011-08-30 22:45             ` Daniel Ehrenberg
2011-08-30 22:54               ` Andrew Morton
2011-08-30 23:03                 ` Jeremy Allison
2011-08-30 23:11                   ` Andrew Morton
2011-08-31 11:04                     ` Ulrich Drepper
2011-08-31 16:59                       ` Jeremy Allison
2011-09-01 11:14                         ` Ulrich Drepper
2011-09-01 15:58                           ` Jeremy Allison
2011-09-01 16:04                             ` Christoph Hellwig
2011-09-01 16:15                               ` Jeremy Allison
2011-09-01 16:23                                 ` Christoph Hellwig
2011-09-01 16:31                                   ` Jeremy Allison
2011-09-01 16:34                                     ` Christoph Hellwig
2011-09-01 16:34                                     ` Jeremy Allison
2011-09-01 16:45                                       ` Christoph Hellwig
2011-09-01 16:57                                         ` Jeremy Allison
2011-08-31  5:34                   ` Christoph Hellwig
2011-08-31  6:04                 ` guy keren
2011-08-31 23:16                   ` Daniel Ehrenberg
2011-08-31 23:48                     ` guy keren
2011-08-31 23:59                       ` Daniel Ehrenberg
2011-08-31 15:45                 ` Gleb Natapov
2011-08-31 16:02                   ` Avi Kivity
