linux-kernel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* New AIO API
@ 2013-04-12 22:28 Kent Overstreet
  2013-04-15 22:31 ` Andrew Morton
  0 siblings, 1 reply; 4+ messages in thread
From: Kent Overstreet @ 2013-04-12 22:28 UTC (permalink / raw)
  To: linux-aio, linux-fsdevel, linux-kernel, lsf-pc
  Cc: akpm, Zach Brown, Felipe Balbi, Greg Kroah-Hartman, Mark Fasheh,
	Joel Becker, Rusty Russell, Jens Axboe, Asai Thambi S P,
	Selvan Mani, Sam Bradshaw, Al Viro, Benjamin LaHaise,
	Theodore Ts'o

So, awhile back I posted about an extensible AIO attributes mechanism
I'd been cooking up: http://article.gmane.org/gmane.linux.kernel/1367969

Since then, more uses for the thing have been popping up, but I ran into
a roadblock - with the existing AIO api, return values for the
attributes were going to be, at best, considerably uglier than I
anticipated.

Some background: some attributes we'd like to implement need to be able
to return values with the io_event at completion time. Many of the
examples I know of are more or less tracing - returning how long the IO
took, whether it was a cache hit or miss (bcache, perhaps page cache
when buffered AIO is supported), etc.

Additionally, you probably want to be able to return whether the
attribute was supported/handled at all (because of differing kernel
versions, or because it was driver specific) and we need attribute
returns to be able to sanely handle that.

So my opinion is that the only really sane way to implement attribute
return values is to pass them back to userspace via the ringbuffer,
along with the struct io_event.

(For those not intimately familiar with the AIO implementation, on
completion the generated io_event is copied into a ringbuffer which
happens to be mapped into userspace, even though normally userspace will
get the io_event with io_getevents(). This ringbuffer constrains the
design quite a bit, though).

Trouble is, we (probably, there is some debate) can't really just change
the existing ringbuffer format - there's a version field in the existing
ringbuffer, but userspace can't check that until after the ringbuffer is
setup and mapped into userspace. There's no existing mechanism for
userspace to specify flags or options or versioning when setting up the
io context.

So, to do this requires new syscalls, and more or less forking most of
the existing AIO implementation. Also, returning variable length entries
via the ringbuffer turns out to require redesigning a substantial
fraction of the existing AIO implementation - so we might as well fix
everything else that needs fixing at the same time.

Where I'm at now - I've got a new syscall interface that changes enough
to support extensible AIO attributes prototyped; it looks almost
complete but I haven't started testing yet. Enough is there to see how
it all fits together, though - IMO the important bits are how we deal
with different types of kioctxs (I think it works out fairly nicely).

Code is available at http://evilpiepirate.org/git/linux-bcache.git/ aio-new-abi
(Definitely broken, don't even think about trying to run it yet).

We plan on rolling this out at Google in the near term with the minimal
set of changes (because we've got stuff blocked on this), but there's
more changes I'd like to make before this (hopefully) goes upstream.

So, what changes?

 * Currently, we strictly limit outstanding kiocbs so as to avoid
   overflowing the ringbuffer; this means that the size of the
   ringubffer we allocate is determined by the nr_events userspace
   passes to io_setup().

   This approach doesn't work when ringbuffer entries are variable
   length - we can still use a ringbuffer (and I think we want to), but
   we need to have an overflow mechanism for when it fills up.

   This is actually one of the backwards compatibility issues;
   currently, it is possible for userspace to reap io_events without
   ever calling into the kernel. But if we've got an overflow mechanism,
   that's no longer possible - userspace has to call io_getevents() when
   the ringbuffer's empty, or it'll never see events that might've been
   on the overflow list - that or we need to put a flag in the
   ringbuffer header.

   Adding the overflow mechanism is an overall reduction in complexity
   though, we can toss out a bunch of code elsewhere and ringbuffer size
   isn't so important anymore.

 * With the way the head/tail pointers are defined in the current
   ringbuffer implentation, we can't do lockless reaping without being
   subject to ABA. I've fixed this in my prototype - the head/tail
   values use the full range of 32 bit integers, we only mod them by the
   ringbuffer size when calculating the current position.

 * The head/tail pointers, and also io_submit()/io_getevents() all work
   in units of struct iocb/struct io_event. With attributes those
   structs are now variable length, so it makes more sense to switch
   all the units to bytes.

   With these changes, the ringbuffer implementation is looking less and
   less AIO specific. I've been wondering a bit whether it could be made
   generic and merged with other ringbuffers (I'm not sure what else
   there is offhand, besides tracing - tracing has substantially
   different needs, but I'd be surprised if there aren't other similar
   ringbuffers somewhere).

 * The eventfd field should've never been added to struct iocb, imo -
   it should've been added to the kioctx (You don't want to know when a
   specific iocb is done, there isn't any way to check for that directly
   - you want to know when there's events to reap). I'm fixing that.

 * Adding a version parameter to io_setup2()

Those are the main changes (besides adding attributes, of course) that
I've made so far. 

 * Get rid of the parallel syscall interface 

   AIO really shouldn't be implementing its own slightly different
   syscalls; it should be a mechanism for doing syscalls asynchronously.

   If we don't have asynchronous implementations of most of our syscalls
   right now, so what? Tying the interface to the implementation is
   still stupid. And if we're lucky, someday we'll have a generic thread
   pool implementation for all the syscalls that aren't worth special
   casing (perhaps building off the work Ben LaHaise has been doing to
   implement buffered AIO).

   This is particularly important now with attributes - almost none of
   the attributes we want to implement are actually AIO specific; we'd
   like to be able to use them with arbitrary syscalls.

   Well, if we turn AIO into a mechanism for doing arbitrary syscalls
   asynchronously - it'll be really easy to add one syscall to issue an
   iocb synchronously; at that point it'll just be an "issue this
   syscall with attributes" syscall.

^ permalink raw reply	[flat|nested] 4+ messages in thread

* Re: New AIO API
  2013-04-12 22:28 New AIO API Kent Overstreet
@ 2013-04-15 22:31 ` Andrew Morton
  2013-04-16  1:18   ` Rusty Russell
  0 siblings, 1 reply; 4+ messages in thread
From: Andrew Morton @ 2013-04-15 22:31 UTC (permalink / raw)
  To: Kent Overstreet
  Cc: linux-aio, linux-fsdevel, linux-kernel, lsf-pc, Zach Brown,
	Felipe Balbi, Greg Kroah-Hartman, Mark Fasheh, Joel Becker,
	Rusty Russell, Jens Axboe, Asai Thambi S P, Selvan Mani,
	Sam Bradshaw, Al Viro, Benjamin LaHaise, Theodore Ts'o

On Fri, 12 Apr 2013 15:28:56 -0700 Kent Overstreet <koverstreet@google.com> wrote:

> So, awhile back I posted about an extensible AIO attributes mechanism
> I'd been cooking up: http://article.gmane.org/gmane.linux.kernel/1367969
> 
> Since then, more uses for the thing have been popping up, but I ran into
> a roadblock - with the existing AIO api, return values for the
> attributes were going to be, at best, considerably uglier than I
> anticipated.
> 
> Some background: some attributes we'd like to implement need to be able
> to return values with the io_event at completion time. Many of the
> examples I know of are more or less tracing - returning how long the IO
> took, whether it was a cache hit or miss (bcache, perhaps page cache
> when buffered AIO is supported), etc.
> 
> Additionally, you probably want to be able to return whether the
> attribute was supported/handled at all (because of differing kernel
> versions, or because it was driver specific) and we need attribute
> returns to be able to sanely handle that.
> 
> So my opinion is that the only really sane way to implement attribute
> return values is to pass them back to userspace via the ringbuffer,
> along with the struct io_event.
> 
> (For those not intimately familiar with the AIO implementation, on
> completion the generated io_event is copied into a ringbuffer which
> happens to be mapped into userspace, even though normally userspace will
> get the io_event with io_getevents(). This ringbuffer constrains the
> design quite a bit, though).
> 
> Trouble is, we (probably, there is some debate) can't really just change
> the existing ringbuffer format - there's a version field in the existing
> ringbuffer, but userspace can't check that until after the ringbuffer is
> setup and mapped into userspace. There's no existing mechanism for
> userspace to specify flags or options or versioning when setting up the
> io context.
> 
> So, to do this requires new syscalls, and more or less forking most of
> the existing AIO implementation. Also, returning variable length entries
> via the ringbuffer turns out to require redesigning a substantial
> fraction of the existing AIO implementation - so we might as well fix
> everything else that needs fixing at the same time.

This all sounds like a lot of work, risk, disruption, bloat, etc, etc. 
That's not a problem per-se, but it should only be undertaken if the
payback makes it worthwhile.

Unfortunately your email contains only a terse description of this most
important factor: if we add all this stuff to Linux, what do we get in
return?  "More or less tracing".  Is that useful enough to justify the
changes?  Please let's pay a lot more attention to this question before
getting further into implementation stuff!  Sell it to us.

> Those are the main changes (besides adding attributes, of course) that
> I've made so far. 
> 
>  * Get rid of the parallel syscall interface 
> 
>    AIO really shouldn't be implementing its own slightly different
>    syscalls; it should be a mechanism for doing syscalls asynchronously.

Yes.  We got about a twelfth of the way there many years ago
(google("syslets")) but it died.  A shame.


^ permalink raw reply	[flat|nested] 4+ messages in thread

* Re: New AIO API
  2013-04-15 22:31 ` Andrew Morton
@ 2013-04-16  1:18   ` Rusty Russell
  2013-04-16 17:48     ` Jan Kara
  0 siblings, 1 reply; 4+ messages in thread
From: Rusty Russell @ 2013-04-16  1:18 UTC (permalink / raw)
  To: Andrew Morton, Kent Overstreet
  Cc: linux-aio, linux-fsdevel, linux-kernel, lsf-pc, Zach Brown,
	Felipe Balbi, Greg Kroah-Hartman, Mark Fasheh, Joel Becker,
	Jens Axboe, Asai Thambi S P, Selvan Mani, Sam Bradshaw, Al Viro,
	Benjamin LaHaise, Theodore Ts'o

Andrew Morton <akpm@linux-foundation.org> writes:
> On Fri, 12 Apr 2013 15:28:56 -0700 Kent Overstreet <koverstreet@google.com> wrote:
>> Those are the main changes (besides adding attributes, of course) that
>> I've made so far. 
>> 
>>  * Get rid of the parallel syscall interface 
>> 
>>    AIO really shouldn't be implementing its own slightly different
>>    syscalls; it should be a mechanism for doing syscalls asynchronously.
>
> Yes.  We got about a twelfth of the way there many years ago
> (google("syslets")) but it died.  A shame.

Yeah, letting the current process keep waiting and creating a new one
which returns is a fascinating idea, but you really need to swizzle the
PIDs so that the "new" one is identical to the old.  Otherwise the API
is unbearable...

Good luck!
Rusty.

^ permalink raw reply	[flat|nested] 4+ messages in thread

* Re: New AIO API
  2013-04-16  1:18   ` Rusty Russell
@ 2013-04-16 17:48     ` Jan Kara
  0 siblings, 0 replies; 4+ messages in thread
From: Jan Kara @ 2013-04-16 17:48 UTC (permalink / raw)
  To: Rusty Russell
  Cc: Andrew Morton, Kent Overstreet, linux-aio, linux-fsdevel,
	linux-kernel, lsf-pc, Zach Brown, Felipe Balbi,
	Greg Kroah-Hartman, Mark Fasheh, Joel Becker, Jens Axboe,
	Asai Thambi S P, Selvan Mani, Sam Bradshaw, Al Viro,
	Benjamin LaHaise, Theodore Ts'o

On Tue 16-04-13 10:48:35, Rusty Russell wrote:
> Andrew Morton <akpm@linux-foundation.org> writes:
> > On Fri, 12 Apr 2013 15:28:56 -0700 Kent Overstreet <koverstreet@google.com> wrote:
> >> Those are the main changes (besides adding attributes, of course) that
> >> I've made so far. 
> >> 
> >>  * Get rid of the parallel syscall interface 
> >> 
> >>    AIO really shouldn't be implementing its own slightly different
> >>    syscalls; it should be a mechanism for doing syscalls asynchronously.
> >
> > Yes.  We got about a twelfth of the way there many years ago
> > (google("syslets")) but it died.  A shame.
> 
> Yeah, letting the current process keep waiting and creating a new one
> which returns is a fascinating idea, but you really need to swizzle the
> PIDs so that the "new" one is identical to the old.  Otherwise the API
> is unbearable...
  But when we do crazy stuff like pid namespaces these days somehow
switching pids shouldn't be *that* hard... Should it? Just a crazy idea
that occured to me now :)

								Honza
-- 
Jan Kara <jack@suse.cz>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 4+ messages in thread

end of thread, other threads:[~2013-04-16 17:48 UTC | newest]

Thread overview: 4+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2013-04-12 22:28 New AIO API Kent Overstreet
2013-04-15 22:31 ` Andrew Morton
2013-04-16  1:18   ` Rusty Russell
2013-04-16 17:48     ` Jan Kara

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).