All of lore.kernel.org
 help / color / mirror / Atom feed
* [patch] fs: aio fix rcu lookup
@ 2011-01-14  1:35 Nick Piggin
  2011-01-14 14:52 ` Jeff Moyer
  0 siblings, 1 reply; 40+ messages in thread
From: Nick Piggin @ 2011-01-14  1:35 UTC (permalink / raw)
  To: Andrew Morton, linux-fsdevel, linux-kernel

[-- Attachment #1: Type: text/plain, Size: 81 bytes --]

Hi,

While hunting down a bug in NFS's AIO, I believe I found this
buggy code...

[-- Attachment #2: fs-aio-race-fix.patch --]
[-- Type: application/octet-stream, Size: 1796 bytes --]

fs: aio fix rcu ioctx lookup

aio-dio-invalidate-failure GPFs in aio_put_req from io_submit.

lookup_ioctx doesn't implement the rcu lookup pattern properly.
rcu_read_lock does not prevent refcount going to zero, so we
might take a refcount on a zero count ioctx.

Signed-off-by: Nick Piggin <npiggin@kernel.dk>

Index: linux-2.6/fs/aio.c
===================================================================
--- linux-2.6.orig/fs/aio.c	2011-01-14 00:29:00.000000000 +1100
+++ linux-2.6/fs/aio.c	2011-01-14 11:31:47.000000000 +1100
@@ -239,15 +239,23 @@ static void __put_ioctx(struct kioctx *c
 	call_rcu(&ctx->rcu_head, ctx_rcu_free);
 }
 
-#define get_ioctx(kioctx) do {						\
-	BUG_ON(atomic_read(&(kioctx)->users) <= 0);			\
-	atomic_inc(&(kioctx)->users);					\
-} while (0)
-#define put_ioctx(kioctx) do {						\
-	BUG_ON(atomic_read(&(kioctx)->users) <= 0);			\
-	if (unlikely(atomic_dec_and_test(&(kioctx)->users))) 		\
-		__put_ioctx(kioctx);					\
-} while (0)
+static inline void get_ioctx(struct kioctx *kioctx)
+{
+	BUG_ON(atomic_read(&kioctx->users) <= 0);
+	atomic_inc(&kioctx->users);
+}
+
+static inline int try_get_ioctx(struct kioctx *kioctx)
+{
+	return atomic_inc_not_zero(&kioctx->users);
+}
+
+static inline void put_ioctx(struct kioctx *kioctx)
+{
+	BUG_ON(atomic_read(&kioctx->users) <= 0);
+	if (unlikely(atomic_dec_and_test(&kioctx->users)))
+		__put_ioctx(kioctx);
+}
 
 /* ioctx_alloc
  *	Allocates and initializes an ioctx.  Returns an ERR_PTR if it failed.
@@ -601,8 +609,7 @@ static struct kioctx *lookup_ioctx(unsig
 	rcu_read_lock();
 
 	hlist_for_each_entry_rcu(ctx, n, &mm->ioctx_list, list) {
-		if (ctx->user_id == ctx_id && !ctx->dead) {
-			get_ioctx(ctx);
+		if (ctx->user_id == ctx_id && !ctx->dead && try_get_ioctx(ctx)){
 			ret = ctx;
 			break;
 		}

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [patch] fs: aio fix rcu lookup
  2011-01-14  1:35 [patch] fs: aio fix rcu lookup Nick Piggin
@ 2011-01-14 14:52 ` Jeff Moyer
  2011-01-14 15:00     ` Nick Piggin
  0 siblings, 1 reply; 40+ messages in thread
From: Jeff Moyer @ 2011-01-14 14:52 UTC (permalink / raw)
  To: Nick Piggin; +Cc: Andrew Morton, linux-fsdevel, linux-kernel

Nick Piggin <npiggin@gmail.com> writes:

> Hi,
>
> While hunting down a bug in NFS's AIO, I believe I found this
> buggy code...
>
> fs: aio fix rcu ioctx lookup
>
> aio-dio-invalidate-failure GPFs in aio_put_req from io_submit.
>
> lookup_ioctx doesn't implement the rcu lookup pattern properly.
> rcu_read_lock does not prevent refcount going to zero, so we
> might take a refcount on a zero count ioctx.

So, does this patch fix the problem?  You didn't actually say....

> Signed-off-by: Nick Piggin <npiggin@kernel.dk>
>
> Index: linux-2.6/fs/aio.c
> ===================================================================
> --- linux-2.6.orig/fs/aio.c	2011-01-14 00:29:00.000000000 +1100
> +++ linux-2.6/fs/aio.c	2011-01-14 11:31:47.000000000 +1100
> @@ -239,15 +239,23 @@ static void __put_ioctx(struct kioctx *c
>  	call_rcu(&ctx->rcu_head, ctx_rcu_free);
>  }
>  
> -#define get_ioctx(kioctx) do {						\
> -	BUG_ON(atomic_read(&(kioctx)->users) <= 0);			\
> -	atomic_inc(&(kioctx)->users);					\
> -} while (0)
> -#define put_ioctx(kioctx) do {						\
> -	BUG_ON(atomic_read(&(kioctx)->users) <= 0);			\
> -	if (unlikely(atomic_dec_and_test(&(kioctx)->users))) 		\
> -		__put_ioctx(kioctx);					\
> -} while (0)
> +static inline void get_ioctx(struct kioctx *kioctx)
> +{
> +	BUG_ON(atomic_read(&kioctx->users) <= 0);
> +	atomic_inc(&kioctx->users);
> +}
> +
> +static inline int try_get_ioctx(struct kioctx *kioctx)
> +{
> +	return atomic_inc_not_zero(&kioctx->users);
> +}
> +
> +static inline void put_ioctx(struct kioctx *kioctx)
> +{
> +	BUG_ON(atomic_read(&kioctx->users) <= 0);
> +	if (unlikely(atomic_dec_and_test(&kioctx->users)))
> +		__put_ioctx(kioctx);
> +}

Why did you switch from macros?  Personal preference?  Can you at least
mention it in the changelog?

>  
>  /* ioctx_alloc
>   *	Allocates and initializes an ioctx.  Returns an ERR_PTR if it failed.
> @@ -601,8 +609,7 @@ static struct kioctx *lookup_ioctx(unsig
>  	rcu_read_lock();
>  
>  	hlist_for_each_entry_rcu(ctx, n, &mm->ioctx_list, list) {
> -		if (ctx->user_id == ctx_id && !ctx->dead) {
> -			get_ioctx(ctx);
> +		if (ctx->user_id == ctx_id && !ctx->dead && try_get_ioctx(ctx)){
>  			ret = ctx;
>  			break;
>  		}

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [patch] fs: aio fix rcu lookup
  2011-01-14 14:52 ` Jeff Moyer
@ 2011-01-14 15:00     ` Nick Piggin
  0 siblings, 0 replies; 40+ messages in thread
From: Nick Piggin @ 2011-01-14 15:00 UTC (permalink / raw)
  To: Jeff Moyer; +Cc: Andrew Morton, linux-fsdevel, linux-kernel

On Sat, Jan 15, 2011 at 1:52 AM, Jeff Moyer <jmoyer@redhat.com> wrote:
> Nick Piggin <npiggin@gmail.com> writes:
>
>> Hi,
>>
>> While hunting down a bug in NFS's AIO, I believe I found this
>> buggy code...
>>
>> fs: aio fix rcu ioctx lookup
>>
>> aio-dio-invalidate-failure GPFs in aio_put_req from io_submit.
>>
>> lookup_ioctx doesn't implement the rcu lookup pattern properly.
>> rcu_read_lock does not prevent refcount going to zero, so we
>> might take a refcount on a zero count ioctx.
>
> So, does this patch fix the problem?  You didn't actually say....

No, it seemd to be an NFS AIO problem, although it was a
slightly older kernel so I'll re test after -rc1 if I haven't heard
back about it.

Do you agree with the theoretical problem? I didn't try to
write a racer to break it yet. Inserting a delay before the
get_ioctx might do the trick.


>> Signed-off-by: Nick Piggin <npiggin@kernel.dk>
>>
>> Index: linux-2.6/fs/aio.c
>> ===================================================================
>> --- linux-2.6.orig/fs/aio.c   2011-01-14 00:29:00.000000000 +1100
>> +++ linux-2.6/fs/aio.c        2011-01-14 11:31:47.000000000 +1100
>> @@ -239,15 +239,23 @@ static void __put_ioctx(struct kioctx *c
>>       call_rcu(&ctx->rcu_head, ctx_rcu_free);
>>  }
>>
>> -#define get_ioctx(kioctx) do {                                               \
>> -     BUG_ON(atomic_read(&(kioctx)->users) <= 0);                     \
>> -     atomic_inc(&(kioctx)->users);                                   \
>> -} while (0)
>> -#define put_ioctx(kioctx) do {                                               \
>> -     BUG_ON(atomic_read(&(kioctx)->users) <= 0);                     \
>> -     if (unlikely(atomic_dec_and_test(&(kioctx)->users)))            \
>> -             __put_ioctx(kioctx);                                    \
>> -} while (0)
>> +static inline void get_ioctx(struct kioctx *kioctx)
>> +{
>> +     BUG_ON(atomic_read(&kioctx->users) <= 0);
>> +     atomic_inc(&kioctx->users);
>> +}
>> +
>> +static inline int try_get_ioctx(struct kioctx *kioctx)
>> +{
>> +     return atomic_inc_not_zero(&kioctx->users);
>> +}
>> +
>> +static inline void put_ioctx(struct kioctx *kioctx)
>> +{
>> +     BUG_ON(atomic_read(&kioctx->users) <= 0);
>> +     if (unlikely(atomic_dec_and_test(&kioctx->users)))
>> +             __put_ioctx(kioctx);
>> +}
>
> Why did you switch from macros?  Personal preference?  Can you at least
> mention it in the changelog?

Yeah, I couldn't bring myself to add another macro :) I can mention
it, sure.

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [patch] fs: aio fix rcu lookup
@ 2011-01-14 15:00     ` Nick Piggin
  0 siblings, 0 replies; 40+ messages in thread
From: Nick Piggin @ 2011-01-14 15:00 UTC (permalink / raw)
  To: Jeff Moyer; +Cc: Andrew Morton, linux-fsdevel, linux-kernel

On Sat, Jan 15, 2011 at 1:52 AM, Jeff Moyer <jmoyer@redhat.com> wrote:
> Nick Piggin <npiggin@gmail.com> writes:
>
>> Hi,
>>
>> While hunting down a bug in NFS's AIO, I believe I found this
>> buggy code...
>>
>> fs: aio fix rcu ioctx lookup
>>
>> aio-dio-invalidate-failure GPFs in aio_put_req from io_submit.
>>
>> lookup_ioctx doesn't implement the rcu lookup pattern properly.
>> rcu_read_lock does not prevent refcount going to zero, so we
>> might take a refcount on a zero count ioctx.
>
> So, does this patch fix the problem?  You didn't actually say....

No, it seemd to be an NFS AIO problem, although it was a
slightly older kernel so I'll re test after -rc1 if I haven't heard
back about it.

Do you agree with the theoretical problem? I didn't try to
write a racer to break it yet. Inserting a delay before the
get_ioctx might do the trick.


>> Signed-off-by: Nick Piggin <npiggin@kernel.dk>
>>
>> Index: linux-2.6/fs/aio.c
>> ===================================================================
>> --- linux-2.6.orig/fs/aio.c   2011-01-14 00:29:00.000000000 +1100
>> +++ linux-2.6/fs/aio.c        2011-01-14 11:31:47.000000000 +1100
>> @@ -239,15 +239,23 @@ static void __put_ioctx(struct kioctx *c
>>       call_rcu(&ctx->rcu_head, ctx_rcu_free);
>>  }
>>
>> -#define get_ioctx(kioctx) do {                                               \
>> -     BUG_ON(atomic_read(&(kioctx)->users) <= 0);                     \
>> -     atomic_inc(&(kioctx)->users);                                   \
>> -} while (0)
>> -#define put_ioctx(kioctx) do {                                               \
>> -     BUG_ON(atomic_read(&(kioctx)->users) <= 0);                     \
>> -     if (unlikely(atomic_dec_and_test(&(kioctx)->users)))            \
>> -             __put_ioctx(kioctx);                                    \
>> -} while (0)
>> +static inline void get_ioctx(struct kioctx *kioctx)
>> +{
>> +     BUG_ON(atomic_read(&kioctx->users) <= 0);
>> +     atomic_inc(&kioctx->users);
>> +}
>> +
>> +static inline int try_get_ioctx(struct kioctx *kioctx)
>> +{
>> +     return atomic_inc_not_zero(&kioctx->users);
>> +}
>> +
>> +static inline void put_ioctx(struct kioctx *kioctx)
>> +{
>> +     BUG_ON(atomic_read(&kioctx->users) <= 0);
>> +     if (unlikely(atomic_dec_and_test(&kioctx->users)))
>> +             __put_ioctx(kioctx);
>> +}
>
> Why did you switch from macros?  Personal preference?  Can you at least
> mention it in the changelog?

Yeah, I couldn't bring myself to add another macro :) I can mention
it, sure.
--
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [patch] fs: aio fix rcu lookup
  2011-01-14 15:00     ` Nick Piggin
  (?)
@ 2011-01-17 19:07     ` Jeff Moyer
  2011-01-17 23:24       ` Nick Piggin
  -1 siblings, 1 reply; 40+ messages in thread
From: Jeff Moyer @ 2011-01-17 19:07 UTC (permalink / raw)
  To: Nick Piggin; +Cc: Andrew Morton, linux-fsdevel, linux-kernel

Nick Piggin <npiggin@gmail.com> writes:

> On Sat, Jan 15, 2011 at 1:52 AM, Jeff Moyer <jmoyer@redhat.com> wrote:
>> Nick Piggin <npiggin@gmail.com> writes:
>>
>>> Hi,
>>>
>>> While hunting down a bug in NFS's AIO, I believe I found this
>>> buggy code...
>>>
>>> fs: aio fix rcu ioctx lookup
>>>
>>> aio-dio-invalidate-failure GPFs in aio_put_req from io_submit.
>>>
>>> lookup_ioctx doesn't implement the rcu lookup pattern properly.
>>> rcu_read_lock does not prevent refcount going to zero, so we
>>> might take a refcount on a zero count ioctx.
>>
>> So, does this patch fix the problem?  You didn't actually say....
>
> No, it seemd to be an NFS AIO problem, although it was a
> slightly older kernel so I'll re test after -rc1 if I haven't heard
> back about it.

OK.

> Do you agree with the theoretical problem? I didn't try to
> write a racer to break it yet. Inserting a delay before the
> get_ioctx might do the trick.

I'm not convinced, no.  The last reference to the kioctx is always the
process, released in the exit_aio path, or via sys_io_destroy.  In both
cases, we cancel all aios, then wait for them all to complete before
dropping the final reference to the context.

So, while I agree that what you wrote is better, I remain unconvinced of
it solving a real-world problem.  Feel free to push it in as a cleanup,
though.

Cheers,
Jeff

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [patch] fs: aio fix rcu lookup
  2011-01-17 19:07     ` Jeff Moyer
@ 2011-01-17 23:24       ` Nick Piggin
  2011-01-18 17:21           ` Jeff Moyer
  2011-01-18 19:01         ` Jan Kara
  0 siblings, 2 replies; 40+ messages in thread
From: Nick Piggin @ 2011-01-17 23:24 UTC (permalink / raw)
  To: Jeff Moyer; +Cc: Andrew Morton, linux-fsdevel, linux-kernel

On Tue, Jan 18, 2011 at 6:07 AM, Jeff Moyer <jmoyer@redhat.com> wrote:
> Nick Piggin <npiggin@gmail.com> writes:
>
>> On Sat, Jan 15, 2011 at 1:52 AM, Jeff Moyer <jmoyer@redhat.com> wrote:
>>> Nick Piggin <npiggin@gmail.com> writes:
>>>
>>>> Hi,
>>>>
>>>> While hunting down a bug in NFS's AIO, I believe I found this
>>>> buggy code...
>>>>
>>>> fs: aio fix rcu ioctx lookup
>>>>
>>>> aio-dio-invalidate-failure GPFs in aio_put_req from io_submit.
>>>>
>>>> lookup_ioctx doesn't implement the rcu lookup pattern properly.
>>>> rcu_read_lock does not prevent refcount going to zero, so we
>>>> might take a refcount on a zero count ioctx.
>>>
>>> So, does this patch fix the problem?  You didn't actually say....
>>
>> No, it seemd to be an NFS AIO problem, although it was a
>> slightly older kernel so I'll re test after -rc1 if I haven't heard
>> back about it.
>
> OK.
>
>> Do you agree with the theoretical problem? I didn't try to
>> write a racer to break it yet. Inserting a delay before the
>> get_ioctx might do the trick.
>
> I'm not convinced, no.  The last reference to the kioctx is always the
> process, released in the exit_aio path, or via sys_io_destroy.  In both
> cases, we cancel all aios, then wait for them all to complete before
> dropping the final reference to the context.

That wouldn't appear to prevent a concurrent thread from doing an
io operation that requires ioctx lookup, and taking the last reference
after the io_cancel thread drops the ref.


> So, while I agree that what you wrote is better, I remain unconvinced of
> it solving a real-world problem.  Feel free to push it in as a cleanup,
> though.

Well I think it has to be technically correct first. If there is indeed a
guaranteed ref somehow, it just needs a comment.

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [patch] fs: aio fix rcu lookup
  2011-01-17 23:24       ` Nick Piggin
@ 2011-01-18 17:21           ` Jeff Moyer
  2011-01-18 19:01         ` Jan Kara
  1 sibling, 0 replies; 40+ messages in thread
From: Jeff Moyer @ 2011-01-18 17:21 UTC (permalink / raw)
  To: Nick Piggin; +Cc: Andrew Morton, linux-fsdevel, linux-kernel

Nick Piggin <npiggin@gmail.com> writes:

> On Tue, Jan 18, 2011 at 6:07 AM, Jeff Moyer <jmoyer@redhat.com> wrote:
>> Nick Piggin <npiggin@gmail.com> writes:
>>
>>> On Sat, Jan 15, 2011 at 1:52 AM, Jeff Moyer <jmoyer@redhat.com> wrote:
>>>> Nick Piggin <npiggin@gmail.com> writes:
>>>>
>>>>> Hi,
>>>>>
>>>>> While hunting down a bug in NFS's AIO, I believe I found this
>>>>> buggy code...
>>>>>
>>>>> fs: aio fix rcu ioctx lookup
>>>>>
>>>>> aio-dio-invalidate-failure GPFs in aio_put_req from io_submit.
>>>>>
>>>>> lookup_ioctx doesn't implement the rcu lookup pattern properly.
>>>>> rcu_read_lock does not prevent refcount going to zero, so we
>>>>> might take a refcount on a zero count ioctx.
>>>>
>>>> So, does this patch fix the problem?  You didn't actually say....
>>>
>>> No, it seemd to be an NFS AIO problem, although it was a
>>> slightly older kernel so I'll re test after -rc1 if I haven't heard
>>> back about it.
>>
>> OK.
>>
>>> Do you agree with the theoretical problem? I didn't try to
>>> write a racer to break it yet. Inserting a delay before the
>>> get_ioctx might do the trick.
>>
>> I'm not convinced, no.  The last reference to the kioctx is always the
>> process, released in the exit_aio path, or via sys_io_destroy.  In both
>> cases, we cancel all aios, then wait for them all to complete before
>> dropping the final reference to the context.
>
> That wouldn't appear to prevent a concurrent thread from doing an
> io operation that requires ioctx lookup, and taking the last reference
> after the io_cancel thread drops the ref.

io_cancel isn't of any concern here.  When io_setup is called, it
creates the ioctx and takes 2 references to it.  There are two paths to
destroying the ioctx: one is through process exit, the other is through
a call to sys_io_destroy.  The former means that you can't submit more
I/O anyway (which in turn means that there won't be any more lookups on
the ioctx), so I'll focus on the latter.

What you're asking about, then, is a race between lookup_ioctx and
io_destroy.  The first thing io_destroy does is to set ctx->dead to 1
and remove the ioctx from the list:

   spin_lock(&mm->ioctx_lock);
   was_dead = ioctx->dead;
   ioctx->dead = 1;
   hlist_del_rcu(&ioctx->list);
   spin_unlock(&mm->ioctx_lock);

   if (likely(!was_dead))
           put_ioctx(ioctx); /* twice for the list */

   aio_cancel_all(ioctx);
   wait_for_all_aios(ioctx);

   wake_up(&ioctx->wait);
   put_ioctx(ioctx);   /* once for the lookup */

The lookup code is this:

   rcu_read_lock();

   hlist_for_each_entry_rcu(ctx, n, &mm->ioctx_list, list) {
   if (ctx->user_id == ctx_id && !ctx->dead) {
           get_ioctx(ctx);
           ret = ctx;
           break;
...

In order for the race to occur, the lookup code would have to find the
ioctx on the list without the dead mark set.  Then, the io_destroy code
would have to do all of its work, including its two put_ioctx calls, and
finally the get_ioctx from the lookup would have to happen.

Possible?  Maybe.  It certainly isn't explicitly protected against.  Go
ahead and re-post the patch.  I agree that it's a theoretical race.  =)

Cheers,
Jeff

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [patch] fs: aio fix rcu lookup
@ 2011-01-18 17:21           ` Jeff Moyer
  0 siblings, 0 replies; 40+ messages in thread
From: Jeff Moyer @ 2011-01-18 17:21 UTC (permalink / raw)
  To: Nick Piggin; +Cc: Andrew Morton, linux-fsdevel, linux-kernel

Nick Piggin <npiggin@gmail.com> writes:

> On Tue, Jan 18, 2011 at 6:07 AM, Jeff Moyer <jmoyer@redhat.com> wrote:
>> Nick Piggin <npiggin@gmail.com> writes:
>>
>>> On Sat, Jan 15, 2011 at 1:52 AM, Jeff Moyer <jmoyer@redhat.com> wrote:
>>>> Nick Piggin <npiggin@gmail.com> writes:
>>>>
>>>>> Hi,
>>>>>
>>>>> While hunting down a bug in NFS's AIO, I believe I found this
>>>>> buggy code...
>>>>>
>>>>> fs: aio fix rcu ioctx lookup
>>>>>
>>>>> aio-dio-invalidate-failure GPFs in aio_put_req from io_submit.
>>>>>
>>>>> lookup_ioctx doesn't implement the rcu lookup pattern properly.
>>>>> rcu_read_lock does not prevent refcount going to zero, so we
>>>>> might take a refcount on a zero count ioctx.
>>>>
>>>> So, does this patch fix the problem?  You didn't actually say....
>>>
>>> No, it seemd to be an NFS AIO problem, although it was a
>>> slightly older kernel so I'll re test after -rc1 if I haven't heard
>>> back about it.
>>
>> OK.
>>
>>> Do you agree with the theoretical problem? I didn't try to
>>> write a racer to break it yet. Inserting a delay before the
>>> get_ioctx might do the trick.
>>
>> I'm not convinced, no.  The last reference to the kioctx is always the
>> process, released in the exit_aio path, or via sys_io_destroy.  In both
>> cases, we cancel all aios, then wait for them all to complete before
>> dropping the final reference to the context.
>
> That wouldn't appear to prevent a concurrent thread from doing an
> io operation that requires ioctx lookup, and taking the last reference
> after the io_cancel thread drops the ref.

io_cancel isn't of any concern here.  When io_setup is called, it
creates the ioctx and takes 2 references to it.  There are two paths to
destroying the ioctx: one is through process exit, the other is through
a call to sys_io_destroy.  The former means that you can't submit more
I/O anyway (which in turn means that there won't be any more lookups on
the ioctx), so I'll focus on the latter.

What you're asking about, then, is a race between lookup_ioctx and
io_destroy.  The first thing io_destroy does is to set ctx->dead to 1
and remove the ioctx from the list:

   spin_lock(&mm->ioctx_lock);
   was_dead = ioctx->dead;
   ioctx->dead = 1;
   hlist_del_rcu(&ioctx->list);
   spin_unlock(&mm->ioctx_lock);

   if (likely(!was_dead))
           put_ioctx(ioctx); /* twice for the list */

   aio_cancel_all(ioctx);
   wait_for_all_aios(ioctx);

   wake_up(&ioctx->wait);
   put_ioctx(ioctx);   /* once for the lookup */

The lookup code is this:

   rcu_read_lock();

   hlist_for_each_entry_rcu(ctx, n, &mm->ioctx_list, list) {
   if (ctx->user_id == ctx_id && !ctx->dead) {
           get_ioctx(ctx);
           ret = ctx;
           break;
...

In order for the race to occur, the lookup code would have to find the
ioctx on the list without the dead mark set.  Then, the io_destroy code
would have to do all of its work, including its two put_ioctx calls, and
finally the get_ioctx from the lookup would have to happen.

Possible?  Maybe.  It certainly isn't explicitly protected against.  Go
ahead and re-post the patch.  I agree that it's a theoretical race.  =)

Cheers,
Jeff
--
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [patch] fs: aio fix rcu lookup
  2011-01-17 23:24       ` Nick Piggin
  2011-01-18 17:21           ` Jeff Moyer
@ 2011-01-18 19:01         ` Jan Kara
  2011-01-18 22:17           ` Nick Piggin
  1 sibling, 1 reply; 40+ messages in thread
From: Jan Kara @ 2011-01-18 19:01 UTC (permalink / raw)
  To: Nick Piggin; +Cc: Jeff Moyer, Andrew Morton, linux-fsdevel, linux-kernel

  Hi,

On Tue 18-01-11 10:24:24, Nick Piggin wrote:
> On Tue, Jan 18, 2011 at 6:07 AM, Jeff Moyer <jmoyer@redhat.com> wrote:
> > Nick Piggin <npiggin@gmail.com> writes:
> >> Do you agree with the theoretical problem? I didn't try to
> >> write a racer to break it yet. Inserting a delay before the
> >> get_ioctx might do the trick.
> >
> > I'm not convinced, no.  The last reference to the kioctx is always the
> > process, released in the exit_aio path, or via sys_io_destroy.  In both
> > cases, we cancel all aios, then wait for them all to complete before
> > dropping the final reference to the context.
> 
> That wouldn't appear to prevent a concurrent thread from doing an
> io operation that requires ioctx lookup, and taking the last reference
> after the io_cancel thread drops the ref.
> 
> > So, while I agree that what you wrote is better, I remain unconvinced of
> > it solving a real-world problem.  Feel free to push it in as a cleanup,
> > though.
> 
> Well I think it has to be technically correct first. If there is indeed a
> guaranteed ref somehow, it just needs a comment.
  Hmm, the code in io_destroy() indeed looks fishy. We delete the ioctx
from the hash table and set ioctx->dead which is supposed to stop
lookup_ioctx() from finding it (see the !ctx->dead check in
lookup_ioctx()). There's even a comment in io_destroy() saying:
        /*
         * Wake up any waiters.  The setting of ctx->dead must be seen
         * by other CPUs at this point.  Right now, we rely on the
         * locking done by the above calls to ensure this consistency.
         */
But since lookup_ioctx() is called without any lock or barrier nothing
really seems to prevent the list traversal and ioctx->dead test to happen
before io_destroy() and get_ioctx() after io_destroy().

But wouldn't the right fix be to call synchronize_rcu() in io_destroy()?
Because with your fix we could still return 'dead' ioctx and I don't think
we are supposed to do that...

								Honza
-- 
Jan Kara <jack@suse.cz>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [patch] fs: aio fix rcu lookup
  2011-01-18 19:01         ` Jan Kara
@ 2011-01-18 22:17           ` Nick Piggin
  2011-01-18 23:00             ` Jeff Moyer
  2011-01-18 23:52               ` Jan Kara
  0 siblings, 2 replies; 40+ messages in thread
From: Nick Piggin @ 2011-01-18 22:17 UTC (permalink / raw)
  To: Jan Kara; +Cc: Jeff Moyer, Andrew Morton, linux-fsdevel, linux-kernel

On Wed, Jan 19, 2011 at 6:01 AM, Jan Kara <jack@suse.cz> wrote:
>  Hi,
>
> On Tue 18-01-11 10:24:24, Nick Piggin wrote:
>> On Tue, Jan 18, 2011 at 6:07 AM, Jeff Moyer <jmoyer@redhat.com> wrote:
>> > Nick Piggin <npiggin@gmail.com> writes:
>> >> Do you agree with the theoretical problem? I didn't try to
>> >> write a racer to break it yet. Inserting a delay before the
>> >> get_ioctx might do the trick.
>> >
>> > I'm not convinced, no.  The last reference to the kioctx is always the
>> > process, released in the exit_aio path, or via sys_io_destroy.  In both
>> > cases, we cancel all aios, then wait for them all to complete before
>> > dropping the final reference to the context.
>>
>> That wouldn't appear to prevent a concurrent thread from doing an
>> io operation that requires ioctx lookup, and taking the last reference
>> after the io_cancel thread drops the ref.
>>
>> > So, while I agree that what you wrote is better, I remain unconvinced of
>> > it solving a real-world problem.  Feel free to push it in as a cleanup,
>> > though.
>>
>> Well I think it has to be technically correct first. If there is indeed a
>> guaranteed ref somehow, it just needs a comment.
>  Hmm, the code in io_destroy() indeed looks fishy. We delete the ioctx
> from the hash table and set ioctx->dead which is supposed to stop
> lookup_ioctx() from finding it (see the !ctx->dead check in
> lookup_ioctx()). There's even a comment in io_destroy() saying:
>        /*
>         * Wake up any waiters.  The setting of ctx->dead must be seen
>         * by other CPUs at this point.  Right now, we rely on the
>         * locking done by the above calls to ensure this consistency.
>         */
> But since lookup_ioctx() is called without any lock or barrier nothing
> really seems to prevent the list traversal and ioctx->dead test to happen
> before io_destroy() and get_ioctx() after io_destroy().
>
> But wouldn't the right fix be to call synchronize_rcu() in io_destroy()?
> Because with your fix we could still return 'dead' ioctx and I don't think
> we are supposed to do that...

With my fix we won't oops, I was a bit concerned about ->dead,
yes but I don't know what semantics it is attempted to have there.

synchronize_rcu() in io_destroy()  does not prevent it from returning
as soon as lookup_ioctx drops the rcu_read_lock().

The dead=1 in io_destroy indeed doesn't guarantee a whole lot.
Anyone know?

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [patch] fs: aio fix rcu lookup
  2011-01-18 22:17           ` Nick Piggin
@ 2011-01-18 23:00             ` Jeff Moyer
  2011-01-18 23:05                 ` Nick Piggin
  2011-01-18 23:52               ` Jan Kara
  1 sibling, 1 reply; 40+ messages in thread
From: Jeff Moyer @ 2011-01-18 23:00 UTC (permalink / raw)
  To: Nick Piggin; +Cc: Jan Kara, Andrew Morton, linux-fsdevel, linux-kernel

Nick Piggin <npiggin@gmail.com> writes:

> On Wed, Jan 19, 2011 at 6:01 AM, Jan Kara <jack@suse.cz> wrote:
>>  Hi,
>>
>> On Tue 18-01-11 10:24:24, Nick Piggin wrote:
>>> On Tue, Jan 18, 2011 at 6:07 AM, Jeff Moyer <jmoyer@redhat.com> wrote:
>>> > Nick Piggin <npiggin@gmail.com> writes:
>>> >> Do you agree with the theoretical problem? I didn't try to
>>> >> write a racer to break it yet. Inserting a delay before the
>>> >> get_ioctx might do the trick.
>>> >
>>> > I'm not convinced, no.  The last reference to the kioctx is always the
>>> > process, released in the exit_aio path, or via sys_io_destroy.  In both
>>> > cases, we cancel all aios, then wait for them all to complete before
>>> > dropping the final reference to the context.
>>>
>>> That wouldn't appear to prevent a concurrent thread from doing an
>>> io operation that requires ioctx lookup, and taking the last reference
>>> after the io_cancel thread drops the ref.
>>>
>>> > So, while I agree that what you wrote is better, I remain unconvinced of
>>> > it solving a real-world problem.  Feel free to push it in as a cleanup,
>>> > though.
>>>
>>> Well I think it has to be technically correct first. If there is indeed a
>>> guaranteed ref somehow, it just needs a comment.
>>  Hmm, the code in io_destroy() indeed looks fishy. We delete the ioctx
>> from the hash table and set ioctx->dead which is supposed to stop
>> lookup_ioctx() from finding it (see the !ctx->dead check in
>> lookup_ioctx()). There's even a comment in io_destroy() saying:
>>        /*
>>         * Wake up any waiters.  The setting of ctx->dead must be seen
>>         * by other CPUs at this point.  Right now, we rely on the
>>         * locking done by the above calls to ensure this consistency.
>>         */
>> But since lookup_ioctx() is called without any lock or barrier nothing
>> really seems to prevent the list traversal and ioctx->dead test to happen
>> before io_destroy() and get_ioctx() after io_destroy().
>>
>> But wouldn't the right fix be to call synchronize_rcu() in io_destroy()?
>> Because with your fix we could still return 'dead' ioctx and I don't think
>> we are supposed to do that...
>
> With my fix we won't oops, I was a bit concerned about ->dead,
> yes but I don't know what semantics it is attempted to have there.
>
> synchronize_rcu() in io_destroy()  does not prevent it from returning
> as soon as lookup_ioctx drops the rcu_read_lock().
>
> The dead=1 in io_destroy indeed doesn't guarantee a whole lot.
> Anyone know?

See the comment above io_destroy for starters.  Note that rcu was
bolted on later, and I believe that ->dead has nothing to do with the
rcu-ification.

Cheers,
Jeff

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [patch] fs: aio fix rcu lookup
  2011-01-18 23:00             ` Jeff Moyer
@ 2011-01-18 23:05                 ` Nick Piggin
  0 siblings, 0 replies; 40+ messages in thread
From: Nick Piggin @ 2011-01-18 23:05 UTC (permalink / raw)
  To: Jeff Moyer; +Cc: Jan Kara, Andrew Morton, linux-fsdevel, linux-kernel

On Wed, Jan 19, 2011 at 10:00 AM, Jeff Moyer <jmoyer@redhat.com> wrote:
> Nick Piggin <npiggin@gmail.com> writes:
>
>> On Wed, Jan 19, 2011 at 6:01 AM, Jan Kara <jack@suse.cz> wrote:
>>>  Hi,
>>>
>>> On Tue 18-01-11 10:24:24, Nick Piggin wrote:
>>>> On Tue, Jan 18, 2011 at 6:07 AM, Jeff Moyer <jmoyer@redhat.com> wrote:
>>>> > Nick Piggin <npiggin@gmail.com> writes:
>>>> >> Do you agree with the theoretical problem? I didn't try to
>>>> >> write a racer to break it yet. Inserting a delay before the
>>>> >> get_ioctx might do the trick.
>>>> >
>>>> > I'm not convinced, no.  The last reference to the kioctx is always the
>>>> > process, released in the exit_aio path, or via sys_io_destroy.  In both
>>>> > cases, we cancel all aios, then wait for them all to complete before
>>>> > dropping the final reference to the context.
>>>>
>>>> That wouldn't appear to prevent a concurrent thread from doing an
>>>> io operation that requires ioctx lookup, and taking the last reference
>>>> after the io_cancel thread drops the ref.
>>>>
>>>> > So, while I agree that what you wrote is better, I remain unconvinced of
>>>> > it solving a real-world problem.  Feel free to push it in as a cleanup,
>>>> > though.
>>>>
>>>> Well I think it has to be technically correct first. If there is indeed a
>>>> guaranteed ref somehow, it just needs a comment.
>>>  Hmm, the code in io_destroy() indeed looks fishy. We delete the ioctx
>>> from the hash table and set ioctx->dead which is supposed to stop
>>> lookup_ioctx() from finding it (see the !ctx->dead check in
>>> lookup_ioctx()). There's even a comment in io_destroy() saying:
>>>        /*
>>>         * Wake up any waiters.  The setting of ctx->dead must be seen
>>>         * by other CPUs at this point.  Right now, we rely on the
>>>         * locking done by the above calls to ensure this consistency.
>>>         */
>>> But since lookup_ioctx() is called without any lock or barrier nothing
>>> really seems to prevent the list traversal and ioctx->dead test to happen
>>> before io_destroy() and get_ioctx() after io_destroy().
>>>
>>> But wouldn't the right fix be to call synchronize_rcu() in io_destroy()?
>>> Because with your fix we could still return 'dead' ioctx and I don't think
>>> we are supposed to do that...
>>
>> With my fix we won't oops, I was a bit concerned about ->dead,
>> yes but I don't know what semantics it is attempted to have there.
>>
>> synchronize_rcu() in io_destroy()  does not prevent it from returning
>> as soon as lookup_ioctx drops the rcu_read_lock().
>>
>> The dead=1 in io_destroy indeed doesn't guarantee a whole lot.
>> Anyone know?
>
> See the comment above io_destroy for starters.  Note that rcu was
> bolted on later, and I believe that ->dead has nothing to do with the
> rcu-ification.

Ah, yep, that makes sense indeed.

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [patch] fs: aio fix rcu lookup
@ 2011-01-18 23:05                 ` Nick Piggin
  0 siblings, 0 replies; 40+ messages in thread
From: Nick Piggin @ 2011-01-18 23:05 UTC (permalink / raw)
  To: Jeff Moyer; +Cc: Jan Kara, Andrew Morton, linux-fsdevel, linux-kernel

On Wed, Jan 19, 2011 at 10:00 AM, Jeff Moyer <jmoyer@redhat.com> wrote:
> Nick Piggin <npiggin@gmail.com> writes:
>
>> On Wed, Jan 19, 2011 at 6:01 AM, Jan Kara <jack@suse.cz> wrote:
>>>  Hi,
>>>
>>> On Tue 18-01-11 10:24:24, Nick Piggin wrote:
>>>> On Tue, Jan 18, 2011 at 6:07 AM, Jeff Moyer <jmoyer@redhat.com> wrote:
>>>> > Nick Piggin <npiggin@gmail.com> writes:
>>>> >> Do you agree with the theoretical problem? I didn't try to
>>>> >> write a racer to break it yet. Inserting a delay before the
>>>> >> get_ioctx might do the trick.
>>>> >
>>>> > I'm not convinced, no.  The last reference to the kioctx is always the
>>>> > process, released in the exit_aio path, or via sys_io_destroy.  In both
>>>> > cases, we cancel all aios, then wait for them all to complete before
>>>> > dropping the final reference to the context.
>>>>
>>>> That wouldn't appear to prevent a concurrent thread from doing an
>>>> io operation that requires ioctx lookup, and taking the last reference
>>>> after the io_cancel thread drops the ref.
>>>>
>>>> > So, while I agree that what you wrote is better, I remain unconvinced of
>>>> > it solving a real-world problem.  Feel free to push it in as a cleanup,
>>>> > though.
>>>>
>>>> Well I think it has to be technically correct first. If there is indeed a
>>>> guaranteed ref somehow, it just needs a comment.
>>>  Hmm, the code in io_destroy() indeed looks fishy. We delete the ioctx
>>> from the hash table and set ioctx->dead which is supposed to stop
>>> lookup_ioctx() from finding it (see the !ctx->dead check in
>>> lookup_ioctx()). There's even a comment in io_destroy() saying:
>>>        /*
>>>         * Wake up any waiters.  The setting of ctx->dead must be seen
>>>         * by other CPUs at this point.  Right now, we rely on the
>>>         * locking done by the above calls to ensure this consistency.
>>>         */
>>> But since lookup_ioctx() is called without any lock or barrier nothing
>>> really seems to prevent the list traversal and ioctx->dead test to happen
>>> before io_destroy() and get_ioctx() after io_destroy().
>>>
>>> But wouldn't the right fix be to call synchronize_rcu() in io_destroy()?
>>> Because with your fix we could still return 'dead' ioctx and I don't think
>>> we are supposed to do that...
>>
>> With my fix we won't oops, I was a bit concerned about ->dead,
>> yes but I don't know what semantics it is attempted to have there.
>>
>> synchronize_rcu() in io_destroy()  does not prevent it from returning
>> as soon as lookup_ioctx drops the rcu_read_lock().
>>
>> The dead=1 in io_destroy indeed doesn't guarantee a whole lot.
>> Anyone know?
>
> See the comment above io_destroy for starters.  Note that rcu was
> bolted on later, and I believe that ->dead has nothing to do with the
> rcu-ification.

Ah, yep, that makes sense indeed.
--
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [patch] fs: aio fix rcu lookup
  2011-01-18 22:17           ` Nick Piggin
@ 2011-01-18 23:52               ` Jan Kara
  2011-01-18 23:52               ` Jan Kara
  1 sibling, 0 replies; 40+ messages in thread
From: Jan Kara @ 2011-01-18 23:52 UTC (permalink / raw)
  To: Nick Piggin
  Cc: Jan Kara, Jeff Moyer, Andrew Morton, linux-fsdevel, linux-kernel

On Wed 19-01-11 09:17:23, Nick Piggin wrote:
> On Wed, Jan 19, 2011 at 6:01 AM, Jan Kara <jack@suse.cz> wrote:
> > On Tue 18-01-11 10:24:24, Nick Piggin wrote:
> >> On Tue, Jan 18, 2011 at 6:07 AM, Jeff Moyer <jmoyer@redhat.com> wrote:
> >> > Nick Piggin <npiggin@gmail.com> writes:
> >> >> Do you agree with the theoretical problem? I didn't try to
> >> >> write a racer to break it yet. Inserting a delay before the
> >> >> get_ioctx might do the trick.
> >> >
> >> > I'm not convinced, no.  The last reference to the kioctx is always the
> >> > process, released in the exit_aio path, or via sys_io_destroy.  In both
> >> > cases, we cancel all aios, then wait for them all to complete before
> >> > dropping the final reference to the context.
> >>
> >> That wouldn't appear to prevent a concurrent thread from doing an
> >> io operation that requires ioctx lookup, and taking the last reference
> >> after the io_cancel thread drops the ref.
> >>
> >> > So, while I agree that what you wrote is better, I remain unconvinced of
> >> > it solving a real-world problem.  Feel free to push it in as a cleanup,
> >> > though.
> >>
> >> Well I think it has to be technically correct first. If there is indeed a
> >> guaranteed ref somehow, it just needs a comment.
> >  Hmm, the code in io_destroy() indeed looks fishy. We delete the ioctx
> > from the hash table and set ioctx->dead which is supposed to stop
> > lookup_ioctx() from finding it (see the !ctx->dead check in
> > lookup_ioctx()). There's even a comment in io_destroy() saying:
> >        /*
> >         * Wake up any waiters.  The setting of ctx->dead must be seen
> >         * by other CPUs at this point.  Right now, we rely on the
> >         * locking done by the above calls to ensure this consistency.
> >         */
> > But since lookup_ioctx() is called without any lock or barrier nothing
> > really seems to prevent the list traversal and ioctx->dead test to happen
> > before io_destroy() and get_ioctx() after io_destroy().
> >
> > But wouldn't the right fix be to call synchronize_rcu() in io_destroy()?
> > Because with your fix we could still return 'dead' ioctx and I don't think
> > we are supposed to do that...
> 
> With my fix we won't oops, I was a bit concerned about ->dead,
> yes but I don't know what semantics it is attempted to have there.
  But wouldn't it do something bad if the memory gets reallocated for
something else and set to non-zero? E.g. memory corruption?

> synchronize_rcu() in io_destroy()  does not prevent it from returning
> as soon as lookup_ioctx drops the rcu_read_lock().
  Yes, exactly. So references obtained before synchronize_rcu() would be
completely fine and valid references and there would be no references after
synchronize_rcu() because they'd see 'dead' set. But looking at the code
again it still would not be enough because we could still race with
io_submit_one() adding new IO to the ioctx which will be freed just after
put_ioctx() in do_io_submit().

The patch below implements what I have in mind - it should be probably
split into two but I'd like to hear comments on that before doing these
cosmetic touches ;)

								Honza
-- 
Jan Kara <jack@suse.cz>
SUSE Labs, CR
---

>From 7a662553a44db06381d405a76426855b5507c61d Mon Sep 17 00:00:00 2001
From: Jan Kara <jack@suse.cz>
Date: Wed, 19 Jan 2011 00:37:48 +0100
Subject: [PATCH] fs: Fix ioctx lookup races with io_destroy() in AIO

AIO code doesn't implement the rcu lookup pattern properly.
rcu_read_lock in lookup_ioctx() does not prevent refcount
going to zero, so we might take a refcount on a zero count
(possibly already freed) ioctx:

 CPU1						CPU2
lookup_ioctx()
  hlist_for_each_entry_rcu()
    if (ctx->user_id == ctx_id && !ctx->dead)
						io_destroy()
						  free ioctx
      get_ioctx(ctx);
      and return freed memory to innocent caller

Close this race by calling synchronize_rcu() in io_destroy().

Another race occurs when io_submit() races with io_destroy():

 CPU1						CPU2
io_submit()
  do_io_submit()
    ...
    ctx = lookup_ioctx(ctx_id);
						io_destroy()
    Now do_io_submit() holds the last reference to ctx.
    ...
    queue new AIO
    put_ioctx(ctx) - frees ctx with active AIOs

We solve this issue by checking whether ctx is being destroyed
in AIO submission path after adding new AIO to ctx. Then we
are guaranteed that either io_destroy() waits for new AIO or
we see that ctx is being destroyed and bail out.

Signed-off-by: Jan Kara <jack@suse.cz>
---
 fs/aio.c |   21 +++++++++++++++++++++
 1 files changed, 21 insertions(+), 0 deletions(-)

diff --git a/fs/aio.c b/fs/aio.c
index 8c8f6c5..59ac692 100644
--- a/fs/aio.c
+++ b/fs/aio.c
@@ -1225,6 +1225,17 @@ static void io_destroy(struct kioctx *ioctx)
 	if (likely(!was_dead))
 		put_ioctx(ioctx);	/* twice for the list */
 
+	/*
+	 * After this ioctx cannot be looked up because ioctx->dead is set.
+	 * But there could be still io_submit() running...
+	 */
+	synchronize_rcu();
+
+	/*
+	 * ... but once these functions finish, the io has been either
+	 * cancelled (if aio_get_req() has already run) or the ctx->dead
+	 * check in io_submit_one() triggers and no io is submitted.
+	 */
 	aio_cancel_all(ioctx);
 	wait_for_all_aios(ioctx);
 
@@ -1646,6 +1657,16 @@ static int io_submit_one(struct kioctx *ctx, struct iocb __user *user_iocb,
 		goto out_put_req;
 
 	spin_lock_irq(&ctx->ctx_lock);
+	/*
+	 * Raced with io_destroy()? See comments in io_destroy() for details.
+	 * The check is inside ctx->ctx_lock to avoid extra memory barrier
+	 * in this fast path...
+	 */
+	if (ctx->dead) {
+		spin_unlock_irq(&ctx->ctx_lock);
+		ret = -EINVAL;
+		goto out_put_req;
+	}
 	aio_run_iocb(req);
 	if (!list_empty(&ctx->run_list)) {
 		/* drain the run list */
-- 
1.7.1


^ permalink raw reply related	[flat|nested] 40+ messages in thread

* Re: [patch] fs: aio fix rcu lookup
@ 2011-01-18 23:52               ` Jan Kara
  0 siblings, 0 replies; 40+ messages in thread
From: Jan Kara @ 2011-01-18 23:52 UTC (permalink / raw)
  To: Nick Piggin
  Cc: Jan Kara, Jeff Moyer, Andrew Morton, linux-fsdevel, linux-kernel

On Wed 19-01-11 09:17:23, Nick Piggin wrote:
> On Wed, Jan 19, 2011 at 6:01 AM, Jan Kara <jack@suse.cz> wrote:
> > On Tue 18-01-11 10:24:24, Nick Piggin wrote:
> >> On Tue, Jan 18, 2011 at 6:07 AM, Jeff Moyer <jmoyer@redhat.com> wrote:
> >> > Nick Piggin <npiggin@gmail.com> writes:
> >> >> Do you agree with the theoretical problem? I didn't try to
> >> >> write a racer to break it yet. Inserting a delay before the
> >> >> get_ioctx might do the trick.
> >> >
> >> > I'm not convinced, no.  The last reference to the kioctx is always the
> >> > process, released in the exit_aio path, or via sys_io_destroy.  In both
> >> > cases, we cancel all aios, then wait for them all to complete before
> >> > dropping the final reference to the context.
> >>
> >> That wouldn't appear to prevent a concurrent thread from doing an
> >> io operation that requires ioctx lookup, and taking the last reference
> >> after the io_cancel thread drops the ref.
> >>
> >> > So, while I agree that what you wrote is better, I remain unconvinced of
> >> > it solving a real-world problem.  Feel free to push it in as a cleanup,
> >> > though.
> >>
> >> Well I think it has to be technically correct first. If there is indeed a
> >> guaranteed ref somehow, it just needs a comment.
> >  Hmm, the code in io_destroy() indeed looks fishy. We delete the ioctx
> > from the hash table and set ioctx->dead which is supposed to stop
> > lookup_ioctx() from finding it (see the !ctx->dead check in
> > lookup_ioctx()). There's even a comment in io_destroy() saying:
> >        /*
> >         * Wake up any waiters.  The setting of ctx->dead must be seen
> >         * by other CPUs at this point.  Right now, we rely on the
> >         * locking done by the above calls to ensure this consistency.
> >         */
> > But since lookup_ioctx() is called without any lock or barrier nothing
> > really seems to prevent the list traversal and ioctx->dead test to happen
> > before io_destroy() and get_ioctx() after io_destroy().
> >
> > But wouldn't the right fix be to call synchronize_rcu() in io_destroy()?
> > Because with your fix we could still return 'dead' ioctx and I don't think
> > we are supposed to do that...
> 
> With my fix we won't oops, I was a bit concerned about ->dead,
> yes but I don't know what semantics it is attempted to have there.
  But wouldn't it do something bad if the memory gets reallocated for
something else and set to non-zero? E.g. memory corruption?

> synchronize_rcu() in io_destroy()  does not prevent it from returning
> as soon as lookup_ioctx drops the rcu_read_lock().
  Yes, exactly. So references obtained before synchronize_rcu() would be
completely fine and valid references and there would be no references after
synchronize_rcu() because they'd see 'dead' set. But looking at the code
again it still would not be enough because we could still race with
io_submit_one() adding new IO to the ioctx which will be freed just after
put_ioctx() in do_io_submit().

The patch below implements what I have in mind - it should be probably
split into two but I'd like to hear comments on that before doing these
cosmetic touches ;)

								Honza
-- 
Jan Kara <jack@suse.cz>
SUSE Labs, CR
---

From 7a662553a44db06381d405a76426855b5507c61d Mon Sep 17 00:00:00 2001
From: Jan Kara <jack@suse.cz>
Date: Wed, 19 Jan 2011 00:37:48 +0100
Subject: [PATCH] fs: Fix ioctx lookup races with io_destroy() in AIO

AIO code doesn't implement the rcu lookup pattern properly.
rcu_read_lock in lookup_ioctx() does not prevent refcount
going to zero, so we might take a refcount on a zero count
(possibly already freed) ioctx:

 CPU1						CPU2
lookup_ioctx()
  hlist_for_each_entry_rcu()
    if (ctx->user_id == ctx_id && !ctx->dead)
						io_destroy()
						  free ioctx
      get_ioctx(ctx);
      and return freed memory to innocent caller

Close this race by calling synchronize_rcu() in io_destroy().

Another race occurs when io_submit() races with io_destroy():

 CPU1						CPU2
io_submit()
  do_io_submit()
    ...
    ctx = lookup_ioctx(ctx_id);
						io_destroy()
    Now do_io_submit() holds the last reference to ctx.
    ...
    queue new AIO
    put_ioctx(ctx) - frees ctx with active AIOs

We solve this issue by checking whether ctx is being destroyed
in AIO submission path after adding new AIO to ctx. Then we
are guaranteed that either io_destroy() waits for new AIO or
we see that ctx is being destroyed and bail out.

Signed-off-by: Jan Kara <jack@suse.cz>
---
 fs/aio.c |   21 +++++++++++++++++++++
 1 files changed, 21 insertions(+), 0 deletions(-)

diff --git a/fs/aio.c b/fs/aio.c
index 8c8f6c5..59ac692 100644
--- a/fs/aio.c
+++ b/fs/aio.c
@@ -1225,6 +1225,17 @@ static void io_destroy(struct kioctx *ioctx)
 	if (likely(!was_dead))
 		put_ioctx(ioctx);	/* twice for the list */
 
+	/*
+	 * After this ioctx cannot be looked up because ioctx->dead is set.
+	 * But there could be still io_submit() running...
+	 */
+	synchronize_rcu();
+
+	/*
+	 * ... but once these functions finish, the io has been either
+	 * cancelled (if aio_get_req() has already run) or the ctx->dead
+	 * check in io_submit_one() triggers and no io is submitted.
+	 */
 	aio_cancel_all(ioctx);
 	wait_for_all_aios(ioctx);
 
@@ -1646,6 +1657,16 @@ static int io_submit_one(struct kioctx *ctx, struct iocb __user *user_iocb,
 		goto out_put_req;
 
 	spin_lock_irq(&ctx->ctx_lock);
+	/*
+	 * Raced with io_destroy()? See comments in io_destroy() for details.
+	 * The check is inside ctx->ctx_lock to avoid extra memory barrier
+	 * in this fast path...
+	 */
+	if (ctx->dead) {
+		spin_unlock_irq(&ctx->ctx_lock);
+		ret = -EINVAL;
+		goto out_put_req;
+	}
 	aio_run_iocb(req);
 	if (!list_empty(&ctx->run_list)) {
 		/* drain the run list */
-- 
1.7.1

^ permalink raw reply related	[flat|nested] 40+ messages in thread

* Re: [patch] fs: aio fix rcu lookup
  2011-01-18 23:52               ` Jan Kara
  (?)
@ 2011-01-19  0:20               ` Nick Piggin
  2011-01-19 13:21                 ` Jan Kara
  -1 siblings, 1 reply; 40+ messages in thread
From: Nick Piggin @ 2011-01-19  0:20 UTC (permalink / raw)
  To: Jan Kara; +Cc: Jeff Moyer, Andrew Morton, linux-fsdevel, linux-kernel

On Wed, Jan 19, 2011 at 10:52 AM, Jan Kara <jack@suse.cz> wrote:
> On Wed 19-01-11 09:17:23, Nick Piggin wrote:
>> On Wed, Jan 19, 2011 at 6:01 AM, Jan Kara <jack@suse.cz> wrote:
>> > On Tue 18-01-11 10:24:24, Nick Piggin wrote:
>> >> On Tue, Jan 18, 2011 at 6:07 AM, Jeff Moyer <jmoyer@redhat.com> wrote:
>> >> > Nick Piggin <npiggin@gmail.com> writes:
>> >> >> Do you agree with the theoretical problem? I didn't try to
>> >> >> write a racer to break it yet. Inserting a delay before the
>> >> >> get_ioctx might do the trick.
>> >> >
>> >> > I'm not convinced, no.  The last reference to the kioctx is always the
>> >> > process, released in the exit_aio path, or via sys_io_destroy.  In both
>> >> > cases, we cancel all aios, then wait for them all to complete before
>> >> > dropping the final reference to the context.
>> >>
>> >> That wouldn't appear to prevent a concurrent thread from doing an
>> >> io operation that requires ioctx lookup, and taking the last reference
>> >> after the io_cancel thread drops the ref.
>> >>
>> >> > So, while I agree that what you wrote is better, I remain unconvinced of
>> >> > it solving a real-world problem.  Feel free to push it in as a cleanup,
>> >> > though.
>> >>
>> >> Well I think it has to be technically correct first. If there is indeed a
>> >> guaranteed ref somehow, it just needs a comment.
>> >  Hmm, the code in io_destroy() indeed looks fishy. We delete the ioctx
>> > from the hash table and set ioctx->dead which is supposed to stop
>> > lookup_ioctx() from finding it (see the !ctx->dead check in
>> > lookup_ioctx()). There's even a comment in io_destroy() saying:
>> >        /*
>> >         * Wake up any waiters.  The setting of ctx->dead must be seen
>> >         * by other CPUs at this point.  Right now, we rely on the
>> >         * locking done by the above calls to ensure this consistency.
>> >         */
>> > But since lookup_ioctx() is called without any lock or barrier nothing
>> > really seems to prevent the list traversal and ioctx->dead test to happen
>> > before io_destroy() and get_ioctx() after io_destroy().
>> >
>> > But wouldn't the right fix be to call synchronize_rcu() in io_destroy()?
>> > Because with your fix we could still return 'dead' ioctx and I don't think
>> > we are supposed to do that...
>>
>> With my fix we won't oops, I was a bit concerned about ->dead,
>> yes but I don't know what semantics it is attempted to have there.
>  But wouldn't it do something bad if the memory gets reallocated for
> something else and set to non-zero? E.g. memory corruption?

I don't see how it would with my patch.


>> synchronize_rcu() in io_destroy()  does not prevent it from returning
>> as soon as lookup_ioctx drops the rcu_read_lock().
>  Yes, exactly. So references obtained before synchronize_rcu() would be
> completely fine and valid references and there would be no references after
> synchronize_rcu() because they'd see 'dead' set. But looking at the code
> again it still would not be enough because we could still race with
> io_submit_one() adding new IO to the ioctx which will be freed just after
> put_ioctx() in do_io_submit().
>
> The patch below implements what I have in mind - it should be probably
> split into two but I'd like to hear comments on that before doing these
> cosmetic touches ;)

Well this seems to solve it too, but it is 2 problems here. It is changing
the semantics of io_destroy which requires the big synchronize_rcu()
hammer.

But I don't believe that's necessarily desirable, or required. In fact it is
explicitly not reuired because it only says that it _may_ cancel outstanding
requests.

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [patch] fs: aio fix rcu lookup
  2011-01-19  0:20               ` Nick Piggin
@ 2011-01-19 13:21                 ` Jan Kara
  2011-01-19 16:03                   ` Nick Piggin
  2011-01-19 19:13                   ` Jeff Moyer
  0 siblings, 2 replies; 40+ messages in thread
From: Jan Kara @ 2011-01-19 13:21 UTC (permalink / raw)
  To: Nick Piggin
  Cc: Jan Kara, Jeff Moyer, Andrew Morton, linux-fsdevel, linux-kernel

On Wed 19-01-11 11:20:45, Nick Piggin wrote:
> On Wed, Jan 19, 2011 at 10:52 AM, Jan Kara <jack@suse.cz> wrote:
> > On Wed 19-01-11 09:17:23, Nick Piggin wrote:
> >> On Wed, Jan 19, 2011 at 6:01 AM, Jan Kara <jack@suse.cz> wrote:
> >> > On Tue 18-01-11 10:24:24, Nick Piggin wrote:
> >> >> On Tue, Jan 18, 2011 at 6:07 AM, Jeff Moyer <jmoyer@redhat.com> wrote:
> >> >> > Nick Piggin <npiggin@gmail.com> writes:
> >> >> >> Do you agree with the theoretical problem? I didn't try to
> >> >> >> write a racer to break it yet. Inserting a delay before the
> >> >> >> get_ioctx might do the trick.
> >> >> >
> >> >> > I'm not convinced, no.  The last reference to the kioctx is always the
> >> >> > process, released in the exit_aio path, or via sys_io_destroy.  In both
> >> >> > cases, we cancel all aios, then wait for them all to complete before
> >> >> > dropping the final reference to the context.
> >> >>
> >> >> That wouldn't appear to prevent a concurrent thread from doing an
> >> >> io operation that requires ioctx lookup, and taking the last reference
> >> >> after the io_cancel thread drops the ref.
> >> >>
> >> >> > So, while I agree that what you wrote is better, I remain unconvinced of
> >> >> > it solving a real-world problem.  Feel free to push it in as a cleanup,
> >> >> > though.
> >> >>
> >> >> Well I think it has to be technically correct first. If there is indeed a
> >> >> guaranteed ref somehow, it just needs a comment.
> >> >  Hmm, the code in io_destroy() indeed looks fishy. We delete the ioctx
> >> > from the hash table and set ioctx->dead which is supposed to stop
> >> > lookup_ioctx() from finding it (see the !ctx->dead check in
> >> > lookup_ioctx()). There's even a comment in io_destroy() saying:
> >> >        /*
> >> >         * Wake up any waiters.  The setting of ctx->dead must be seen
> >> >         * by other CPUs at this point.  Right now, we rely on the
> >> >         * locking done by the above calls to ensure this consistency.
> >> >         */
> >> > But since lookup_ioctx() is called without any lock or barrier nothing
> >> > really seems to prevent the list traversal and ioctx->dead test to happen
> >> > before io_destroy() and get_ioctx() after io_destroy().
> >> >
> >> > But wouldn't the right fix be to call synchronize_rcu() in io_destroy()?
> >> > Because with your fix we could still return 'dead' ioctx and I don't think
> >> > we are supposed to do that...
> >>
> >> With my fix we won't oops, I was a bit concerned about ->dead,
> >> yes but I don't know what semantics it is attempted to have there.
> >  But wouldn't it do something bad if the memory gets reallocated for
> > something else and set to non-zero? E.g. memory corruption?
> 
> I don't see how it would with my patch.
  Ah, you are right. Since the whole structure gets freed only after the RCU
period expires, we are guaranteed to see 0 in ctx->users until we drop
rcu_read_lock. Maybe a comment like the above would be useful at the place
where you use try_get_ioctx() but your patch works.

> >> synchronize_rcu() in io_destroy()  does not prevent it from returning
> >> as soon as lookup_ioctx drops the rcu_read_lock().
> >  Yes, exactly. So references obtained before synchronize_rcu() would be
> > completely fine and valid references and there would be no references after
> > synchronize_rcu() because they'd see 'dead' set. But looking at the code
> > again it still would not be enough because we could still race with
> > io_submit_one() adding new IO to the ioctx which will be freed just after
> > put_ioctx() in do_io_submit().
> >
> > The patch below implements what I have in mind - it should be probably
> > split into two but I'd like to hear comments on that before doing these
> > cosmetic touches ;)
> 
> Well this seems to solve it too, but it is 2 problems here. It is changing
> the semantics of io_destroy which requires the big synchronize_rcu()
> hammer.
> 
> But I don't believe that's necessarily desirable, or required. In fact it is
> explicitly not reuired because it only says that it _may_ cancel outstanding
> requests.
  Well, we are not required to cancel all the outstanding AIO because of the
API requirement, that's granted. But we must do it because of the way how
the code is written. Outstanding IO requests reference ioctx but they are
not counted in ctx->users but in ctx->reqs_active. So the code relies on
the fact that the reference held by the hash table protects ctx from being
freed and io_destroy() waits for requests before dropping the last
reference to ctx. But there's the second race I describe making it possible
for new IO to be created after io_destroy() has waited for all IO to
finish...

								Honza
-- 
Jan Kara <jack@suse.cz>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [patch] fs: aio fix rcu lookup
  2011-01-19 13:21                 ` Jan Kara
@ 2011-01-19 16:03                   ` Nick Piggin
  2011-01-19 16:50                     ` Jan Kara
  2011-01-19 19:13                   ` Jeff Moyer
  1 sibling, 1 reply; 40+ messages in thread
From: Nick Piggin @ 2011-01-19 16:03 UTC (permalink / raw)
  To: Jan Kara; +Cc: Jeff Moyer, Andrew Morton, linux-fsdevel, linux-kernel

On Thu, Jan 20, 2011 at 12:21 AM, Jan Kara <jack@suse.cz> wrote:
> On Wed 19-01-11 11:20:45, Nick Piggin wrote:
>> On Wed, Jan 19, 2011 at 10:52 AM, Jan Kara <jack@suse.cz> wrote:
>> > On Wed 19-01-11 09:17:23, Nick Piggin wrote:
>> >> On Wed, Jan 19, 2011 at 6:01 AM, Jan Kara <jack@suse.cz> wrote:
>> >> > On Tue 18-01-11 10:24:24, Nick Piggin wrote:
>> >> >> On Tue, Jan 18, 2011 at 6:07 AM, Jeff Moyer <jmoyer@redhat.com> wrote:
>> >> >> > Nick Piggin <npiggin@gmail.com> writes:
>> >> >> >> Do you agree with the theoretical problem? I didn't try to
>> >> >> >> write a racer to break it yet. Inserting a delay before the
>> >> >> >> get_ioctx might do the trick.
>> >> >> >
>> >> >> > I'm not convinced, no.  The last reference to the kioctx is always the
>> >> >> > process, released in the exit_aio path, or via sys_io_destroy.  In both
>> >> >> > cases, we cancel all aios, then wait for them all to complete before
>> >> >> > dropping the final reference to the context.
>> >> >>
>> >> >> That wouldn't appear to prevent a concurrent thread from doing an
>> >> >> io operation that requires ioctx lookup, and taking the last reference
>> >> >> after the io_cancel thread drops the ref.
>> >> >>
>> >> >> > So, while I agree that what you wrote is better, I remain unconvinced of
>> >> >> > it solving a real-world problem.  Feel free to push it in as a cleanup,
>> >> >> > though.
>> >> >>
>> >> >> Well I think it has to be technically correct first. If there is indeed a
>> >> >> guaranteed ref somehow, it just needs a comment.
>> >> >  Hmm, the code in io_destroy() indeed looks fishy. We delete the ioctx
>> >> > from the hash table and set ioctx->dead which is supposed to stop
>> >> > lookup_ioctx() from finding it (see the !ctx->dead check in
>> >> > lookup_ioctx()). There's even a comment in io_destroy() saying:
>> >> >        /*
>> >> >         * Wake up any waiters.  The setting of ctx->dead must be seen
>> >> >         * by other CPUs at this point.  Right now, we rely on the
>> >> >         * locking done by the above calls to ensure this consistency.
>> >> >         */
>> >> > But since lookup_ioctx() is called without any lock or barrier nothing
>> >> > really seems to prevent the list traversal and ioctx->dead test to happen
>> >> > before io_destroy() and get_ioctx() after io_destroy().
>> >> >
>> >> > But wouldn't the right fix be to call synchronize_rcu() in io_destroy()?
>> >> > Because with your fix we could still return 'dead' ioctx and I don't think
>> >> > we are supposed to do that...
>> >>
>> >> With my fix we won't oops, I was a bit concerned about ->dead,
>> >> yes but I don't know what semantics it is attempted to have there.
>> >  But wouldn't it do something bad if the memory gets reallocated for
>> > something else and set to non-zero? E.g. memory corruption?
>>
>> I don't see how it would with my patch.
>  Ah, you are right. Since the whole structure gets freed only after the RCU
> period expires, we are guaranteed to see 0 in ctx->users until we drop
> rcu_read_lock. Maybe a comment like the above would be useful at the place
> where you use try_get_ioctx() but your patch works.
>
>> >> synchronize_rcu() in io_destroy()  does not prevent it from returning
>> >> as soon as lookup_ioctx drops the rcu_read_lock().
>> >  Yes, exactly. So references obtained before synchronize_rcu() would be
>> > completely fine and valid references and there would be no references after
>> > synchronize_rcu() because they'd see 'dead' set. But looking at the code
>> > again it still would not be enough because we could still race with
>> > io_submit_one() adding new IO to the ioctx which will be freed just after
>> > put_ioctx() in do_io_submit().
>> >
>> > The patch below implements what I have in mind - it should be probably
>> > split into two but I'd like to hear comments on that before doing these
>> > cosmetic touches ;)
>>
>> Well this seems to solve it too, but it is 2 problems here. It is changing
>> the semantics of io_destroy which requires the big synchronize_rcu()
>> hammer.
>>
>> But I don't believe that's necessarily desirable, or required. In fact it is
>> explicitly not reuired because it only says that it _may_ cancel outstanding
>> requests.
>  Well, we are not required to cancel all the outstanding AIO because of the
> API requirement, that's granted. But we must do it because of the way how
> the code is written. Outstanding IO requests reference ioctx but they are
> not counted in ctx->users but in ctx->reqs_active. So the code relies on
> the fact that the reference held by the hash table protects ctx from being
> freed and io_destroy() waits for requests before dropping the last
> reference to ctx. But there's the second race I describe making it possible
> for new IO to be created after io_destroy() has waited for all IO to
> finish...

Yes there is that race too I agree. I just didn't follow through the code far
enough to see it was a problem -- I thought it was by design.

I'd like to solve it without synchronize_rcu() though.

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [patch] fs: aio fix rcu lookup
  2011-01-19 16:03                   ` Nick Piggin
@ 2011-01-19 16:50                     ` Jan Kara
  2011-01-19 17:37                         ` Nick Piggin
  0 siblings, 1 reply; 40+ messages in thread
From: Jan Kara @ 2011-01-19 16:50 UTC (permalink / raw)
  To: Nick Piggin
  Cc: Jan Kara, Jeff Moyer, Andrew Morton, linux-fsdevel, linux-kernel

On Thu 20-01-11 03:03:23, Nick Piggin wrote:
> On Thu, Jan 20, 2011 at 12:21 AM, Jan Kara <jack@suse.cz> wrote:
> >  Well, we are not required to cancel all the outstanding AIO because of the
> > API requirement, that's granted. But we must do it because of the way how
> > the code is written. Outstanding IO requests reference ioctx but they are
> > not counted in ctx->users but in ctx->reqs_active. So the code relies on
> > the fact that the reference held by the hash table protects ctx from being
> > freed and io_destroy() waits for requests before dropping the last
> > reference to ctx. But there's the second race I describe making it possible
> > for new IO to be created after io_destroy() has waited for all IO to
> > finish...
> 
> Yes there is that race too I agree. I just didn't follow through the code far
> enough to see it was a problem -- I thought it was by design.
> 
> I'd like to solve it without synchronize_rcu() though.
  Ah, OK. I don't find io_destroy() performance critical but I can
understand that you need not like synchronize_rcu() there. ;) Then it
should be possible to make IO requests count in ctx->users which would
solve the race as well. We'd just have to be prepared that request
completion might put the last reference to ioctx and free it but that
shouldn't be an issue. Do you like that solution better?

								Honza
-- 
Jan Kara <jack@suse.cz>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [patch] fs: aio fix rcu lookup
  2011-01-19 16:50                     ` Jan Kara
@ 2011-01-19 17:37                         ` Nick Piggin
  0 siblings, 0 replies; 40+ messages in thread
From: Nick Piggin @ 2011-01-19 17:37 UTC (permalink / raw)
  To: Jan Kara; +Cc: Jeff Moyer, Andrew Morton, linux-fsdevel, linux-kernel

On Thu, Jan 20, 2011 at 3:50 AM, Jan Kara <jack@suse.cz> wrote:
> On Thu 20-01-11 03:03:23, Nick Piggin wrote:
>> On Thu, Jan 20, 2011 at 12:21 AM, Jan Kara <jack@suse.cz> wrote:
>> >  Well, we are not required to cancel all the outstanding AIO because of the
>> > API requirement, that's granted. But we must do it because of the way how
>> > the code is written. Outstanding IO requests reference ioctx but they are
>> > not counted in ctx->users but in ctx->reqs_active. So the code relies on
>> > the fact that the reference held by the hash table protects ctx from being
>> > freed and io_destroy() waits for requests before dropping the last
>> > reference to ctx. But there's the second race I describe making it possible
>> > for new IO to be created after io_destroy() has waited for all IO to
>> > finish...
>>
>> Yes there is that race too I agree. I just didn't follow through the code far
>> enough to see it was a problem -- I thought it was by design.
>>
>> I'd like to solve it without synchronize_rcu() though.
>  Ah, OK. I don't find io_destroy() performance critical but I can

Probably not performance critical, but it could be a very
large slowdown so somebody might complain.

> understand that you need not like synchronize_rcu() there. ;) Then it
> should be possible to make IO requests count in ctx->users which would
> solve the race as well. We'd just have to be prepared that request
> completion might put the last reference to ioctx and free it but that
> shouldn't be an issue. Do you like that solution better?

I think so, if it can be done without slowing things down
and adding locks or atomics if possible.

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [patch] fs: aio fix rcu lookup
@ 2011-01-19 17:37                         ` Nick Piggin
  0 siblings, 0 replies; 40+ messages in thread
From: Nick Piggin @ 2011-01-19 17:37 UTC (permalink / raw)
  To: Jan Kara; +Cc: Jeff Moyer, Andrew Morton, linux-fsdevel, linux-kernel

On Thu, Jan 20, 2011 at 3:50 AM, Jan Kara <jack@suse.cz> wrote:
> On Thu 20-01-11 03:03:23, Nick Piggin wrote:
>> On Thu, Jan 20, 2011 at 12:21 AM, Jan Kara <jack@suse.cz> wrote:
>> >  Well, we are not required to cancel all the outstanding AIO because of the
>> > API requirement, that's granted. But we must do it because of the way how
>> > the code is written. Outstanding IO requests reference ioctx but they are
>> > not counted in ctx->users but in ctx->reqs_active. So the code relies on
>> > the fact that the reference held by the hash table protects ctx from being
>> > freed and io_destroy() waits for requests before dropping the last
>> > reference to ctx. But there's the second race I describe making it possible
>> > for new IO to be created after io_destroy() has waited for all IO to
>> > finish...
>>
>> Yes there is that race too I agree. I just didn't follow through the code far
>> enough to see it was a problem -- I thought it was by design.
>>
>> I'd like to solve it without synchronize_rcu() though.
>  Ah, OK. I don't find io_destroy() performance critical but I can

Probably not performance critical, but it could be a very
large slowdown so somebody might complain.

> understand that you need not like synchronize_rcu() there. ;) Then it
> should be possible to make IO requests count in ctx->users which would
> solve the race as well. We'd just have to be prepared that request
> completion might put the last reference to ioctx and free it but that
> shouldn't be an issue. Do you like that solution better?

I think so, if it can be done without slowing things down
and adding locks or atomics if possible.
--
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [patch] fs: aio fix rcu lookup
  2011-01-19 13:21                 ` Jan Kara
  2011-01-19 16:03                   ` Nick Piggin
@ 2011-01-19 19:13                   ` Jeff Moyer
  2011-01-19 19:46                     ` Jeff Moyer
  1 sibling, 1 reply; 40+ messages in thread
From: Jeff Moyer @ 2011-01-19 19:13 UTC (permalink / raw)
  To: Jan Kara; +Cc: Nick Piggin, Andrew Morton, linux-fsdevel, linux-kernel

Jan Kara <jack@suse.cz> writes:

>  But there's the second race I describe making it possible
> for new IO to be created after io_destroy() has waited for all IO to
> finish...

Can't that be solved by introducing memory barriers around the accesses
to ->dead?

Cheers,
Jeff

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [patch] fs: aio fix rcu lookup
  2011-01-19 19:13                   ` Jeff Moyer
@ 2011-01-19 19:46                     ` Jeff Moyer
  2011-01-19 20:18                       ` Nick Piggin
  0 siblings, 1 reply; 40+ messages in thread
From: Jeff Moyer @ 2011-01-19 19:46 UTC (permalink / raw)
  To: Jan Kara; +Cc: Nick Piggin, Andrew Morton, linux-fsdevel, linux-kernel

Jeff Moyer <jmoyer@redhat.com> writes:

> Jan Kara <jack@suse.cz> writes:
>
>>  But there's the second race I describe making it possible
>> for new IO to be created after io_destroy() has waited for all IO to
>> finish...
>
> Can't that be solved by introducing memory barriers around the accesses
> to ->dead?

Upon further consideration, I don't think so.

Given the options, I think adding the synchronize rcu to the io_destroy
path is the best way forward.  You're already waiting for a bunch of
queued I/O to finish, so there is no guarantee that you're going to
finish that call quickly.

Cheers,
Jeff

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [patch] fs: aio fix rcu lookup
  2011-01-19 19:46                     ` Jeff Moyer
@ 2011-01-19 20:18                       ` Nick Piggin
  2011-01-19 20:32                         ` Jeff Moyer
  0 siblings, 1 reply; 40+ messages in thread
From: Nick Piggin @ 2011-01-19 20:18 UTC (permalink / raw)
  To: Jeff Moyer
  Cc: Jan Kara, Andrew Morton, linux-fsdevel, linux-kernel, Paul E. McKenney

On Thu, Jan 20, 2011 at 6:46 AM, Jeff Moyer <jmoyer@redhat.com> wrote:
> Jeff Moyer <jmoyer@redhat.com> writes:
>
>> Jan Kara <jack@suse.cz> writes:
>>
>>>  But there's the second race I describe making it possible
>>> for new IO to be created after io_destroy() has waited for all IO to
>>> finish...
>>
>> Can't that be solved by introducing memory barriers around the accesses
>> to ->dead?
>
> Upon further consideration, I don't think so.
>
> Given the options, I think adding the synchronize rcu to the io_destroy
> path is the best way forward.  You're already waiting for a bunch of
> queued I/O to finish, so there is no guarantee that you're going to
> finish that call quickly.

I think synchronize_rcu() is not something to sprinkle around outside
very slow paths. It can be done without synchronize_rcu.

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [patch] fs: aio fix rcu lookup
  2011-01-19 20:18                       ` Nick Piggin
@ 2011-01-19 20:32                         ` Jeff Moyer
  2011-01-19 20:45                           ` Nick Piggin
  0 siblings, 1 reply; 40+ messages in thread
From: Jeff Moyer @ 2011-01-19 20:32 UTC (permalink / raw)
  To: Nick Piggin
  Cc: Jan Kara, Andrew Morton, linux-fsdevel, linux-kernel, Paul E. McKenney

Nick Piggin <npiggin@gmail.com> writes:

> On Thu, Jan 20, 2011 at 6:46 AM, Jeff Moyer <jmoyer@redhat.com> wrote:
>> Jeff Moyer <jmoyer@redhat.com> writes:
>>
>>> Jan Kara <jack@suse.cz> writes:
>>>
>>>>  But there's the second race I describe making it possible
>>>> for new IO to be created after io_destroy() has waited for all IO to
>>>> finish...
>>>
>>> Can't that be solved by introducing memory barriers around the accesses
>>> to ->dead?
>>
>> Upon further consideration, I don't think so.
>>
>> Given the options, I think adding the synchronize rcu to the io_destroy
>> path is the best way forward.  You're already waiting for a bunch of
>> queued I/O to finish, so there is no guarantee that you're going to
>> finish that call quickly.
>
> I think synchronize_rcu() is not something to sprinkle around outside
> very slow paths. It can be done without synchronize_rcu.

I'm not sure I understand what you're saying.  Do you mean to imply that
io_destroy is not a very slow path?  Because it is.  I prefer a solution
that doesn't re-architecht things in order to solve a theoretical issue
that's never been observed.

Cheers,
Jeff

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [patch] fs: aio fix rcu lookup
  2011-01-19 20:32                         ` Jeff Moyer
@ 2011-01-19 20:45                           ` Nick Piggin
  2011-01-19 21:03                               ` Jeff Moyer
  0 siblings, 1 reply; 40+ messages in thread
From: Nick Piggin @ 2011-01-19 20:45 UTC (permalink / raw)
  To: Jeff Moyer
  Cc: Jan Kara, Andrew Morton, linux-fsdevel, linux-kernel, Paul E. McKenney

On Thu, Jan 20, 2011 at 7:32 AM, Jeff Moyer <jmoyer@redhat.com> wrote:
> Nick Piggin <npiggin@gmail.com> writes:
>
>> On Thu, Jan 20, 2011 at 6:46 AM, Jeff Moyer <jmoyer@redhat.com> wrote:
>>> Jeff Moyer <jmoyer@redhat.com> writes:
>>>
>>>> Jan Kara <jack@suse.cz> writes:
>>>>
>>>>>  But there's the second race I describe making it possible
>>>>> for new IO to be created after io_destroy() has waited for all IO to
>>>>> finish...
>>>>
>>>> Can't that be solved by introducing memory barriers around the accesses
>>>> to ->dead?
>>>
>>> Upon further consideration, I don't think so.
>>>
>>> Given the options, I think adding the synchronize rcu to the io_destroy
>>> path is the best way forward.  You're already waiting for a bunch of
>>> queued I/O to finish, so there is no guarantee that you're going to
>>> finish that call quickly.
>>
>> I think synchronize_rcu() is not something to sprinkle around outside
>> very slow paths. It can be done without synchronize_rcu.
>
> I'm not sure I understand what you're saying.  Do you mean to imply that
> io_destroy is not a very slow path?  Because it is.  I prefer a solution
> that doesn't re-architecht things in order to solve a theoretical issue
> that's never been observed.

Even something that happens once per process lifetime, like in fork/exit
is not necessarily suitable for RCU. I don't know exactly how all programs
use io_destroy -- of the small number that do, probably an even smaller
number would care here. But I don't think it simplifies things enough to
use synchronize_rcu for it.

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [patch] fs: aio fix rcu lookup
  2011-01-19 20:45                           ` Nick Piggin
@ 2011-01-19 21:03                               ` Jeff Moyer
  0 siblings, 0 replies; 40+ messages in thread
From: Jeff Moyer @ 2011-01-19 21:03 UTC (permalink / raw)
  To: Nick Piggin
  Cc: Jan Kara, Andrew Morton, linux-fsdevel, linux-kernel, Paul E. McKenney

Nick Piggin <npiggin@gmail.com> writes:

> On Thu, Jan 20, 2011 at 7:32 AM, Jeff Moyer <jmoyer@redhat.com> wrote:
>> Nick Piggin <npiggin@gmail.com> writes:
>>
>>> On Thu, Jan 20, 2011 at 6:46 AM, Jeff Moyer <jmoyer@redhat.com> wrote:
>>>> Jeff Moyer <jmoyer@redhat.com> writes:
>>>>
>>>>> Jan Kara <jack@suse.cz> writes:
>>>>>
>>>>>>  But there's the second race I describe making it possible
>>>>>> for new IO to be created after io_destroy() has waited for all IO to
>>>>>> finish...
>>>>>
>>>>> Can't that be solved by introducing memory barriers around the accesses
>>>>> to ->dead?
>>>>
>>>> Upon further consideration, I don't think so.
>>>>
>>>> Given the options, I think adding the synchronize rcu to the io_destroy
>>>> path is the best way forward.  You're already waiting for a bunch of
>>>> queued I/O to finish, so there is no guarantee that you're going to
>>>> finish that call quickly.
>>>
>>> I think synchronize_rcu() is not something to sprinkle around outside
>>> very slow paths. It can be done without synchronize_rcu.
>>
>> I'm not sure I understand what you're saying.  Do you mean to imply that
>> io_destroy is not a very slow path?  Because it is.  I prefer a solution
>> that doesn't re-architecht things in order to solve a theoretical issue
>> that's never been observed.
>
> Even something that happens once per process lifetime, like in fork/exit
> is not necessarily suitable for RCU.

Now you've really lost me.  ;-)  Processes which utilize the in-kernel
aio interface typically create an ioctx at process startup, use that for
submitting all of their io, then destroy it on exit.  Think of a
database.  Every time you call io_submit, you're doing a lookup of the
ioctx.

> I don't know exactly how all programs use io_destroy -- of the small
> number that do, probably an even smaller number would care here. But I
> don't think it simplifies things enough to use synchronize_rcu for it.

Above it sounded like you didn't think AIO should be using RCU at all.
Here it sounds like you are just against synchronize_rcu.  Which is it?
And if the latter, then please tell me in what cases you feel one would
be justified in calling synchronize_rcu.  For now, I simply disagree
with you.  As I said before, you're already potentially waiting for disk
I/O to complete.  It doesn't get much worse than that for latency.

Cheers,
Jeff

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [patch] fs: aio fix rcu lookup
@ 2011-01-19 21:03                               ` Jeff Moyer
  0 siblings, 0 replies; 40+ messages in thread
From: Jeff Moyer @ 2011-01-19 21:03 UTC (permalink / raw)
  To: Nick Piggin
  Cc: Jan Kara, Andrew Morton, linux-fsdevel, linux-kernel, Paul E. McKenney

Nick Piggin <npiggin@gmail.com> writes:

> On Thu, Jan 20, 2011 at 7:32 AM, Jeff Moyer <jmoyer@redhat.com> wrote:
>> Nick Piggin <npiggin@gmail.com> writes:
>>
>>> On Thu, Jan 20, 2011 at 6:46 AM, Jeff Moyer <jmoyer@redhat.com> wrote:
>>>> Jeff Moyer <jmoyer@redhat.com> writes:
>>>>
>>>>> Jan Kara <jack@suse.cz> writes:
>>>>>
>>>>>>  But there's the second race I describe making it possible
>>>>>> for new IO to be created after io_destroy() has waited for all IO to
>>>>>> finish...
>>>>>
>>>>> Can't that be solved by introducing memory barriers around the accesses
>>>>> to ->dead?
>>>>
>>>> Upon further consideration, I don't think so.
>>>>
>>>> Given the options, I think adding the synchronize rcu to the io_destroy
>>>> path is the best way forward.  You're already waiting for a bunch of
>>>> queued I/O to finish, so there is no guarantee that you're going to
>>>> finish that call quickly.
>>>
>>> I think synchronize_rcu() is not something to sprinkle around outside
>>> very slow paths. It can be done without synchronize_rcu.
>>
>> I'm not sure I understand what you're saying.  Do you mean to imply that
>> io_destroy is not a very slow path?  Because it is.  I prefer a solution
>> that doesn't re-architecht things in order to solve a theoretical issue
>> that's never been observed.
>
> Even something that happens once per process lifetime, like in fork/exit
> is not necessarily suitable for RCU.

Now you've really lost me.  ;-)  Processes which utilize the in-kernel
aio interface typically create an ioctx at process startup, use that for
submitting all of their io, then destroy it on exit.  Think of a
database.  Every time you call io_submit, you're doing a lookup of the
ioctx.

> I don't know exactly how all programs use io_destroy -- of the small
> number that do, probably an even smaller number would care here. But I
> don't think it simplifies things enough to use synchronize_rcu for it.

Above it sounded like you didn't think AIO should be using RCU at all.
Here it sounds like you are just against synchronize_rcu.  Which is it?
And if the latter, then please tell me in what cases you feel one would
be justified in calling synchronize_rcu.  For now, I simply disagree
with you.  As I said before, you're already potentially waiting for disk
I/O to complete.  It doesn't get much worse than that for latency.

Cheers,
Jeff
--
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [patch] fs: aio fix rcu lookup
  2011-01-19 21:03                               ` Jeff Moyer
@ 2011-01-19 21:20                                 ` Nick Piggin
  -1 siblings, 0 replies; 40+ messages in thread
From: Nick Piggin @ 2011-01-19 21:20 UTC (permalink / raw)
  To: Jeff Moyer
  Cc: Jan Kara, Andrew Morton, linux-fsdevel, linux-kernel, Paul E. McKenney

On Thu, Jan 20, 2011 at 8:03 AM, Jeff Moyer <jmoyer@redhat.com> wrote:
> Nick Piggin <npiggin@gmail.com> writes:
>
>> On Thu, Jan 20, 2011 at 7:32 AM, Jeff Moyer <jmoyer@redhat.com> wrote:
>>> Nick Piggin <npiggin@gmail.com> writes:
>>>
>>>> On Thu, Jan 20, 2011 at 6:46 AM, Jeff Moyer <jmoyer@redhat.com> wrote:
>>>>> Jeff Moyer <jmoyer@redhat.com> writes:
>>>>>
>>>>>> Jan Kara <jack@suse.cz> writes:
>>>>>>
>>>>>>>  But there's the second race I describe making it possible
>>>>>>> for new IO to be created after io_destroy() has waited for all IO to
>>>>>>> finish...
>>>>>>
>>>>>> Can't that be solved by introducing memory barriers around the accesses
>>>>>> to ->dead?
>>>>>
>>>>> Upon further consideration, I don't think so.
>>>>>
>>>>> Given the options, I think adding the synchronize rcu to the io_destroy
>>>>> path is the best way forward.  You're already waiting for a bunch of
>>>>> queued I/O to finish, so there is no guarantee that you're going to
>>>>> finish that call quickly.
>>>>
>>>> I think synchronize_rcu() is not something to sprinkle around outside
>>>> very slow paths. It can be done without synchronize_rcu.
>>>
>>> I'm not sure I understand what you're saying.  Do you mean to imply that
>>> io_destroy is not a very slow path?  Because it is.  I prefer a solution
>>> that doesn't re-architecht things in order to solve a theoretical issue
>>> that's never been observed.
>>
>> Even something that happens once per process lifetime, like in fork/exit
>> is not necessarily suitable for RCU.
>
> Now you've really lost me.  ;-)  Processes which utilize the in-kernel
> aio interface typically create an ioctx at process startup, use that for
> submitting all of their io, then destroy it on exit.  Think of a
> database.  Every time you call io_submit, you're doing a lookup of the
> ioctx.
>
>> I don't know exactly how all programs use io_destroy -- of the small
>> number that do, probably an even smaller number would care here. But I
>> don't think it simplifies things enough to use synchronize_rcu for it.
>
> Above it sounded like you didn't think AIO should be using RCU at all.

synchronize_rcu of course, not RCU (typo).

> Here it sounds like you are just against synchronize_rcu.  Which is it?
> And if the latter, then please tell me in what cases you feel one would
> be justified in calling synchronize_rcu.  For now, I simply disagree
> with you.  As I said before, you're already potentially waiting for disk
> I/O to complete.  It doesn't get much worse than that for latency.

I think synchronize_rcu should firstly not be used unless it gives a good
simplification, or speedup in fastpath.

When that is satified, then it is a question of exactly what kind of slow
path it should be used in. I don't think it should be used in process-
synchronous code (eg syscalls) except for error cases, resource
exhaustion, management syscalls (like module unload).

For example "it's waiting for IO anyway" is not a good reason, IMO.
Firstly because it may not be waiting for a 10ms disk IO, it may be
waiting for anything up to an in-RAM device. Secondly because it
could be quite slow depending on the RCU model used.

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [patch] fs: aio fix rcu lookup
@ 2011-01-19 21:20                                 ` Nick Piggin
  0 siblings, 0 replies; 40+ messages in thread
From: Nick Piggin @ 2011-01-19 21:20 UTC (permalink / raw)
  To: Jeff Moyer
  Cc: Jan Kara, Andrew Morton, linux-fsdevel, linux-kernel, Paul E. McKenney

On Thu, Jan 20, 2011 at 8:03 AM, Jeff Moyer <jmoyer@redhat.com> wrote:
> Nick Piggin <npiggin@gmail.com> writes:
>
>> On Thu, Jan 20, 2011 at 7:32 AM, Jeff Moyer <jmoyer@redhat.com> wrote:
>>> Nick Piggin <npiggin@gmail.com> writes:
>>>
>>>> On Thu, Jan 20, 2011 at 6:46 AM, Jeff Moyer <jmoyer@redhat.com> wrote:
>>>>> Jeff Moyer <jmoyer@redhat.com> writes:
>>>>>
>>>>>> Jan Kara <jack@suse.cz> writes:
>>>>>>
>>>>>>>  But there's the second race I describe making it possible
>>>>>>> for new IO to be created after io_destroy() has waited for all IO to
>>>>>>> finish...
>>>>>>
>>>>>> Can't that be solved by introducing memory barriers around the accesses
>>>>>> to ->dead?
>>>>>
>>>>> Upon further consideration, I don't think so.
>>>>>
>>>>> Given the options, I think adding the synchronize rcu to the io_destroy
>>>>> path is the best way forward.  You're already waiting for a bunch of
>>>>> queued I/O to finish, so there is no guarantee that you're going to
>>>>> finish that call quickly.
>>>>
>>>> I think synchronize_rcu() is not something to sprinkle around outside
>>>> very slow paths. It can be done without synchronize_rcu.
>>>
>>> I'm not sure I understand what you're saying.  Do you mean to imply that
>>> io_destroy is not a very slow path?  Because it is.  I prefer a solution
>>> that doesn't re-architecht things in order to solve a theoretical issue
>>> that's never been observed.
>>
>> Even something that happens once per process lifetime, like in fork/exit
>> is not necessarily suitable for RCU.
>
> Now you've really lost me.  ;-)  Processes which utilize the in-kernel
> aio interface typically create an ioctx at process startup, use that for
> submitting all of their io, then destroy it on exit.  Think of a
> database.  Every time you call io_submit, you're doing a lookup of the
> ioctx.
>
>> I don't know exactly how all programs use io_destroy -- of the small
>> number that do, probably an even smaller number would care here. But I
>> don't think it simplifies things enough to use synchronize_rcu for it.
>
> Above it sounded like you didn't think AIO should be using RCU at all.

synchronize_rcu of course, not RCU (typo).

> Here it sounds like you are just against synchronize_rcu.  Which is it?
> And if the latter, then please tell me in what cases you feel one would
> be justified in calling synchronize_rcu.  For now, I simply disagree
> with you.  As I said before, you're already potentially waiting for disk
> I/O to complete.  It doesn't get much worse than that for latency.

I think synchronize_rcu should firstly not be used unless it gives a good
simplification, or speedup in fastpath.

When that is satified, then it is a question of exactly what kind of slow
path it should be used in. I don't think it should be used in process-
synchronous code (eg syscalls) except for error cases, resource
exhaustion, management syscalls (like module unload).

For example "it's waiting for IO anyway" is not a good reason, IMO.
Firstly because it may not be waiting for a 10ms disk IO, it may be
waiting for anything up to an in-RAM device. Secondly because it
could be quite slow depending on the RCU model used.
--
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [patch] fs: aio fix rcu lookup
  2011-01-19 21:20                                 ` Nick Piggin
  (?)
@ 2011-01-20  4:03                                 ` Paul E. McKenney
  2011-01-20 18:31                                   ` Nick Piggin
  -1 siblings, 1 reply; 40+ messages in thread
From: Paul E. McKenney @ 2011-01-20  4:03 UTC (permalink / raw)
  To: Nick Piggin
  Cc: Jeff Moyer, Jan Kara, Andrew Morton, linux-fsdevel, linux-kernel

On Thu, Jan 20, 2011 at 08:20:00AM +1100, Nick Piggin wrote:
> On Thu, Jan 20, 2011 at 8:03 AM, Jeff Moyer <jmoyer@redhat.com> wrote:
> > Nick Piggin <npiggin@gmail.com> writes:
> >
> >> On Thu, Jan 20, 2011 at 7:32 AM, Jeff Moyer <jmoyer@redhat.com> wrote:
> >>> Nick Piggin <npiggin@gmail.com> writes:
> >>>
> >>>> On Thu, Jan 20, 2011 at 6:46 AM, Jeff Moyer <jmoyer@redhat.com> wrote:
> >>>>> Jeff Moyer <jmoyer@redhat.com> writes:
> >>>>>
> >>>>>> Jan Kara <jack@suse.cz> writes:
> >>>>>>
> >>>>>>>  But there's the second race I describe making it possible
> >>>>>>> for new IO to be created after io_destroy() has waited for all IO to
> >>>>>>> finish...
> >>>>>>
> >>>>>> Can't that be solved by introducing memory barriers around the accesses
> >>>>>> to ->dead?
> >>>>>
> >>>>> Upon further consideration, I don't think so.
> >>>>>
> >>>>> Given the options, I think adding the synchronize rcu to the io_destroy
> >>>>> path is the best way forward.  You're already waiting for a bunch of
> >>>>> queued I/O to finish, so there is no guarantee that you're going to
> >>>>> finish that call quickly.
> >>>>
> >>>> I think synchronize_rcu() is not something to sprinkle around outside
> >>>> very slow paths. It can be done without synchronize_rcu.
> >>>
> >>> I'm not sure I understand what you're saying.  Do you mean to imply that
> >>> io_destroy is not a very slow path?  Because it is.  I prefer a solution
> >>> that doesn't re-architecht things in order to solve a theoretical issue
> >>> that's never been observed.
> >>
> >> Even something that happens once per process lifetime, like in fork/exit
> >> is not necessarily suitable for RCU.
> >
> > Now you've really lost me.  ;-)  Processes which utilize the in-kernel
> > aio interface typically create an ioctx at process startup, use that for
> > submitting all of their io, then destroy it on exit.  Think of a
> > database.  Every time you call io_submit, you're doing a lookup of the
> > ioctx.
> >
> >> I don't know exactly how all programs use io_destroy -- of the small
> >> number that do, probably an even smaller number would care here. But I
> >> don't think it simplifies things enough to use synchronize_rcu for it.
> >
> > Above it sounded like you didn't think AIO should be using RCU at all.
> 
> synchronize_rcu of course, not RCU (typo).

I think that Nick is suggesting that call_rcu() be used instead.
Perhaps also very sparing use of synchronize_rcu_expedited(), which
is faster than synchronize_rcu(), but which which uses more CPU time.

							Thanx, Paul

> > Here it sounds like you are just against synchronize_rcu.  Which is it?
> > And if the latter, then please tell me in what cases you feel one would
> > be justified in calling synchronize_rcu.  For now, I simply disagree
> > with you.  As I said before, you're already potentially waiting for disk
> > I/O to complete.  It doesn't get much worse than that for latency.
> 
> I think synchronize_rcu should firstly not be used unless it gives a good
> simplification, or speedup in fastpath.
> 
> When that is satified, then it is a question of exactly what kind of slow
> path it should be used in. I don't think it should be used in process-
> synchronous code (eg syscalls) except for error cases, resource
> exhaustion, management syscalls (like module unload).
> 
> For example "it's waiting for IO anyway" is not a good reason, IMO.
> Firstly because it may not be waiting for a 10ms disk IO, it may be
> waiting for anything up to an in-RAM device. Secondly because it
> could be quite slow depending on the RCU model used.

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [patch] fs: aio fix rcu lookup
  2011-01-20  4:03                                 ` Paul E. McKenney
@ 2011-01-20 18:31                                   ` Nick Piggin
  2011-01-20 20:02                                     ` Paul E. McKenney
  2011-01-20 20:16                                     ` Jan Kara
  0 siblings, 2 replies; 40+ messages in thread
From: Nick Piggin @ 2011-01-20 18:31 UTC (permalink / raw)
  To: paulmck; +Cc: Jeff Moyer, Jan Kara, Andrew Morton, linux-fsdevel, linux-kernel

On Thu, Jan 20, 2011 at 3:03 PM, Paul E. McKenney
<paulmck@linux.vnet.ibm.com> wrote:
> On Thu, Jan 20, 2011 at 08:20:00AM +1100, Nick Piggin wrote:
>> On Thu, Jan 20, 2011 at 8:03 AM, Jeff Moyer <jmoyer@redhat.com> wrote:
>> >> I don't know exactly how all programs use io_destroy -- of the small
>> >> number that do, probably an even smaller number would care here. But I
>> >> don't think it simplifies things enough to use synchronize_rcu for it.
>> >
>> > Above it sounded like you didn't think AIO should be using RCU at all.
>>
>> synchronize_rcu of course, not RCU (typo).
>
> I think that Nick is suggesting that call_rcu() be used instead.
> Perhaps also very sparing use of synchronize_rcu_expedited(), which
> is faster than synchronize_rcu(), but which which uses more CPU time.

call_rcu() is the obvious alternative, yes.

Basically, once we give in to synchronize_rcu() we're basically giving
up. That's certainly a very good tradeoff for something like filesystem
unregistration or module unload, it buys big simplifications in real
fastpaths. But I just don't think it should be taken lightly.

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [patch] fs: aio fix rcu lookup
  2011-01-20 18:31                                   ` Nick Piggin
@ 2011-01-20 20:02                                     ` Paul E. McKenney
  2011-01-20 20:15                                       ` Eric Dumazet
  2011-01-20 20:16                                     ` Jan Kara
  1 sibling, 1 reply; 40+ messages in thread
From: Paul E. McKenney @ 2011-01-20 20:02 UTC (permalink / raw)
  To: Nick Piggin
  Cc: Jeff Moyer, Jan Kara, Andrew Morton, linux-fsdevel, linux-kernel

On Fri, Jan 21, 2011 at 05:31:53AM +1100, Nick Piggin wrote:
> On Thu, Jan 20, 2011 at 3:03 PM, Paul E. McKenney
> <paulmck@linux.vnet.ibm.com> wrote:
> > On Thu, Jan 20, 2011 at 08:20:00AM +1100, Nick Piggin wrote:
> >> On Thu, Jan 20, 2011 at 8:03 AM, Jeff Moyer <jmoyer@redhat.com> wrote:
> >> >> I don't know exactly how all programs use io_destroy -- of the small
> >> >> number that do, probably an even smaller number would care here. But I
> >> >> don't think it simplifies things enough to use synchronize_rcu for it.
> >> >
> >> > Above it sounded like you didn't think AIO should be using RCU at all.
> >>
> >> synchronize_rcu of course, not RCU (typo).
> >
> > I think that Nick is suggesting that call_rcu() be used instead.
> > Perhaps also very sparing use of synchronize_rcu_expedited(), which
> > is faster than synchronize_rcu(), but which which uses more CPU time.
> 
> call_rcu() is the obvious alternative, yes.
> 
> Basically, once we give in to synchronize_rcu() we're basically giving
> up. That's certainly a very good tradeoff for something like filesystem
> unregistration or module unload, it buys big simplifications in real
> fastpaths. But I just don't think it should be taken lightly.

Makes sense to me!

BTW, on your earlier usage classification:

> I think synchronize_rcu should firstly not be used unless it gives a good
> simplification, or speedup in fastpath.
>
> When that is satified, then it is a question of exactly what kind of slow
> path it should be used in. I don't think it should be used in process-
> synchronous code (eg syscalls) except for error cases, resource
> exhaustion, management syscalls (like module unload).

I don't have any feedback either way on your guidance to where
synchronize_rcu() should be used, as I believe that it depends a lot
on the details of usage, and would vary from one part of the kernel
to another, and possibly also over time.

But I am very glad to see that you have been thinking about it and
that you are putting forth some clear guidelines!

							Thanx, Paul

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [patch] fs: aio fix rcu lookup
  2011-01-20 20:02                                     ` Paul E. McKenney
@ 2011-01-20 20:15                                       ` Eric Dumazet
  2011-01-21 21:22                                         ` Paul E. McKenney
  0 siblings, 1 reply; 40+ messages in thread
From: Eric Dumazet @ 2011-01-20 20:15 UTC (permalink / raw)
  To: paulmck
  Cc: Nick Piggin, Jeff Moyer, Jan Kara, Andrew Morton, linux-fsdevel,
	linux-kernel

Le jeudi 20 janvier 2011 à 12:02 -0800, Paul E. McKenney a écrit :
> On Fri, Jan 21, 2011 at 05:31:53AM +1100, Nick Piggin wrote:
> > call_rcu() is the obvious alternative, yes.
> > 
> > Basically, once we give in to synchronize_rcu() we're basically giving
> > up. That's certainly a very good tradeoff for something like filesystem
> > unregistration or module unload, it buys big simplifications in real
> > fastpaths. But I just don't think it should be taken lightly.
> 
> Makes sense to me!
> 
> BTW, on your earlier usage classification:
> 
> > I think synchronize_rcu should firstly not be used unless it gives a good
> > simplification, or speedup in fastpath.
> >
> > When that is satified, then it is a question of exactly what kind of slow
> > path it should be used in. I don't think it should be used in process-
> > synchronous code (eg syscalls) except for error cases, resource
> > exhaustion, management syscalls (like module unload).
> 
> I don't have any feedback either way on your guidance to where
> synchronize_rcu() should be used, as I believe that it depends a lot
> on the details of usage, and would vary from one part of the kernel
> to another, and possibly also over time.
> 

Sometime, a mixture of call_rcu() and synchronize_rcu() is used, to have
a limit on pending callbacks (eating too much memory)

net/ipv4/fib_trie.c for example issues call_rcu() most of the time, but
is able to trigger one synchronize_rcu() if more than XXX (128) pages of
memory were queued in rcu queues.

For details, check commit c3059477fce2d956
(ipv4: Use synchronize_rcu() during trie_rebalance())




^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [patch] fs: aio fix rcu lookup
  2011-01-20 18:31                                   ` Nick Piggin
  2011-01-20 20:02                                     ` Paul E. McKenney
@ 2011-01-20 20:16                                     ` Jan Kara
  2011-01-20 21:16                                       ` Jeff Moyer
  2011-02-01 16:24                                       ` Jan Kara
  1 sibling, 2 replies; 40+ messages in thread
From: Jan Kara @ 2011-01-20 20:16 UTC (permalink / raw)
  To: Nick Piggin
  Cc: paulmck, Jeff Moyer, Jan Kara, Andrew Morton, linux-fsdevel,
	linux-kernel

[-- Attachment #1: Type: text/plain, Size: 1521 bytes --]

On Fri 21-01-11 05:31:53, Nick Piggin wrote:
> On Thu, Jan 20, 2011 at 3:03 PM, Paul E. McKenney
> <paulmck@linux.vnet.ibm.com> wrote:
> > On Thu, Jan 20, 2011 at 08:20:00AM +1100, Nick Piggin wrote:
> >> On Thu, Jan 20, 2011 at 8:03 AM, Jeff Moyer <jmoyer@redhat.com> wrote:
> >> >> I don't know exactly how all programs use io_destroy -- of the small
> >> >> number that do, probably an even smaller number would care here. But I
> >> >> don't think it simplifies things enough to use synchronize_rcu for it.
> >> >
> >> > Above it sounded like you didn't think AIO should be using RCU at all.
> >>
> >> synchronize_rcu of course, not RCU (typo).
> >
> > I think that Nick is suggesting that call_rcu() be used instead.
> > Perhaps also very sparing use of synchronize_rcu_expedited(), which
> > is faster than synchronize_rcu(), but which which uses more CPU time.
> 
> call_rcu() is the obvious alternative, yes.
> 
> Basically, once we give in to synchronize_rcu() we're basically giving
> up. That's certainly a very good tradeoff for something like filesystem
> unregistration or module unload, it buys big simplifications in real
> fastpaths. But I just don't think it should be taken lightly.
So in the end, I've realized I don't need synchronize_rcu() at all and
in fact everything is OK even without call_rcu() if I base my fix on top
of your patch.

Attached is your patch with added comment I proposed and also a patch
fixing the second race. Better?

								Honza
-- 
Jan Kara <jack@suse.cz>
SUSE Labs, CR

[-- Attachment #2: 0001-fs-Fix-aio-rcu-ioctx-lookup.patch --]
[-- Type: text/x-patch, Size: 2358 bytes --]

>From 68857d7f2087edbbc5ee1d828f151ac46406f3be Mon Sep 17 00:00:00 2001
From: Nick Piggin <npiggin@gmail.com>
Date: Thu, 20 Jan 2011 20:08:52 +0100
Subject: [PATCH 1/2] fs: Fix aio rcu ioctx lookup

aio-dio-invalidate-failure GPFs in aio_put_req from io_submit.

lookup_ioctx doesn't implement the rcu lookup pattern properly.  rcu_read_lock
does not prevent refcount going to zero, so we might take a refcount on a zero
count ioctx.

Fix the bug by atomically testing for zero refcount before incrementing.

[JK: Added comment into the code]

Signed-off-by: Nick Piggin <npiggin@kernel.dk>
Signed-off-by: Jan Kara <jack@suse.cz>
---
 fs/aio.c |   35 ++++++++++++++++++++++++-----------
 1 files changed, 24 insertions(+), 11 deletions(-)

diff --git a/fs/aio.c b/fs/aio.c
index fc557a3..b4dd668 100644
--- a/fs/aio.c
+++ b/fs/aio.c
@@ -239,15 +239,23 @@ static void __put_ioctx(struct kioctx *ctx)
 	call_rcu(&ctx->rcu_head, ctx_rcu_free);
 }
 
-#define get_ioctx(kioctx) do {						\
-	BUG_ON(atomic_read(&(kioctx)->users) <= 0);			\
-	atomic_inc(&(kioctx)->users);					\
-} while (0)
-#define put_ioctx(kioctx) do {						\
-	BUG_ON(atomic_read(&(kioctx)->users) <= 0);			\
-	if (unlikely(atomic_dec_and_test(&(kioctx)->users))) 		\
-		__put_ioctx(kioctx);					\
-} while (0)
+static inline void get_ioctx(struct kioctx *kioctx)
+{
+	BUG_ON(atomic_read(&kioctx->users) <= 0);
+	atomic_inc(&kioctx->users);
+}
+
+static inline int try_get_ioctx(struct kioctx *kioctx)
+{
+	return atomic_inc_not_zero(&kioctx->users);
+}
+
+static inline void put_ioctx(struct kioctx *kioctx)
+{
+	BUG_ON(atomic_read(&kioctx->users) <= 0);
+	if (unlikely(atomic_dec_and_test(&kioctx->users)))
+		__put_ioctx(kioctx);
+}
 
 /* ioctx_alloc
  *	Allocates and initializes an ioctx.  Returns an ERR_PTR if it failed.
@@ -601,8 +609,13 @@ static struct kioctx *lookup_ioctx(unsigned long ctx_id)
 	rcu_read_lock();
 
 	hlist_for_each_entry_rcu(ctx, n, &mm->ioctx_list, list) {
-		if (ctx->user_id == ctx_id && !ctx->dead) {
-			get_ioctx(ctx);
+		/*
+		 * RCU protects us against accessing freed memory but
+		 * we have to be careful not to get a reference when the
+		 * reference count already dropped to 0 (ctx->dead test
+		 * is unreliable because of races).
+		 */
+		if (ctx->user_id == ctx_id && !ctx->dead && try_get_ioctx(ctx)){
 			ret = ctx;
 			break;
 		}
-- 
1.7.1


[-- Attachment #3: 0002-fs-Fix-race-between-io_destroy-and-io_submit-in-AIO.patch --]
[-- Type: text/x-patch, Size: 1827 bytes --]

>From 6d5375d55b5d88e8ceda739052566e033be620c2 Mon Sep 17 00:00:00 2001
From: Jan Kara <jack@suse.cz>
Date: Wed, 19 Jan 2011 00:37:48 +0100
Subject: [PATCH 2/2] fs: Fix race between io_destroy() and io_submit() in AIO

A race can occur when io_submit() races with io_destroy():

 CPU1						CPU2
io_submit()
  do_io_submit()
    ...
    ctx = lookup_ioctx(ctx_id);
						io_destroy()
    Now do_io_submit() holds the last reference to ctx.
    ...
    queue new AIO
    put_ioctx(ctx) - frees ctx with active AIOs

We solve this issue by checking whether ctx is being destroyed
in AIO submission path after adding new AIO to ctx. Then we
are guaranteed that either io_destroy() waits for new AIO or
we see that ctx is being destroyed and bail out.

Signed-off-by: Jan Kara <jack@suse.cz>
---
 fs/aio.c |   15 +++++++++++++++
 1 files changed, 15 insertions(+), 0 deletions(-)

diff --git a/fs/aio.c b/fs/aio.c
index b4dd668..0244c04 100644
--- a/fs/aio.c
+++ b/fs/aio.c
@@ -1642,6 +1642,21 @@ static int io_submit_one(struct kioctx *ctx, struct iocb __user *user_iocb,
 		goto out_put_req;
 
 	spin_lock_irq(&ctx->ctx_lock);
+	/*
+	 * We could have raced with io_destroy() and are currently holding a
+	 * reference to ctx which should be destroyed. We cannot submit IO
+	 * since ctx gets freed as soon as io_submit() puts its reference.
+	 * The check here is reliable since io_destroy() sets ctx->dead before
+	 * waiting for outstanding IO. Thus if we don't see ctx->dead set here,
+	 * io_destroy() waits for our IO to finish.
+	 * The check is inside ctx->ctx_lock to avoid extra memory barrier
+	 * in this fast path...
+	 */
+	if (ctx->dead) {
+		spin_unlock_irq(&ctx->ctx_lock);
+		ret = -EINVAL;
+		goto out_put_req;
+	}
 	aio_run_iocb(req);
 	if (!list_empty(&ctx->run_list)) {
 		/* drain the run list */
-- 
1.7.1


^ permalink raw reply related	[flat|nested] 40+ messages in thread

* Re: [patch] fs: aio fix rcu lookup
  2011-01-19 17:37                         ` Nick Piggin
@ 2011-01-20 20:21                           ` Jan Kara
  -1 siblings, 0 replies; 40+ messages in thread
From: Jan Kara @ 2011-01-20 20:21 UTC (permalink / raw)
  To: Nick Piggin
  Cc: Jan Kara, Jeff Moyer, Andrew Morton, linux-fsdevel, linux-kernel

On Thu 20-01-11 04:37:55, Nick Piggin wrote:
> On Thu, Jan 20, 2011 at 3:50 AM, Jan Kara <jack@suse.cz> wrote:
> > On Thu 20-01-11 03:03:23, Nick Piggin wrote:
> >> On Thu, Jan 20, 2011 at 12:21 AM, Jan Kara <jack@suse.cz> wrote:
> >> >  Well, we are not required to cancel all the outstanding AIO because of the
> >> > API requirement, that's granted. But we must do it because of the way how
> >> > the code is written. Outstanding IO requests reference ioctx but they are
> >> > not counted in ctx->users but in ctx->reqs_active. So the code relies on
> >> > the fact that the reference held by the hash table protects ctx from being
> >> > freed and io_destroy() waits for requests before dropping the last
> >> > reference to ctx. But there's the second race I describe making it possible
> >> > for new IO to be created after io_destroy() has waited for all IO to
> >> > finish...
> >>
> >> Yes there is that race too I agree. I just didn't follow through the code far
> >> enough to see it was a problem -- I thought it was by design.
> >>
> >> I'd like to solve it without synchronize_rcu() though.
> >  Ah, OK. I don't find io_destroy() performance critical but I can
> 
> Probably not performance critical, but it could be a very
> large slowdown so somebody might complain.
> 
> > understand that you need not like synchronize_rcu() there. ;) Then it
> > should be possible to make IO requests count in ctx->users which would
> > solve the race as well. We'd just have to be prepared that request
> > completion might put the last reference to ioctx and free it but that
> > shouldn't be an issue. Do you like that solution better?
> 
> I think so, if it can be done without slowing things down
> and adding locks or atomics if possible.
  Actually, I found that freeing ioctx upon IO completion isn't
straightforward because freeing ioctx may need to sleep (it is destroying
work queue) and aio_complete() can be called from an interrupt context.
We could offload the sleeping work to the RCU callback (basically we'd have
to offload the whole __put_ioctx() to RCU callback) but I'm not convinced
it's worth it so I rather chose a bit more subtle approach for fixing the
race (see my patch).

								Honza
-- 
Jan Kara <jack@suse.cz>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [patch] fs: aio fix rcu lookup
@ 2011-01-20 20:21                           ` Jan Kara
  0 siblings, 0 replies; 40+ messages in thread
From: Jan Kara @ 2011-01-20 20:21 UTC (permalink / raw)
  To: Nick Piggin
  Cc: Jan Kara, Jeff Moyer, Andrew Morton, linux-fsdevel, linux-kernel

On Thu 20-01-11 04:37:55, Nick Piggin wrote:
> On Thu, Jan 20, 2011 at 3:50 AM, Jan Kara <jack@suse.cz> wrote:
> > On Thu 20-01-11 03:03:23, Nick Piggin wrote:
> >> On Thu, Jan 20, 2011 at 12:21 AM, Jan Kara <jack@suse.cz> wrote:
> >> >  Well, we are not required to cancel all the outstanding AIO because of the
> >> > API requirement, that's granted. But we must do it because of the way how
> >> > the code is written. Outstanding IO requests reference ioctx but they are
> >> > not counted in ctx->users but in ctx->reqs_active. So the code relies on
> >> > the fact that the reference held by the hash table protects ctx from being
> >> > freed and io_destroy() waits for requests before dropping the last
> >> > reference to ctx. But there's the second race I describe making it possible
> >> > for new IO to be created after io_destroy() has waited for all IO to
> >> > finish...
> >>
> >> Yes there is that race too I agree. I just didn't follow through the code far
> >> enough to see it was a problem -- I thought it was by design.
> >>
> >> I'd like to solve it without synchronize_rcu() though.
> >  Ah, OK. I don't find io_destroy() performance critical but I can
> 
> Probably not performance critical, but it could be a very
> large slowdown so somebody might complain.
> 
> > understand that you need not like synchronize_rcu() there. ;) Then it
> > should be possible to make IO requests count in ctx->users which would
> > solve the race as well. We'd just have to be prepared that request
> > completion might put the last reference to ioctx and free it but that
> > shouldn't be an issue. Do you like that solution better?
> 
> I think so, if it can be done without slowing things down
> and adding locks or atomics if possible.
  Actually, I found that freeing ioctx upon IO completion isn't
straightforward because freeing ioctx may need to sleep (it is destroying
work queue) and aio_complete() can be called from an interrupt context.
We could offload the sleeping work to the RCU callback (basically we'd have
to offload the whole __put_ioctx() to RCU callback) but I'm not convinced
it's worth it so I rather chose a bit more subtle approach for fixing the
race (see my patch).

								Honza
-- 
Jan Kara <jack@suse.cz>
SUSE Labs, CR
--
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [patch] fs: aio fix rcu lookup
  2011-01-20 20:16                                     ` Jan Kara
@ 2011-01-20 21:16                                       ` Jeff Moyer
  2011-02-01 16:24                                       ` Jan Kara
  1 sibling, 0 replies; 40+ messages in thread
From: Jeff Moyer @ 2011-01-20 21:16 UTC (permalink / raw)
  To: Jan Kara; +Cc: Nick Piggin, paulmck, Andrew Morton, linux-fsdevel, linux-kernel

Jan Kara <jack@suse.cz> writes:

> So in the end, I've realized I don't need synchronize_rcu() at all and
> in fact everything is OK even without call_rcu() if I base my fix on top
> of your patch.
>
> Attached is your patch with added comment I proposed and also a patch
> fixing the second race. Better?

[snip]

> From 6d5375d55b5d88e8ceda739052566e033be620c2 Mon Sep 17 00:00:00 2001
> From: Jan Kara <jack@suse.cz>
> Date: Wed, 19 Jan 2011 00:37:48 +0100
> Subject: [PATCH 2/2] fs: Fix race between io_destroy() and io_submit() in AIO
>
> A race can occur when io_submit() races with io_destroy():
>
>  CPU1						CPU2
> io_submit()
>   do_io_submit()
>     ...
>     ctx = lookup_ioctx(ctx_id);
> 						io_destroy()
>     Now do_io_submit() holds the last reference to ctx.
>     ...
>     queue new AIO
>     put_ioctx(ctx) - frees ctx with active AIOs

[snip]

> We solve this issue by checking whether ctx is being destroyed
> in AIO submission path after adding new AIO to ctx. Then we
> are guaranteed that either io_destroy() waits for new AIO or
> we see that ctx is being destroyed and bail out.
>
> Signed-off-by: Jan Kara <jack@suse.cz>
> ---
>  fs/aio.c |   15 +++++++++++++++
>  1 files changed, 15 insertions(+), 0 deletions(-)
>
> diff --git a/fs/aio.c b/fs/aio.c
> index b4dd668..0244c04 100644
> --- a/fs/aio.c
> +++ b/fs/aio.c
> @@ -1642,6 +1642,21 @@ static int io_submit_one(struct kioctx *ctx, struct iocb __user *user_iocb,
>  		goto out_put_req;
>  
>  	spin_lock_irq(&ctx->ctx_lock);
> +	/*
> +	 * We could have raced with io_destroy() and are currently holding a
> +	 * reference to ctx which should be destroyed. We cannot submit IO
> +	 * since ctx gets freed as soon as io_submit() puts its reference.
> +	 * The check here is reliable since io_destroy() sets ctx->dead before
> +	 * waiting for outstanding IO. Thus if we don't see ctx->dead set here,
> +	 * io_destroy() waits for our IO to finish.
> +	 * The check is inside ctx->ctx_lock to avoid extra memory barrier
> +	 * in this fast path...
> +	 */
> +	if (ctx->dead) {
> +		spin_unlock_irq(&ctx->ctx_lock);
> +		ret = -EINVAL;
> +		goto out_put_req;
> +	}
>  	aio_run_iocb(req);
>  	if (!list_empty(&ctx->run_list)) {
>  		/* drain the run list */

OK, that's clever.  Thanks for looking into this, Jan!

You can put my:

Reviewed-by: Jeff Moyer <jmoyer@redhat.com>

on both patches.

Cheers,
Jeff

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [patch] fs: aio fix rcu lookup
  2011-01-20 20:15                                       ` Eric Dumazet
@ 2011-01-21 21:22                                         ` Paul E. McKenney
  0 siblings, 0 replies; 40+ messages in thread
From: Paul E. McKenney @ 2011-01-21 21:22 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Nick Piggin, Jeff Moyer, Jan Kara, Andrew Morton, linux-fsdevel,
	linux-kernel

On Thu, Jan 20, 2011 at 09:15:55PM +0100, Eric Dumazet wrote:
> Le jeudi 20 janvier 2011 à 12:02 -0800, Paul E. McKenney a écrit :
> > On Fri, Jan 21, 2011 at 05:31:53AM +1100, Nick Piggin wrote:
> > > call_rcu() is the obvious alternative, yes.
> > > 
> > > Basically, once we give in to synchronize_rcu() we're basically giving
> > > up. That's certainly a very good tradeoff for something like filesystem
> > > unregistration or module unload, it buys big simplifications in real
> > > fastpaths. But I just don't think it should be taken lightly.
> > 
> > Makes sense to me!
> > 
> > BTW, on your earlier usage classification:
> > 
> > > I think synchronize_rcu should firstly not be used unless it gives a good
> > > simplification, or speedup in fastpath.
> > >
> > > When that is satified, then it is a question of exactly what kind of slow
> > > path it should be used in. I don't think it should be used in process-
> > > synchronous code (eg syscalls) except for error cases, resource
> > > exhaustion, management syscalls (like module unload).
> > 
> > I don't have any feedback either way on your guidance to where
> > synchronize_rcu() should be used, as I believe that it depends a lot
> > on the details of usage, and would vary from one part of the kernel
> > to another, and possibly also over time.
> > 
> 
> Sometime, a mixture of call_rcu() and synchronize_rcu() is used, to have
> a limit on pending callbacks (eating too much memory)
> 
> net/ipv4/fib_trie.c for example issues call_rcu() most of the time, but
> is able to trigger one synchronize_rcu() if more than XXX (128) pages of
> memory were queued in rcu queues.
> 
> For details, check commit c3059477fce2d956
> (ipv4: Use synchronize_rcu() during trie_rebalance())

Good point!

							Thanx, Paul

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [patch] fs: aio fix rcu lookup
  2011-01-20 20:16                                     ` Jan Kara
  2011-01-20 21:16                                       ` Jeff Moyer
@ 2011-02-01 16:24                                       ` Jan Kara
  1 sibling, 0 replies; 40+ messages in thread
From: Jan Kara @ 2011-02-01 16:24 UTC (permalink / raw)
  To: Nick Piggin
  Cc: paulmck, Jeff Moyer, Jan Kara, Andrew Morton, linux-fsdevel,
	linux-kernel

On Thu 20-01-11 21:16:02, Jan Kara wrote:
> On Fri 21-01-11 05:31:53, Nick Piggin wrote:
> > On Thu, Jan 20, 2011 at 3:03 PM, Paul E. McKenney
> > <paulmck@linux.vnet.ibm.com> wrote:
> > > On Thu, Jan 20, 2011 at 08:20:00AM +1100, Nick Piggin wrote:
> > >> On Thu, Jan 20, 2011 at 8:03 AM, Jeff Moyer <jmoyer@redhat.com> wrote:
> > >> >> I don't know exactly how all programs use io_destroy -- of the small
> > >> >> number that do, probably an even smaller number would care here. But I
> > >> >> don't think it simplifies things enough to use synchronize_rcu for it.
> > >> >
> > >> > Above it sounded like you didn't think AIO should be using RCU at all.
> > >>
> > >> synchronize_rcu of course, not RCU (typo).
> > >
> > > I think that Nick is suggesting that call_rcu() be used instead.
> > > Perhaps also very sparing use of synchronize_rcu_expedited(), which
> > > is faster than synchronize_rcu(), but which which uses more CPU time.
> > 
> > call_rcu() is the obvious alternative, yes.
> > 
> > Basically, once we give in to synchronize_rcu() we're basically giving
> > up. That's certainly a very good tradeoff for something like filesystem
> > unregistration or module unload, it buys big simplifications in real
> > fastpaths. But I just don't think it should be taken lightly.
> So in the end, I've realized I don't need synchronize_rcu() at all and
> in fact everything is OK even without call_rcu() if I base my fix on top
> of your patch.
> 
> Attached is your patch with added comment I proposed and also a patch
> fixing the second race. Better?
  Nick, any opinion on this? Should I push the patches upstream?

									Honza

> From 68857d7f2087edbbc5ee1d828f151ac46406f3be Mon Sep 17 00:00:00 2001
> From: Nick Piggin <npiggin@gmail.com>
> Date: Thu, 20 Jan 2011 20:08:52 +0100
> Subject: [PATCH 1/2] fs: Fix aio rcu ioctx lookup
> 
> aio-dio-invalidate-failure GPFs in aio_put_req from io_submit.
> 
> lookup_ioctx doesn't implement the rcu lookup pattern properly.  rcu_read_lock
> does not prevent refcount going to zero, so we might take a refcount on a zero
> count ioctx.
> 
> Fix the bug by atomically testing for zero refcount before incrementing.
> 
> [JK: Added comment into the code]
> 
> Signed-off-by: Nick Piggin <npiggin@kernel.dk>
> Signed-off-by: Jan Kara <jack@suse.cz>
> ---
>  fs/aio.c |   35 ++++++++++++++++++++++++-----------
>  1 files changed, 24 insertions(+), 11 deletions(-)
> 
> diff --git a/fs/aio.c b/fs/aio.c
> index fc557a3..b4dd668 100644
> --- a/fs/aio.c
> +++ b/fs/aio.c
> @@ -239,15 +239,23 @@ static void __put_ioctx(struct kioctx *ctx)
>  	call_rcu(&ctx->rcu_head, ctx_rcu_free);
>  }
>  
> -#define get_ioctx(kioctx) do {						\
> -	BUG_ON(atomic_read(&(kioctx)->users) <= 0);			\
> -	atomic_inc(&(kioctx)->users);					\
> -} while (0)
> -#define put_ioctx(kioctx) do {						\
> -	BUG_ON(atomic_read(&(kioctx)->users) <= 0);			\
> -	if (unlikely(atomic_dec_and_test(&(kioctx)->users))) 		\
> -		__put_ioctx(kioctx);					\
> -} while (0)
> +static inline void get_ioctx(struct kioctx *kioctx)
> +{
> +	BUG_ON(atomic_read(&kioctx->users) <= 0);
> +	atomic_inc(&kioctx->users);
> +}
> +
> +static inline int try_get_ioctx(struct kioctx *kioctx)
> +{
> +	return atomic_inc_not_zero(&kioctx->users);
> +}
> +
> +static inline void put_ioctx(struct kioctx *kioctx)
> +{
> +	BUG_ON(atomic_read(&kioctx->users) <= 0);
> +	if (unlikely(atomic_dec_and_test(&kioctx->users)))
> +		__put_ioctx(kioctx);
> +}
>  
>  /* ioctx_alloc
>   *	Allocates and initializes an ioctx.  Returns an ERR_PTR if it failed.
> @@ -601,8 +609,13 @@ static struct kioctx *lookup_ioctx(unsigned long ctx_id)
>  	rcu_read_lock();
>  
>  	hlist_for_each_entry_rcu(ctx, n, &mm->ioctx_list, list) {
> -		if (ctx->user_id == ctx_id && !ctx->dead) {
> -			get_ioctx(ctx);
> +		/*
> +		 * RCU protects us against accessing freed memory but
> +		 * we have to be careful not to get a reference when the
> +		 * reference count already dropped to 0 (ctx->dead test
> +		 * is unreliable because of races).
> +		 */
> +		if (ctx->user_id == ctx_id && !ctx->dead && try_get_ioctx(ctx)){
>  			ret = ctx;
>  			break;
>  		}
> -- 
> 1.7.1
> 

> From 6d5375d55b5d88e8ceda739052566e033be620c2 Mon Sep 17 00:00:00 2001
> From: Jan Kara <jack@suse.cz>
> Date: Wed, 19 Jan 2011 00:37:48 +0100
> Subject: [PATCH 2/2] fs: Fix race between io_destroy() and io_submit() in AIO
> 
> A race can occur when io_submit() races with io_destroy():
> 
>  CPU1						CPU2
> io_submit()
>   do_io_submit()
>     ...
>     ctx = lookup_ioctx(ctx_id);
> 						io_destroy()
>     Now do_io_submit() holds the last reference to ctx.
>     ...
>     queue new AIO
>     put_ioctx(ctx) - frees ctx with active AIOs
> 
> We solve this issue by checking whether ctx is being destroyed
> in AIO submission path after adding new AIO to ctx. Then we
> are guaranteed that either io_destroy() waits for new AIO or
> we see that ctx is being destroyed and bail out.
> 
> Signed-off-by: Jan Kara <jack@suse.cz>
> ---
>  fs/aio.c |   15 +++++++++++++++
>  1 files changed, 15 insertions(+), 0 deletions(-)
> 
> diff --git a/fs/aio.c b/fs/aio.c
> index b4dd668..0244c04 100644
> --- a/fs/aio.c
> +++ b/fs/aio.c
> @@ -1642,6 +1642,21 @@ static int io_submit_one(struct kioctx *ctx, struct iocb __user *user_iocb,
>  		goto out_put_req;
>  
>  	spin_lock_irq(&ctx->ctx_lock);
> +	/*
> +	 * We could have raced with io_destroy() and are currently holding a
> +	 * reference to ctx which should be destroyed. We cannot submit IO
> +	 * since ctx gets freed as soon as io_submit() puts its reference.
> +	 * The check here is reliable since io_destroy() sets ctx->dead before
> +	 * waiting for outstanding IO. Thus if we don't see ctx->dead set here,
> +	 * io_destroy() waits for our IO to finish.
> +	 * The check is inside ctx->ctx_lock to avoid extra memory barrier
> +	 * in this fast path...
> +	 */
> +	if (ctx->dead) {
> +		spin_unlock_irq(&ctx->ctx_lock);
> +		ret = -EINVAL;
> +		goto out_put_req;
> +	}
>  	aio_run_iocb(req);
>  	if (!list_empty(&ctx->run_list)) {
>  		/* drain the run list */
> -- 
> 1.7.1
> 

-- 
Jan Kara <jack@suse.cz>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 40+ messages in thread

end of thread, other threads:[~2011-02-01 16:24 UTC | newest]

Thread overview: 40+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2011-01-14  1:35 [patch] fs: aio fix rcu lookup Nick Piggin
2011-01-14 14:52 ` Jeff Moyer
2011-01-14 15:00   ` Nick Piggin
2011-01-14 15:00     ` Nick Piggin
2011-01-17 19:07     ` Jeff Moyer
2011-01-17 23:24       ` Nick Piggin
2011-01-18 17:21         ` Jeff Moyer
2011-01-18 17:21           ` Jeff Moyer
2011-01-18 19:01         ` Jan Kara
2011-01-18 22:17           ` Nick Piggin
2011-01-18 23:00             ` Jeff Moyer
2011-01-18 23:05               ` Nick Piggin
2011-01-18 23:05                 ` Nick Piggin
2011-01-18 23:52             ` Jan Kara
2011-01-18 23:52               ` Jan Kara
2011-01-19  0:20               ` Nick Piggin
2011-01-19 13:21                 ` Jan Kara
2011-01-19 16:03                   ` Nick Piggin
2011-01-19 16:50                     ` Jan Kara
2011-01-19 17:37                       ` Nick Piggin
2011-01-19 17:37                         ` Nick Piggin
2011-01-20 20:21                         ` Jan Kara
2011-01-20 20:21                           ` Jan Kara
2011-01-19 19:13                   ` Jeff Moyer
2011-01-19 19:46                     ` Jeff Moyer
2011-01-19 20:18                       ` Nick Piggin
2011-01-19 20:32                         ` Jeff Moyer
2011-01-19 20:45                           ` Nick Piggin
2011-01-19 21:03                             ` Jeff Moyer
2011-01-19 21:03                               ` Jeff Moyer
2011-01-19 21:20                               ` Nick Piggin
2011-01-19 21:20                                 ` Nick Piggin
2011-01-20  4:03                                 ` Paul E. McKenney
2011-01-20 18:31                                   ` Nick Piggin
2011-01-20 20:02                                     ` Paul E. McKenney
2011-01-20 20:15                                       ` Eric Dumazet
2011-01-21 21:22                                         ` Paul E. McKenney
2011-01-20 20:16                                     ` Jan Kara
2011-01-20 21:16                                       ` Jeff Moyer
2011-02-01 16:24                                       ` Jan Kara

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.