All of lore.kernel.org
 help / color / mirror / Atom feed
* [PATCH 1/2] aio: create aio_ring_insert_entry() function
@ 2007-02-19 21:38 Zach Brown
  2007-02-19 21:38 ` [PATCH 2/2] aio: propogate post-EIOCBQUEUED errors to completion event Zach Brown
  2007-02-19 22:41 ` [PATCH 1/2] aio: create aio_ring_insert_entry() function Jeff Moyer
  0 siblings, 2 replies; 18+ messages in thread
From: Zach Brown @ 2007-02-19 21:38 UTC (permalink / raw)
  To: linux-aio, linux-kernel
  Cc: Benjamin LaHaise, Suparna bhattacharya, Andrew Morton, Leonid Ananiev

aio: create aio_ring_insert_entry() function

This patch just hoists the process of inserting an entry into the completion
ring into its own function.  No functionality is changed.  This is in
preparation for calling ring insertion from another path with slightly
different mechanics.

diff -r b551e6cbf434 fs/aio.c
Signed-off-by: Zach Brown <zach.brown@oracle.com>
---

 fs/aio.c |   77 +++++++++++++++++++++++++++++------------------------
 1 file changed, 43 insertions(+), 34 deletions(-)

--- a/fs/aio.c	Mon Feb 19 13:09:01 2007 -0800
+++ b/fs/aio.c	Mon Feb 19 13:12:04 2007 -0800
@@ -192,6 +192,47 @@ static int aio_setup_ring(struct kioctx 
 	(void)__event;				\
 	kunmap_atomic((void *)((unsigned long)__event & PAGE_MASK), km); \
 } while(0)
+
+static void aio_ring_insert_entry(struct kioctx *ctx, struct kiocb *iocb,
+				  long res, long res2)
+{
+	struct aio_ring_info	*info;
+	struct aio_ring		*ring;
+	struct io_event		*event;
+	unsigned long		tail;	
+
+	assert_spin_locked(&ctx->ctx_lock);
+
+	info = &ctx->ring_info;
+	ring = kmap_atomic(info->ring_pages[0], KM_IRQ1);
+
+	tail = info->tail;
+	event = aio_ring_event(info, tail, KM_IRQ0);
+	if (++tail >= info->nr)
+		tail = 0;
+
+	event->obj = (u64)(unsigned long)iocb->ki_obj.user;
+	event->data = iocb->ki_user_data;
+	event->res = res;
+	event->res2 = res2;
+
+	dprintk("aio_complete: %p[%lu]: %p: %p %Lx %lx %lx\n",
+		ctx, tail, iocb, iocb->ki_obj.user, iocb->ki_user_data,
+		res, res2);
+
+	/* after flagging the request as done, we
+	 * must never even look at it again
+	 */
+	smp_wmb();	/* make event visible before updating tail */
+
+	info->tail = tail;
+	ring->tail = tail;
+
+	put_aio_ring_event(event, KM_IRQ0);
+	kunmap_atomic(ring, KM_IRQ1);
+
+	pr_debug("added to ring %p at [%lu]\n", iocb, tail);
+}
 
 /* ioctx_alloc
  *	Allocates and initializes an ioctx.  Returns an ERR_PTR if it failed.
@@ -924,11 +965,7 @@ int fastcall aio_complete(struct kiocb *
 int fastcall aio_complete(struct kiocb *iocb, long res, long res2)
 {
 	struct kioctx	*ctx = iocb->ki_ctx;
-	struct aio_ring_info	*info;
-	struct aio_ring	*ring;
-	struct io_event	*event;
 	unsigned long	flags;
-	unsigned long	tail;
 	int		ret;
 
 	/*
@@ -946,8 +983,6 @@ int fastcall aio_complete(struct kiocb *
 		return 1;
 	}
 
-	info = &ctx->ring_info;
-
 	/* add a completion event to the ring buffer.
 	 * must be done holding ctx->ctx_lock to prevent
 	 * other code from messing with the tail
@@ -966,34 +1001,8 @@ int fastcall aio_complete(struct kiocb *
 	if (kiocbIsCancelled(iocb))
 		goto put_rq;
 
-	ring = kmap_atomic(info->ring_pages[0], KM_IRQ1);
-
-	tail = info->tail;
-	event = aio_ring_event(info, tail, KM_IRQ0);
-	if (++tail >= info->nr)
-		tail = 0;
-
-	event->obj = (u64)(unsigned long)iocb->ki_obj.user;
-	event->data = iocb->ki_user_data;
-	event->res = res;
-	event->res2 = res2;
-
-	dprintk("aio_complete: %p[%lu]: %p: %p %Lx %lx %lx\n",
-		ctx, tail, iocb, iocb->ki_obj.user, iocb->ki_user_data,
-		res, res2);
-
-	/* after flagging the request as done, we
-	 * must never even look at it again
-	 */
-	smp_wmb();	/* make event visible before updating tail */
-
-	info->tail = tail;
-	ring->tail = tail;
-
-	put_aio_ring_event(event, KM_IRQ0);
-	kunmap_atomic(ring, KM_IRQ1);
-
-	pr_debug("added to ring %p at [%lu]\n", iocb, tail);
+	aio_ring_insert_entry(ctx, iocb, res, res2);
+
 put_rq:
 	/* everything turned out well, dispose of the aiocb. */
 	ret = __aio_put_req(ctx, iocb);

^ permalink raw reply	[flat|nested] 18+ messages in thread

* [PATCH 2/2] aio: propogate post-EIOCBQUEUED errors to completion event
  2007-02-19 21:38 [PATCH 1/2] aio: create aio_ring_insert_entry() function Zach Brown
@ 2007-02-19 21:38 ` Zach Brown
  2007-02-19 22:48   ` Jeff Moyer
                     ` (3 more replies)
  2007-02-19 22:41 ` [PATCH 1/2] aio: create aio_ring_insert_entry() function Jeff Moyer
  1 sibling, 4 replies; 18+ messages in thread
From: Zach Brown @ 2007-02-19 21:38 UTC (permalink / raw)
  To: linux-aio, linux-kernel
  Cc: Benjamin LaHaise, Suparna bhattacharya, Andrew Morton, Leonid Ananiev

aio: propogate post-EIOCBQUEUED errors to completion event

This addresses an oops reported by Leonid Ananiev <leonid.i.ananiev@intel.com>
as archived at http://lkml.org/lkml/2007/2/8/337.

O_DIRECT kicks off bios and returns -EIOCBQUEUED to indicate its intention to
call aio_complete() once the bios complete.   As we return from submission we
must preserve the -EIOCBQUEUED return code so that fs/aio.c knows to let the
bio completion call aio_complete().  This stops us from returning errors after
O_DIRECT submission.

But we have a few places that are very interested in generating errors after
bio submission.

The most critical of these is invalidating the page cache after a write.  This
avoids exposing stale data to buffered operations that are performed after the
O_DIRECT write succeeds.  We must do this after submission because the user
buffer might have been an mmap()ed buffer of the region being written to.  The
get_user_pages() in the O_DIRECT completion path could have faulted in stale
data.

So this patch introduces a helper, aio_propogate_error(), which queues
post-submission errors in the iocb so that they are given to the user
completion event when aio_complete() is finally called.

To get this working we change the aio_complete() path so that the ring
insertion is performed as the final iocb reference is dropped.  This gives the
submission path time to queue its pending error before it drops its reference.
This increases the space in the iocb as it has to record the two result codes
from aio_complete() and the pending error from the submission path.

This was tested by running O_DIRECT aio-stress concurrently with buffered reads
while triggering EIO in invalidate_inode_pages2_range() with the help of a
debugfs bool hack.  Previously the kernel would oops as fs/aio.c and bio
completion both called aio_complete().  With this patch aio-stress sees -EIO.

Signed-off-by: Zach Brown <zach.brown@oracle.com>
---

 fs/aio.c            |   49 +++++++++++++++++++++---------------------
 include/linux/aio.h |   30 +++++++++++++++++++++++++
 mm/filemap.c        |    4 +--
 3 files changed, 57 insertions(+), 26 deletions(-)

--- a/fs/aio.c	Mon Feb 19 13:12:20 2007 -0800
+++ b/fs/aio.c	Mon Feb 19 13:16:00 2007 -0800
@@ -193,8 +193,7 @@ static int aio_setup_ring(struct kioctx 
 	kunmap_atomic((void *)((unsigned long)__event & PAGE_MASK), km); \
 } while(0)
 
-static void aio_ring_insert_entry(struct kioctx *ctx, struct kiocb *iocb,
-				  long res, long res2)
+static void aio_ring_insert_entry(struct kioctx *ctx, struct kiocb *iocb)
 {
 	struct aio_ring_info	*info;
 	struct aio_ring		*ring;
@@ -213,12 +212,12 @@ static void aio_ring_insert_entry(struct
 
 	event->obj = (u64)(unsigned long)iocb->ki_obj.user;
 	event->data = iocb->ki_user_data;
-	event->res = res;
-	event->res2 = res2;
-
-	dprintk("aio_complete: %p[%lu]: %p: %p %Lx %lx %lx\n",
+	event->res = iocb->ki_pending_err ? iocb->ki_pending_err : iocb->ki_res;
+	event->res2 = iocb->ki_res2;
+
+	dprintk("aio_complete: %p[%lu]: %p: %p %Lx %d %lx %lx\n",
 		ctx, tail, iocb, iocb->ki_obj.user, iocb->ki_user_data,
-		res, res2);
+		iocb->ki_pending_err, iocb->ki_res, iocb->ki_res2);
 
 	/* after flagging the request as done, we
 	 * must never even look at it again
@@ -459,6 +458,7 @@ static struct kiocb fastcall *__aio_get_
 	req->ki_cancel = NULL;
 	req->ki_retry = NULL;
 	req->ki_dtor = NULL;
+	req->ki_pending_err = 0;
 	req->private = NULL;
 	req->ki_iovec = NULL;
 	INIT_LIST_HEAD(&req->ki_run_list);
@@ -548,10 +548,14 @@ static int __aio_put_req(struct kioctx *
 
 	assert_spin_locked(&ctx->ctx_lock);
 
-	req->ki_users --;
+	req->ki_users--;
 	BUG_ON(req->ki_users < 0);
 	if (likely(req->ki_users))
 		return 0;
+
+	if (kiocbIsInserted(req))
+		aio_ring_insert_entry(ctx, req);
+
 	list_del(&req->ki_list);		/* remove from active_reqs */
 	req->ki_cancel = NULL;
 	req->ki_retry = NULL;
@@ -983,27 +987,24 @@ int fastcall aio_complete(struct kiocb *
 		return 1;
 	}
 
-	/* add a completion event to the ring buffer.
-	 * must be done holding ctx->ctx_lock to prevent
-	 * other code from messing with the tail
-	 * pointer since we might be called from irq
-	 * context.
-	 */
+	/*
+	 * We queue up the completion codes into the iocb.  They are combined
+	 * with a potential error from the submission path and inserted into
+	 * the ring once the last reference to the iocb is dropped.  Cancelled
+	 * iocbs don't insert events on completion because userland was given
+	 * an event directly as part of the cancelation interface.
+ 	 */
 	spin_lock_irqsave(&ctx->ctx_lock, flags);
 
 	if (iocb->ki_run_list.prev && !list_empty(&iocb->ki_run_list))
 		list_del_init(&iocb->ki_run_list);
 
-	/*
-	 * cancelled requests don't get events, userland was given one
-	 * when the event got cancelled.
-	 */
-	if (kiocbIsCancelled(iocb))
-		goto put_rq;
-
-	aio_ring_insert_entry(ctx, iocb, res, res2);
-
-put_rq:
+	if (!kiocbIsCancelled(iocb)) {
+		iocb->ki_res = res;
+		iocb->ki_res2 = res2;
+		kiocbSetInserted(iocb);
+	}
+
 	/* everything turned out well, dispose of the aiocb. */
 	ret = __aio_put_req(ctx, iocb);
 
diff -r 8a740eb579d4 include/linux/aio.h
--- a/include/linux/aio.h	Mon Feb 19 13:12:20 2007 -0800
+++ b/include/linux/aio.h	Mon Feb 19 13:16:00 2007 -0800
@@ -34,6 +34,7 @@ struct kioctx;
 /* #define KIF_LOCKED		0 */
 #define KIF_KICKED		1
 #define KIF_CANCELLED		2
+#define KIF_INSERTED		4
 
 #define kiocbTryLock(iocb)	test_and_set_bit(KIF_LOCKED, &(iocb)->ki_flags)
 #define kiocbTryKick(iocb)	test_and_set_bit(KIF_KICKED, &(iocb)->ki_flags)
@@ -41,6 +42,7 @@ struct kioctx;
 #define kiocbSetLocked(iocb)	set_bit(KIF_LOCKED, &(iocb)->ki_flags)
 #define kiocbSetKicked(iocb)	set_bit(KIF_KICKED, &(iocb)->ki_flags)
 #define kiocbSetCancelled(iocb)	set_bit(KIF_CANCELLED, &(iocb)->ki_flags)
+#define kiocbSetInserted(iocb)	set_bit(KIF_INSERTED, &(iocb)->ki_flags)
 
 #define kiocbClearLocked(iocb)	clear_bit(KIF_LOCKED, &(iocb)->ki_flags)
 #define kiocbClearKicked(iocb)	clear_bit(KIF_KICKED, &(iocb)->ki_flags)
@@ -49,6 +51,7 @@ struct kioctx;
 #define kiocbIsLocked(iocb)	test_bit(KIF_LOCKED, &(iocb)->ki_flags)
 #define kiocbIsKicked(iocb)	test_bit(KIF_KICKED, &(iocb)->ki_flags)
 #define kiocbIsCancelled(iocb)	test_bit(KIF_CANCELLED, &(iocb)->ki_flags)
+#define kiocbIsInserted(iocb)	test_bit(KIF_INSERTED, &(iocb)->ki_flags)
 
 /* is there a better place to document function pointer methods? */
 /**
@@ -119,6 +122,10 @@ struct kiocb {
 
 	struct list_head	ki_list;	/* the aio core uses this
 						 * for cancellation */
+	/* we store and combine return codes from submission and completion */ 
+	int			ki_pending_err;
+	long			ki_res;
+	long			ki_res2;
 };
 
 #define is_sync_kiocb(iocb)	((iocb)->ki_key == KIOCB_SYNC_KEY)
@@ -246,6 +253,29 @@ static inline struct kiocb *list_kiocb(s
 	return list_entry(h, struct kiocb, ki_list);
 }
 
+/*
+ * This function is used to make sure that an error is communicated to
+ * userspace on iocb completion without stopping -EIOCBQUEUED from bubbling up
+ * to fs/aio.c from the place where it originated.
+ *
+ * If we have an existing -EIOCBQUEUED it must be returned all the way to
+ * fs/aio.c so that it doesn't double-complete the iocb along with whoever
+ * returned -EIOCBQUEUED..  In that case we put the new error in the iocb.  It
+ * will be returned to userspace *intead of* the first result code given to
+ * aio_complete().  Use this only for errors which must overwrite whatever the
+ * return code might have been.  The first non-zero new_err given to this
+ * function for a given iocb will be returned to userspace.
+ */
+static inline int aio_propogate_error(struct kiocb *iocb, int existing_err,
+				      int new_err)
+{
+	if (existing_err != -EIOCBQUEUED)
+		return new_err;
+	if (!iocb->ki_pending_err)
+		iocb->ki_pending_err = new_err;
+	return -EIOCBQUEUED;
+}
+
 /* for sysctl: */
 extern unsigned long aio_nr;
 extern unsigned long aio_max_nr;
diff -r 8a740eb579d4 mm/filemap.c
--- a/mm/filemap.c	Mon Feb 19 13:12:20 2007 -0800
+++ b/mm/filemap.c	Mon Feb 19 13:16:00 2007 -0800
@@ -2031,7 +2031,7 @@ generic_file_direct_write(struct kiocb *
 	    ((file->f_flags & O_SYNC) || IS_SYNC(inode))) {
 		int err = generic_osync_inode(inode, mapping, OSYNC_METADATA);
 		if (err < 0)
-			written = err;
+			written = aio_propogate_error(iocb, written, err);
 	}
 	return written;
 }
@@ -2396,7 +2396,7 @@ generic_file_direct_IO(int rw, struct ki
 			int err = invalidate_inode_pages2_range(mapping,
 					offset >> PAGE_CACHE_SHIFT, end);
 			if (err)
-				retval = err;
+				retval = aio_propogate_error(iocb, retval, err);
 		}
 	}
 	return retval;

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH 1/2] aio: create aio_ring_insert_entry() function
  2007-02-19 21:38 [PATCH 1/2] aio: create aio_ring_insert_entry() function Zach Brown
  2007-02-19 21:38 ` [PATCH 2/2] aio: propogate post-EIOCBQUEUED errors to completion event Zach Brown
@ 2007-02-19 22:41 ` Jeff Moyer
  1 sibling, 0 replies; 18+ messages in thread
From: Jeff Moyer @ 2007-02-19 22:41 UTC (permalink / raw)
  To: Zach Brown
  Cc: linux-aio, linux-kernel, Benjamin LaHaise, Suparna bhattacharya,
	Andrew Morton, Leonid Ananiev

==> On Mon, 19 Feb 2007 13:38:30 -0800 (PST), Zach Brown <zach.brown@oracle.com> said:

Zach> This patch just hoists the process of inserting an entry into the completion
Zach> ring into its own function.  No functionality is changed.  This is in
Zach> preparation for calling ring insertion from another path with slightly
Zach> different mechanics.

Looks harmless.

Acked-by: Jeff Moyer <jmoyer@redhat.com>

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH 2/2] aio: propogate post-EIOCBQUEUED errors to completion event
  2007-02-19 21:38 ` [PATCH 2/2] aio: propogate post-EIOCBQUEUED errors to completion event Zach Brown
@ 2007-02-19 22:48   ` Jeff Moyer
  2007-02-19 22:57     ` Zach Brown
  2007-02-20 16:57   ` Ananiev, Leonid I
                     ` (2 subsequent siblings)
  3 siblings, 1 reply; 18+ messages in thread
From: Jeff Moyer @ 2007-02-19 22:48 UTC (permalink / raw)
  To: Zach Brown
  Cc: linux-aio, linux-kernel, Benjamin LaHaise, Suparna bhattacharya,
	Andrew Morton, Leonid Ananiev

==> On Mon, 19 Feb 2007 13:38:35 -0800 (PST), Zach Brown <zach.brown@oracle.com> said:

Zach> This addresses an oops reported by Leonid Ananiev <leonid.i.ananiev@intel.com>
Zach> as archived at http://lkml.org/lkml/2007/2/8/337.
[snip]
Zach> So this patch introduces a helper, aio_propogate_error(), 

...which is spelled incorrectly:  aio_propagate_error.

Zach> which queues post-submission errors in the iocb so that they are
Zach> given to the user completion event when aio_complete() is
Zach> finally called.

Ugly, but I can't think of a better way to do it, either.

> +
> +	if (kiocbIsInserted(req))
> +		aio_ring_insert_entry(ctx, req);
> +

This confused me at first reading.  If it's inserted, then insert it.
Perhaps KIF_COMPLETED would be a better name?

> +/*
> + * This function is used to make sure that an error is communicated to
> + * userspace on iocb completion without stopping -EIOCBQUEUED from bubbling up
> + * to fs/aio.c from the place where it originated.
> + *
> + * If we have an existing -EIOCBQUEUED it must be returned all the way to
> + * fs/aio.c so that it doesn't double-complete the iocb along with whoever
> + * returned -EIOCBQUEUED..  In that case we put the new error in the iocb.  It
> + * will be returned to userspace *intead of* the first result code given to
> + * aio_complete().  Use this only for errors which must overwrite whatever the
> + * return code might have been.  The first non-zero new_err given to this
> + * function for a given iocb will be returned to userspace.
> + */
> +static inline int aio_propogate_error(struct kiocb *iocb, int existing_err,
> +				      int new_err)
> +{
> +	if (existing_err != -EIOCBQUEUED)
> +		return new_err;
> +	if (!iocb->ki_pending_err)
> +		iocb->ki_pending_err = new_err;
> +	return -EIOCBQUEUED;
> +}
> +

I think that we would ideally cancel the I/Os in flight since we are
going to throw away the results.  However, we don't exactly support
cancellation today.  So, I have no objection to this less optimal
solution.

-Jeff

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH 2/2] aio: propogate post-EIOCBQUEUED errors to completion event
  2007-02-19 22:48   ` Jeff Moyer
@ 2007-02-19 22:57     ` Zach Brown
  0 siblings, 0 replies; 18+ messages in thread
From: Zach Brown @ 2007-02-19 22:57 UTC (permalink / raw)
  To: Jeff Moyer
  Cc: linux-aio, linux-kernel, Benjamin LaHaise, Suparna bhattacharya,
	Andrew Morton, Leonid Ananiev


> Zach> So this patch introduces a helper, aio_propogate_error(),
>
> ...which is spelled incorrectly:  aio_propagate_error.

Man, I am batting 1000!  Randy also made fun of my 'intead'.

> Zach> which queues post-submission errors in the iocb so that they are
> Zach> given to the user completion event when aio_complete() is
> Zach> finally called.
>
> Ugly, but I can't think of a better way to do it, either.

Yeah, this seemed to be the lesser of the available evils.  We (Chris  
and I, while in california) considered introducing a primitive to  
have the submission path wait for aio_complete() to be called so that  
it could just return the error.  We also thought about turning EIOCB 
{RETRY,QUEUED} into bits on the iocb instead of return codes that we  
have to lovingly pass back up to fs/aio.c.

This seemed to be the least intrusive :/.

- z

^ permalink raw reply	[flat|nested] 18+ messages in thread

* RE: [PATCH 2/2] aio: propogate post-EIOCBQUEUED errors to completion event
  2007-02-19 21:38 ` [PATCH 2/2] aio: propogate post-EIOCBQUEUED errors to completion event Zach Brown
  2007-02-19 22:48   ` Jeff Moyer
@ 2007-02-20 16:57   ` Ananiev, Leonid I
  2007-02-20 17:03     ` Chris Mason
  2007-02-20 17:33   ` Ananiev, Leonid I
  2007-02-21 14:13   ` Suparna Bhattacharya
  3 siblings, 1 reply; 18+ messages in thread
From: Ananiev, Leonid I @ 2007-02-20 16:57 UTC (permalink / raw)
  To: Zach Brown, linux-aio, linux-kernel
  Cc: Benjamin LaHaise, Suparna bhattacharya, Andrew Morton

Zach> This addresses an oops reported by Leonid Ananiev
<leonid.i.ananiev@intel.com>
Zach> as archived at http://lkml.org/lkml/2007/2/8/337.
....
Zach> This was tested by running O_DIRECT aio-stress concurrently with
buffered reads

The oops was with aio-stress only running in the loop
WITHOUT buffered or mmaped IO which are patched and discussed now.
Actually 47% aio is finished with EIO after patch.

Leonnid

-----Original Message-----
From: Zach Brown [mailto:zach.brown@oracle.com] 
Sent: Tuesday, February 20, 2007 12:39 AM
To: linux-aio@kvack.org; linux-kernel@vger.kernel.org
Cc: Benjamin LaHaise; Suparna bhattacharya; Andrew Morton; Ananiev,
Leonid I
Subject: [PATCH 2/2] aio: propogate post-EIOCBQUEUED errors to
completion event

aio: propogate post-EIOCBQUEUED errors to completion event

This addresses an oops reported by Leonid Ananiev
<leonid.i.ananiev@intel.com>
as archived at http://lkml.org/lkml/2007/2/8/337.

O_DIRECT kicks off bios and returns -EIOCBQUEUED to indicate its
intention to
call aio_complete() once the bios complete.   As we return from
submission we
must preserve the -EIOCBQUEUED return code so that fs/aio.c knows to let
the
bio completion call aio_complete().  This stops us from returning errors
after
O_DIRECT submission.

But we have a few places that are very interested in generating errors
after
bio submission.

The most critical of these is invalidating the page cache after a write.
This
avoids exposing stale data to buffered operations that are performed
after the
O_DIRECT write succeeds.  We must do this after submission because the
user
buffer might have been an mmap()ed buffer of the region being written
to.  The
get_user_pages() in the O_DIRECT completion path could have faulted in
stale
data.

So this patch introduces a helper, aio_propogate_error(), which queues
post-submission errors in the iocb so that they are given to the user
completion event when aio_complete() is finally called.

To get this working we change the aio_complete() path so that the ring
insertion is performed as the final iocb reference is dropped.  This
gives the
submission path time to queue its pending error before it drops its
reference.
This increases the space in the iocb as it has to record the two result
codes
from aio_complete() and the pending error from the submission path.

This was tested by running O_DIRECT aio-stress concurrently with
buffered reads
while triggering EIO in invalidate_inode_pages2_range() with the help of
a
debugfs bool hack.  Previously the kernel would oops as fs/aio.c and bio
completion both called aio_complete().  With this patch aio-stress sees
-EIO.

Signed-off-by: Zach Brown <zach.brown@oracle.com>
---

 fs/aio.c            |   49 +++++++++++++++++++++---------------------
 include/linux/aio.h |   30 +++++++++++++++++++++++++
 mm/filemap.c        |    4 +--
 3 files changed, 57 insertions(+), 26 deletions(-)

--- a/fs/aio.c	Mon Feb 19 13:12:20 2007 -0800
+++ b/fs/aio.c	Mon Feb 19 13:16:00 2007 -0800
@@ -193,8 +193,7 @@ static int aio_setup_ring(struct kioctx 
 	kunmap_atomic((void *)((unsigned long)__event & PAGE_MASK), km);
\
 } while(0)
 
-static void aio_ring_insert_entry(struct kioctx *ctx, struct kiocb
*iocb,
-				  long res, long res2)
+static void aio_ring_insert_entry(struct kioctx *ctx, struct kiocb
*iocb)
 {
 	struct aio_ring_info	*info;
 	struct aio_ring		*ring;
@@ -213,12 +212,12 @@ static void aio_ring_insert_entry(struct
 
 	event->obj = (u64)(unsigned long)iocb->ki_obj.user;
 	event->data = iocb->ki_user_data;
-	event->res = res;
-	event->res2 = res2;
-
-	dprintk("aio_complete: %p[%lu]: %p: %p %Lx %lx %lx\n",
+	event->res = iocb->ki_pending_err ? iocb->ki_pending_err :
iocb->ki_res;
+	event->res2 = iocb->ki_res2;
+
+	dprintk("aio_complete: %p[%lu]: %p: %p %Lx %d %lx %lx\n",
 		ctx, tail, iocb, iocb->ki_obj.user, iocb->ki_user_data,
-		res, res2);
+		iocb->ki_pending_err, iocb->ki_res, iocb->ki_res2);
 
 	/* after flagging the request as done, we
 	 * must never even look at it again
@@ -459,6 +458,7 @@ static struct kiocb fastcall *__aio_get_
 	req->ki_cancel = NULL;
 	req->ki_retry = NULL;
 	req->ki_dtor = NULL;
+	req->ki_pending_err = 0;
 	req->private = NULL;
 	req->ki_iovec = NULL;
 	INIT_LIST_HEAD(&req->ki_run_list);
@@ -548,10 +548,14 @@ static int __aio_put_req(struct kioctx *
 
 	assert_spin_locked(&ctx->ctx_lock);
 
-	req->ki_users --;
+	req->ki_users--;
 	BUG_ON(req->ki_users < 0);
 	if (likely(req->ki_users))
 		return 0;
+
+	if (kiocbIsInserted(req))
+		aio_ring_insert_entry(ctx, req);
+
 	list_del(&req->ki_list);		/* remove from
active_reqs */
 	req->ki_cancel = NULL;
 	req->ki_retry = NULL;
@@ -983,27 +987,24 @@ int fastcall aio_complete(struct kiocb *
 		return 1;
 	}
 
-	/* add a completion event to the ring buffer.
-	 * must be done holding ctx->ctx_lock to prevent
-	 * other code from messing with the tail
-	 * pointer since we might be called from irq
-	 * context.
-	 */
+	/*
+	 * We queue up the completion codes into the iocb.  They are
combined
+	 * with a potential error from the submission path and inserted
into
+	 * the ring once the last reference to the iocb is dropped.
Cancelled
+	 * iocbs don't insert events on completion because userland was
given
+	 * an event directly as part of the cancelation interface.
+ 	 */
 	spin_lock_irqsave(&ctx->ctx_lock, flags);
 
 	if (iocb->ki_run_list.prev && !list_empty(&iocb->ki_run_list))
 		list_del_init(&iocb->ki_run_list);
 
-	/*
-	 * cancelled requests don't get events, userland was given one
-	 * when the event got cancelled.
-	 */
-	if (kiocbIsCancelled(iocb))
-		goto put_rq;
-
-	aio_ring_insert_entry(ctx, iocb, res, res2);
-
-put_rq:
+	if (!kiocbIsCancelled(iocb)) {
+		iocb->ki_res = res;
+		iocb->ki_res2 = res2;
+		kiocbSetInserted(iocb);
+	}
+
 	/* everything turned out well, dispose of the aiocb. */
 	ret = __aio_put_req(ctx, iocb);
 
diff -r 8a740eb579d4 include/linux/aio.h
--- a/include/linux/aio.h	Mon Feb 19 13:12:20 2007 -0800
+++ b/include/linux/aio.h	Mon Feb 19 13:16:00 2007 -0800
@@ -34,6 +34,7 @@ struct kioctx;
 /* #define KIF_LOCKED		0 */
 #define KIF_KICKED		1
 #define KIF_CANCELLED		2
+#define KIF_INSERTED		4
 
 #define kiocbTryLock(iocb)	test_and_set_bit(KIF_LOCKED,
&(iocb)->ki_flags)
 #define kiocbTryKick(iocb)	test_and_set_bit(KIF_KICKED,
&(iocb)->ki_flags)
@@ -41,6 +42,7 @@ struct kioctx;
 #define kiocbSetLocked(iocb)	set_bit(KIF_LOCKED, &(iocb)->ki_flags)
 #define kiocbSetKicked(iocb)	set_bit(KIF_KICKED, &(iocb)->ki_flags)
 #define kiocbSetCancelled(iocb)	set_bit(KIF_CANCELLED,
&(iocb)->ki_flags)
+#define kiocbSetInserted(iocb)	set_bit(KIF_INSERTED, &(iocb)->ki_flags)
 
 #define kiocbClearLocked(iocb)	clear_bit(KIF_LOCKED, &(iocb)->ki_flags)
 #define kiocbClearKicked(iocb)	clear_bit(KIF_KICKED, &(iocb)->ki_flags)
@@ -49,6 +51,7 @@ struct kioctx;
 #define kiocbIsLocked(iocb)	test_bit(KIF_LOCKED, &(iocb)->ki_flags)
 #define kiocbIsKicked(iocb)	test_bit(KIF_KICKED, &(iocb)->ki_flags)
 #define kiocbIsCancelled(iocb)	test_bit(KIF_CANCELLED,
&(iocb)->ki_flags)
+#define kiocbIsInserted(iocb)	test_bit(KIF_INSERTED,
&(iocb)->ki_flags)
 
 /* is there a better place to document function pointer methods? */
 /**
@@ -119,6 +122,10 @@ struct kiocb {
 
 	struct list_head	ki_list;	/* the aio core uses
this
 						 * for cancellation */
+	/* we store and combine return codes from submission and
completion */ 
+	int			ki_pending_err;
+	long			ki_res;
+	long			ki_res2;
 };
 
 #define is_sync_kiocb(iocb)	((iocb)->ki_key == KIOCB_SYNC_KEY)
@@ -246,6 +253,29 @@ static inline struct kiocb *list_kiocb(s
 	return list_entry(h, struct kiocb, ki_list);
 }
 
+/*
+ * This function is used to make sure that an error is communicated to
+ * userspace on iocb completion without stopping -EIOCBQUEUED from
bubbling up
+ * to fs/aio.c from the place where it originated.
+ *
+ * If we have an existing -EIOCBQUEUED it must be returned all the way
to
+ * fs/aio.c so that it doesn't double-complete the iocb along with
whoever
+ * returned -EIOCBQUEUED..  In that case we put the new error in the
iocb.  It
+ * will be returned to userspace *intead of* the first result code
given to
+ * aio_complete().  Use this only for errors which must overwrite
whatever the
+ * return code might have been.  The first non-zero new_err given to
this
+ * function for a given iocb will be returned to userspace.
+ */
+static inline int aio_propogate_error(struct kiocb *iocb, int
existing_err,
+				      int new_err)
+{
+	if (existing_err != -EIOCBQUEUED)
+		return new_err;
+	if (!iocb->ki_pending_err)
+		iocb->ki_pending_err = new_err;
+	return -EIOCBQUEUED;
+}
+
 /* for sysctl: */
 extern unsigned long aio_nr;
 extern unsigned long aio_max_nr;
diff -r 8a740eb579d4 mm/filemap.c
--- a/mm/filemap.c	Mon Feb 19 13:12:20 2007 -0800
+++ b/mm/filemap.c	Mon Feb 19 13:16:00 2007 -0800
@@ -2031,7 +2031,7 @@ generic_file_direct_write(struct kiocb *
 	    ((file->f_flags & O_SYNC) || IS_SYNC(inode))) {
 		int err = generic_osync_inode(inode, mapping,
OSYNC_METADATA);
 		if (err < 0)
-			written = err;
+			written = aio_propogate_error(iocb, written,
err);
 	}
 	return written;
 }
@@ -2396,7 +2396,7 @@ generic_file_direct_IO(int rw, struct ki
 			int err = invalidate_inode_pages2_range(mapping,
 					offset >> PAGE_CACHE_SHIFT,
end);
 			if (err)
-				retval = err;
+				retval = aio_propogate_error(iocb,
retval, err);
 		}
 	}
 	return retval;

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH 2/2] aio: propogate post-EIOCBQUEUED errors to completion event
  2007-02-20 16:57   ` Ananiev, Leonid I
@ 2007-02-20 17:03     ` Chris Mason
  2007-02-20 17:17       ` Ananiev, Leonid I
  0 siblings, 1 reply; 18+ messages in thread
From: Chris Mason @ 2007-02-20 17:03 UTC (permalink / raw)
  To: Ananiev, Leonid I
  Cc: Zach Brown, linux-aio, linux-kernel, Benjamin LaHaise,
	Suparna bhattacharya, Andrew Morton

On Tue, Feb 20, 2007 at 07:57:49PM +0300, Ananiev, Leonid I wrote:
> Zach> This addresses an oops reported by Leonid Ananiev
> <leonid.i.ananiev@intel.com>
> Zach> as archived at http://lkml.org/lkml/2007/2/8/337.
> ....
> Zach> This was tested by running O_DIRECT aio-stress concurrently with
> buffered reads
> 
> The oops was with aio-stress only running in the loop
> WITHOUT buffered or mmaped IO which are patched and discussed now.
> Actually 47% aio is finished with EIO after patch.

I looked through the thread and couldn't find the aio-stress command
line used for the whole test.  Could you please post it here?

If aio+dio is being used to extend the file, you'll get pages in the
page cache, which is hopefully why you're able to trigger the EIOs.

-chris

^ permalink raw reply	[flat|nested] 18+ messages in thread

* RE: [PATCH 2/2] aio: propogate post-EIOCBQUEUED errors to completion event
  2007-02-20 17:03     ` Chris Mason
@ 2007-02-20 17:17       ` Ananiev, Leonid I
  2007-02-20 17:54         ` Chris Mason
  0 siblings, 1 reply; 18+ messages in thread
From: Ananiev, Leonid I @ 2007-02-20 17:17 UTC (permalink / raw)
  To: Chris Mason
  Cc: Zach Brown, linux-aio, linux-kernel, Benjamin LaHaise,
	Suparna bhattacharya, Andrew Morton

aio-stress command lines used for test
1) mem=1G in kernel boot param if you have more
2) mk2fs for test_file
3) dd if=/dev/zero of=<test_file> bs=1M count=1200
4) aiostress -s 1200m -o 2 -i 1 -r 16k <test_file>

Leonid
-----Original Message-----
From: Chris Mason [mailto:chris.mason@oracle.com] 
Sent: Tuesday, February 20, 2007 8:04 PM
To: Ananiev, Leonid I
Cc: Zach Brown; linux-aio@kvack.org; linux-kernel@vger.kernel.org;
Benjamin LaHaise; Suparna bhattacharya; Andrew Morton
Subject: Re: [PATCH 2/2] aio: propogate post-EIOCBQUEUED errors to
completion event

On Tue, Feb 20, 2007 at 07:57:49PM +0300, Ananiev, Leonid I wrote:
> Zach> This addresses an oops reported by Leonid Ananiev
> <leonid.i.ananiev@intel.com>
> Zach> as archived at http://lkml.org/lkml/2007/2/8/337.
> ....
> Zach> This was tested by running O_DIRECT aio-stress concurrently with
> buffered reads
> 
> The oops was with aio-stress only running in the loop
> WITHOUT buffered or mmaped IO which are patched and discussed now.
> Actually 47% aio is finished with EIO after patch.

I looked through the thread and couldn't find the aio-stress command
line used for the whole test.  Could you please post it here?

If aio+dio is being used to extend the file, you'll get pages in the
page cache, which is hopefully why you're able to trigger the EIOs.

-chris

^ permalink raw reply	[flat|nested] 18+ messages in thread

* RE: [PATCH 2/2] aio: propogate post-EIOCBQUEUED errors to completion event
  2007-02-19 21:38 ` [PATCH 2/2] aio: propogate post-EIOCBQUEUED errors to completion event Zach Brown
  2007-02-19 22:48   ` Jeff Moyer
  2007-02-20 16:57   ` Ananiev, Leonid I
@ 2007-02-20 17:33   ` Ananiev, Leonid I
  2007-02-20 18:26     ` Zach Brown
  2007-02-21 14:13   ` Suparna Bhattacharya
  3 siblings, 1 reply; 18+ messages in thread
From: Ananiev, Leonid I @ 2007-02-20 17:33 UTC (permalink / raw)
  To: Zach Brown, linux-aio, linux-kernel
  Cc: Benjamin LaHaise, Suparna bhattacharya, Andrew Morton

There is change in the patch which is uncommented in preface.
Now aio_ring_insert_entry() is not called if req->ki_users>=1.
Before was called.
Could you comment it?

Leonid
-----Original Message-----
From: Zach Brown [mailto:zach.brown@oracle.com] 
Sent: Tuesday, February 20, 2007 12:39 AM
To: linux-aio@kvack.org; linux-kernel@vger.kernel.org
Cc: Benjamin LaHaise; Suparna bhattacharya; Andrew Morton; Ananiev,
Leonid I
Subject: [PATCH 2/2] aio: propogate post-EIOCBQUEUED errors to
completion event

aio: propogate post-EIOCBQUEUED errors to completion event

This addresses an oops reported by Leonid Ananiev
<leonid.i.ananiev@intel.com>
as archived at http://lkml.org/lkml/2007/2/8/337.

O_DIRECT kicks off bios and returns -EIOCBQUEUED to indicate its
intention to
call aio_complete() once the bios complete.   As we return from
submission we
must preserve the -EIOCBQUEUED return code so that fs/aio.c knows to let
the
bio completion call aio_complete().  This stops us from returning errors
after
O_DIRECT submission.

But we have a few places that are very interested in generating errors
after
bio submission.

The most critical of these is invalidating the page cache after a write.
This
avoids exposing stale data to buffered operations that are performed
after the
O_DIRECT write succeeds.  We must do this after submission because the
user
buffer might have been an mmap()ed buffer of the region being written
to.  The
get_user_pages() in the O_DIRECT completion path could have faulted in
stale
data.

So this patch introduces a helper, aio_propogate_error(), which queues
post-submission errors in the iocb so that they are given to the user
completion event when aio_complete() is finally called.

To get this working we change the aio_complete() path so that the ring
insertion is performed as the final iocb reference is dropped.  This
gives the
submission path time to queue its pending error before it drops its
reference.
This increases the space in the iocb as it has to record the two result
codes
from aio_complete() and the pending error from the submission path.

This was tested by running O_DIRECT aio-stress concurrently with
buffered reads
while triggering EIO in invalidate_inode_pages2_range() with the help of
a
debugfs bool hack.  Previously the kernel would oops as fs/aio.c and bio
completion both called aio_complete().  With this patch aio-stress sees
-EIO.

Signed-off-by: Zach Brown <zach.brown@oracle.com>
---

 fs/aio.c            |   49 +++++++++++++++++++++---------------------
 include/linux/aio.h |   30 +++++++++++++++++++++++++
 mm/filemap.c        |    4 +--
 3 files changed, 57 insertions(+), 26 deletions(-)

--- a/fs/aio.c	Mon Feb 19 13:12:20 2007 -0800
+++ b/fs/aio.c	Mon Feb 19 13:16:00 2007 -0800
@@ -193,8 +193,7 @@ static int aio_setup_ring(struct kioctx 
 	kunmap_atomic((void *)((unsigned long)__event & PAGE_MASK), km);
\
 } while(0)
 
-static void aio_ring_insert_entry(struct kioctx *ctx, struct kiocb
*iocb,
-				  long res, long res2)
+static void aio_ring_insert_entry(struct kioctx *ctx, struct kiocb
*iocb)
 {
 	struct aio_ring_info	*info;
 	struct aio_ring		*ring;
@@ -213,12 +212,12 @@ static void aio_ring_insert_entry(struct
 
 	event->obj = (u64)(unsigned long)iocb->ki_obj.user;
 	event->data = iocb->ki_user_data;
-	event->res = res;
-	event->res2 = res2;
-
-	dprintk("aio_complete: %p[%lu]: %p: %p %Lx %lx %lx\n",
+	event->res = iocb->ki_pending_err ? iocb->ki_pending_err :
iocb->ki_res;
+	event->res2 = iocb->ki_res2;
+
+	dprintk("aio_complete: %p[%lu]: %p: %p %Lx %d %lx %lx\n",
 		ctx, tail, iocb, iocb->ki_obj.user, iocb->ki_user_data,
-		res, res2);
+		iocb->ki_pending_err, iocb->ki_res, iocb->ki_res2);
 
 	/* after flagging the request as done, we
 	 * must never even look at it again
@@ -459,6 +458,7 @@ static struct kiocb fastcall *__aio_get_
 	req->ki_cancel = NULL;
 	req->ki_retry = NULL;
 	req->ki_dtor = NULL;
+	req->ki_pending_err = 0;
 	req->private = NULL;
 	req->ki_iovec = NULL;
 	INIT_LIST_HEAD(&req->ki_run_list);
@@ -548,10 +548,14 @@ static int __aio_put_req(struct kioctx *
 
 	assert_spin_locked(&ctx->ctx_lock);
 
-	req->ki_users --;
+	req->ki_users--;
 	BUG_ON(req->ki_users < 0);
 	if (likely(req->ki_users))
 		return 0;
+
+	if (kiocbIsInserted(req))
+		aio_ring_insert_entry(ctx, req);
+
 	list_del(&req->ki_list);		/* remove from
active_reqs */
 	req->ki_cancel = NULL;
 	req->ki_retry = NULL;
@@ -983,27 +987,24 @@ int fastcall aio_complete(struct kiocb *
 		return 1;
 	}
 
-	/* add a completion event to the ring buffer.
-	 * must be done holding ctx->ctx_lock to prevent
-	 * other code from messing with the tail
-	 * pointer since we might be called from irq
-	 * context.
-	 */
+	/*
+	 * We queue up the completion codes into the iocb.  They are
combined
+	 * with a potential error from the submission path and inserted
into
+	 * the ring once the last reference to the iocb is dropped.
Cancelled
+	 * iocbs don't insert events on completion because userland was
given
+	 * an event directly as part of the cancelation interface.
+ 	 */
 	spin_lock_irqsave(&ctx->ctx_lock, flags);
 
 	if (iocb->ki_run_list.prev && !list_empty(&iocb->ki_run_list))
 		list_del_init(&iocb->ki_run_list);
 
-	/*
-	 * cancelled requests don't get events, userland was given one
-	 * when the event got cancelled.
-	 */
-	if (kiocbIsCancelled(iocb))
-		goto put_rq;
-
-	aio_ring_insert_entry(ctx, iocb, res, res2);
-
-put_rq:
+	if (!kiocbIsCancelled(iocb)) {
+		iocb->ki_res = res;
+		iocb->ki_res2 = res2;
+		kiocbSetInserted(iocb);
+	}
+
 	/* everything turned out well, dispose of the aiocb. */
 	ret = __aio_put_req(ctx, iocb);
 
diff -r 8a740eb579d4 include/linux/aio.h
--- a/include/linux/aio.h	Mon Feb 19 13:12:20 2007 -0800
+++ b/include/linux/aio.h	Mon Feb 19 13:16:00 2007 -0800
@@ -34,6 +34,7 @@ struct kioctx;
 /* #define KIF_LOCKED		0 */
 #define KIF_KICKED		1
 #define KIF_CANCELLED		2
+#define KIF_INSERTED		4
 
 #define kiocbTryLock(iocb)	test_and_set_bit(KIF_LOCKED,
&(iocb)->ki_flags)
 #define kiocbTryKick(iocb)	test_and_set_bit(KIF_KICKED,
&(iocb)->ki_flags)
@@ -41,6 +42,7 @@ struct kioctx;
 #define kiocbSetLocked(iocb)	set_bit(KIF_LOCKED, &(iocb)->ki_flags)
 #define kiocbSetKicked(iocb)	set_bit(KIF_KICKED, &(iocb)->ki_flags)
 #define kiocbSetCancelled(iocb)	set_bit(KIF_CANCELLED,
&(iocb)->ki_flags)
+#define kiocbSetInserted(iocb)	set_bit(KIF_INSERTED, &(iocb)->ki_flags)
 
 #define kiocbClearLocked(iocb)	clear_bit(KIF_LOCKED, &(iocb)->ki_flags)
 #define kiocbClearKicked(iocb)	clear_bit(KIF_KICKED, &(iocb)->ki_flags)
@@ -49,6 +51,7 @@ struct kioctx;
 #define kiocbIsLocked(iocb)	test_bit(KIF_LOCKED, &(iocb)->ki_flags)
 #define kiocbIsKicked(iocb)	test_bit(KIF_KICKED, &(iocb)->ki_flags)
 #define kiocbIsCancelled(iocb)	test_bit(KIF_CANCELLED,
&(iocb)->ki_flags)
+#define kiocbIsInserted(iocb)	test_bit(KIF_INSERTED,
&(iocb)->ki_flags)
 
 /* is there a better place to document function pointer methods? */
 /**
@@ -119,6 +122,10 @@ struct kiocb {
 
 	struct list_head	ki_list;	/* the aio core uses
this
 						 * for cancellation */
+	/* we store and combine return codes from submission and
completion */ 
+	int			ki_pending_err;
+	long			ki_res;
+	long			ki_res2;
 };
 
 #define is_sync_kiocb(iocb)	((iocb)->ki_key == KIOCB_SYNC_KEY)
@@ -246,6 +253,29 @@ static inline struct kiocb *list_kiocb(s
 	return list_entry(h, struct kiocb, ki_list);
 }
 
+/*
+ * This function is used to make sure that an error is communicated to
+ * userspace on iocb completion without stopping -EIOCBQUEUED from
bubbling up
+ * to fs/aio.c from the place where it originated.
+ *
+ * If we have an existing -EIOCBQUEUED it must be returned all the way
to
+ * fs/aio.c so that it doesn't double-complete the iocb along with
whoever
+ * returned -EIOCBQUEUED..  In that case we put the new error in the
iocb.  It
+ * will be returned to userspace *intead of* the first result code
given to
+ * aio_complete().  Use this only for errors which must overwrite
whatever the
+ * return code might have been.  The first non-zero new_err given to
this
+ * function for a given iocb will be returned to userspace.
+ */
+static inline int aio_propogate_error(struct kiocb *iocb, int
existing_err,
+				      int new_err)
+{
+	if (existing_err != -EIOCBQUEUED)
+		return new_err;
+	if (!iocb->ki_pending_err)
+		iocb->ki_pending_err = new_err;
+	return -EIOCBQUEUED;
+}
+
 /* for sysctl: */
 extern unsigned long aio_nr;
 extern unsigned long aio_max_nr;
diff -r 8a740eb579d4 mm/filemap.c
--- a/mm/filemap.c	Mon Feb 19 13:12:20 2007 -0800
+++ b/mm/filemap.c	Mon Feb 19 13:16:00 2007 -0800
@@ -2031,7 +2031,7 @@ generic_file_direct_write(struct kiocb *
 	    ((file->f_flags & O_SYNC) || IS_SYNC(inode))) {
 		int err = generic_osync_inode(inode, mapping,
OSYNC_METADATA);
 		if (err < 0)
-			written = err;
+			written = aio_propogate_error(iocb, written,
err);
 	}
 	return written;
 }
@@ -2396,7 +2396,7 @@ generic_file_direct_IO(int rw, struct ki
 			int err = invalidate_inode_pages2_range(mapping,
 					offset >> PAGE_CACHE_SHIFT,
end);
 			if (err)
-				retval = err;
+				retval = aio_propogate_error(iocb,
retval, err);
 		}
 	}
 	return retval;

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH 2/2] aio: propogate post-EIOCBQUEUED errors to completion event
  2007-02-20 17:17       ` Ananiev, Leonid I
@ 2007-02-20 17:54         ` Chris Mason
  2007-02-20 21:09           ` Ananiev, Leonid I
  0 siblings, 1 reply; 18+ messages in thread
From: Chris Mason @ 2007-02-20 17:54 UTC (permalink / raw)
  To: Ananiev, Leonid I
  Cc: Zach Brown, linux-aio, linux-kernel, Benjamin LaHaise,
	Suparna bhattacharya, Andrew Morton

On Tue, Feb 20, 2007 at 08:17:51PM +0300, Ananiev, Leonid I wrote:
> aio-stress command lines used for test
> 1) mem=1G in kernel boot param if you have more
> 2) mk2fs for test_file
> 3) dd if=/dev/zero of=<test_file> bs=1M count=1200
> 4) aiostress -s 1200m -o 2 -i 1 -r 16k <test_file>
> 

Sorry, this aio-stress command line doesn't do O_DIRECT.  Do you have
this bundled into a test script that I can try?

-chris

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH 2/2] aio: propogate post-EIOCBQUEUED errors to completion event
  2007-02-20 17:33   ` Ananiev, Leonid I
@ 2007-02-20 18:26     ` Zach Brown
  0 siblings, 0 replies; 18+ messages in thread
From: Zach Brown @ 2007-02-20 18:26 UTC (permalink / raw)
  To: Ananiev, Leonid I
  Cc: linux-aio, linux-kernel, Benjamin LaHaise, Suparna bhattacharya,
	Andrew Morton


On Feb 20, 2007, at 9:33 AM, Ananiev, Leonid I wrote:

> There is change in the patch which is uncommented in preface.
> Now aio_ring_insert_entry() is not called if req->ki_users>=1.
> Before was called.
> Could you comment it?

It is described in the patch description, though perhaps not  
explicitly enough:

"To get this working we change the aio_complete() path so that the ring
insertion is performed as the final iocb reference is dropped."

This stops an aio_complete() from generating a completion event  
before the submission path has had a chance to generate an error.   
This could happen if O_DIRECT bios completed very quickly before the  
submission path had a chance to return up and call  
ivalidate_inode_pages2_range().

- z



^ permalink raw reply	[flat|nested] 18+ messages in thread

* RE: [PATCH 2/2] aio: propogate post-EIOCBQUEUED errors to completion event
  2007-02-20 17:54         ` Chris Mason
@ 2007-02-20 21:09           ` Ananiev, Leonid I
  2007-02-21  8:35             ` Ken Chen
  0 siblings, 1 reply; 18+ messages in thread
From: Ananiev, Leonid I @ 2007-02-20 21:09 UTC (permalink / raw)
  To: Chris Mason
  Cc: Zach Brown, linux-aio, linux-kernel, Benjamin LaHaise,
	Suparna bhattacharya, Andrew Morton

-O I've missed:
aiostress -s 1200m -O -o 2 -i 1 -r 16k <test_file>

You are right I've used harness scripts.
i=0; while ((i++<50)); do ~/bm/bin/runs I2 -; done &

It runs bmrun harness script which is long for different hardware
configurations and test options.
The lines in aio-stress wrapper which could be useful for you are:
        gcc aiostress.c -Wall -o aiostress -lpthread -laio
2>>${file_log}
...
        sudo umount -f ${pdisk} ${ppath} > /dev/null 2>&1
        sudo /sbin/mke2fs -j ${pdisk} > /dev/null 2>>${file_log}
        sudo mount ${pdisk} ${ppath} > /dev/null 2>>${file_log}...
        export LD_LIBRARY_PATH=/lib:/usr/lib:$LD_LIBRARY_PATH
        all_files_size=1200             # in megabytes
...
        else
                num_disks=1
                ((file_size = all_files_size / num_disks))
                testfiles=${ppath}/test/testfile
                dd if=/dev/zero of=$testfiles bs=1M count=${file_size} >
/dev/null
        fi
        aiostress ${*} -v -o 2 -s ${file_size}m -i 1 -r 16k $testfiles >
${file_res} 2>&1
        if [ "$1" == "-S" ] ; then
                info="O_SYNC"
        elif [ "$1" == "-O" ] ; then
                info="O_DIRECT"
        fi
        awk .... ${file_res}
....
Which does 
1) mem=1G in kernel boot param if you have more
2) unmount; mk2fs; mount
3) dd if=/dev/zero of=<test_file> bs=1M count=1200
4) aiostress -s 1200m -O -o 2 -i 1 -r 16k <test_file>
5) if i++<50 goto 2).

Leonid
-----Original Message-----
From: Chris Mason [mailto:chris.mason@oracle.com] 
Sent: Tuesday, February 20, 2007 8:55 PM
To: Ananiev, Leonid I
Cc: Zach Brown; linux-aio@kvack.org; linux-kernel@vger.kernel.org;
Benjamin LaHaise; Suparna bhattacharya; Andrew Morton
Subject: Re: [PATCH 2/2] aio: propogate post-EIOCBQUEUED errors to
completion event

On Tue, Feb 20, 2007 at 08:17:51PM +0300, Ananiev, Leonid I wrote:
> aio-stress command lines used for test
> 1) mem=1G in kernel boot param if you have more
> 2) mk2fs for test_file
> 3) dd if=/dev/zero of=<test_file> bs=1M count=1200
> 4) aiostress -s 1200m -o 2 -i 1 -r 16k <test_file>
> 

Sorry, this aio-stress command line doesn't do O_DIRECT.  Do you have
this bundled into a test script that I can try?

-chris

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH 2/2] aio: propogate post-EIOCBQUEUED errors to completion event
  2007-02-20 21:09           ` Ananiev, Leonid I
@ 2007-02-21  8:35             ` Ken Chen
  2007-02-21  8:44               ` Ananiev, Leonid I
  2007-02-21 18:24               ` Zach Brown
  0 siblings, 2 replies; 18+ messages in thread
From: Ken Chen @ 2007-02-21  8:35 UTC (permalink / raw)
  To: Ananiev, Leonid I
  Cc: Chris Mason, Zach Brown, linux-aio, linux-kernel,
	Benjamin LaHaise, Suparna bhattacharya, Andrew Morton

On 2/20/07, Ananiev, Leonid <leonid.i.ananiev@intel.com> wrote:
> 1) mem=1G in kernel boot param if you have more
> 2) unmount; mk2fs; mount
> 3) dd if=/dev/zero of=<test_file> bs=1M count=1200
> 4) aiostress -s 1200m -O -o 2 -i 1 -r 16k <test_file>
> 5) if i++<50 goto 2).

Would you please instrument the call chain of
invalidate_complete_page2() and tell us exactly where it returns zero
value in your failure case?

   invalidate_complete_page2
      try_to_release_page
         ext3_releasepage
            journal_try_to_free_buffers
               ???

^ permalink raw reply	[flat|nested] 18+ messages in thread

* RE: [PATCH 2/2] aio: propogate post-EIOCBQUEUED errors to completion event
  2007-02-21  8:35             ` Ken Chen
@ 2007-02-21  8:44               ` Ananiev, Leonid I
  2007-02-21 18:24               ` Zach Brown
  1 sibling, 0 replies; 18+ messages in thread
From: Ananiev, Leonid I @ 2007-02-21  8:44 UTC (permalink / raw)
  To: Ken Chen
  Cc: Chris Mason, Zach Brown, linux-aio, linux-kernel,
	Benjamin LaHaise, Suparna bhattacharya, Andrew Morton

> where it returns zero 
I've wrote in the mail http://lkml.org/lkml/2007/2/8/337
invalidate_inode_pages2_range() reports BUG:
warning at mm/truncate.c:398  occurs becouse of
invalidate_complete_page2()    returns 0; it returns 0 because of
try_to_release_page()          returns 0; it returns 0 because of
ext3_releasepage()             returns 0; it returns 0 because of
journal_try_to_free_buffers()  returns 0; it returns 0 because of
try_to_free_buffers()          returns 0; it returns 0 because of
drop_buffers()                 returns 0; it returns 0 because of
buffer_busy(bh)                returns 1; it returns 0 because of
buffer_head count is 1 (bh->b_count==1) as additional printk reports.

Leonid
-----Original Message-----
From: Ken Chen [mailto:kenchen@google.com] 
Sent: Wednesday, February 21, 2007 11:36 AM
To: Ananiev, Leonid I
Cc: Chris Mason; Zach Brown; linux-aio@kvack.org;
linux-kernel@vger.kernel.org; Benjamin LaHaise; Suparna bhattacharya;
Andrew Morton
Subject: Re: [PATCH 2/2] aio: propogate post-EIOCBQUEUED errors to
completion event

On 2/20/07, Ananiev, Leonid <leonid.i.ananiev@intel.com> wrote:
> 1) mem=1G in kernel boot param if you have more
> 2) unmount; mk2fs; mount
> 3) dd if=/dev/zero of=<test_file> bs=1M count=1200
> 4) aiostress -s 1200m -O -o 2 -i 1 -r 16k <test_file>
> 5) if i++<50 goto 2).

Would you please instrument the call chain of
invalidate_complete_page2() and tell us exactly where it returns zero
value in your failure case?

   invalidate_complete_page2
      try_to_release_page
         ext3_releasepage
            journal_try_to_free_buffers
               ???

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH 2/2] aio: propogate post-EIOCBQUEUED errors to completion event
  2007-02-19 21:38 ` [PATCH 2/2] aio: propogate post-EIOCBQUEUED errors to completion event Zach Brown
                     ` (2 preceding siblings ...)
  2007-02-20 17:33   ` Ananiev, Leonid I
@ 2007-02-21 14:13   ` Suparna Bhattacharya
  2007-02-21 18:34     ` Zach Brown
  3 siblings, 1 reply; 18+ messages in thread
From: Suparna Bhattacharya @ 2007-02-21 14:13 UTC (permalink / raw)
  To: Zach Brown
  Cc: linux-aio, linux-kernel, Benjamin LaHaise, Andrew Morton, Leonid Ananiev

On Mon, Feb 19, 2007 at 01:38:35PM -0800, Zach Brown wrote:
> aio: propogate post-EIOCBQUEUED errors to completion event
> 
> This addresses an oops reported by Leonid Ananiev <leonid.i.ananiev@intel.com>
> as archived at http://lkml.org/lkml/2007/2/8/337.
> 
> O_DIRECT kicks off bios and returns -EIOCBQUEUED to indicate its intention to
> call aio_complete() once the bios complete.   As we return from submission we
> must preserve the -EIOCBQUEUED return code so that fs/aio.c knows to let the
> bio completion call aio_complete().  This stops us from returning errors after
> O_DIRECT submission.
> 
> But we have a few places that are very interested in generating errors after
> bio submission.
> 
> The most critical of these is invalidating the page cache after a write.  This
> avoids exposing stale data to buffered operations that are performed after the
> O_DIRECT write succeeds.  We must do this after submission because the user
> buffer might have been an mmap()ed buffer of the region being written to.  The
> get_user_pages() in the O_DIRECT completion path could have faulted in stale
> data.
> 
> So this patch introduces a helper, aio_propogate_error(), which queues
> post-submission errors in the iocb so that they are given to the user
> completion event when aio_complete() is finally called.
> 
> To get this working we change the aio_complete() path so that the ring
> insertion is performed as the final iocb reference is dropped.  This gives the
> submission path time to queue its pending error before it drops its reference.
> This increases the space in the iocb as it has to record the two result codes
> from aio_complete() and the pending error from the submission path.

This is an interesting trick, but I'd like to consider hard whether the added
complexity is worth it. Could you list the various other cases you have in mind
which would want to use it ?

For the invalidate_inode_pages2_range() case, I wonder if the right
place to issue this is after the direct IO write has completed and before
aio_complete() is issued (somewhat like the way we do bio_check_pages_dirty
for DIO reads), rather than after submission when the IO may still not have
hit the disk. This would also make the behaviour uniform for synchronous and
async cases.

BTW, am I right in interpreting that with your change aio_complete() may
trigger an io_getevents() wakeup, before the corresponding event is placed
on the ring buffer ?

Regards
Suparna

> 
> This was tested by running O_DIRECT aio-stress concurrently with buffered reads
> while triggering EIO in invalidate_inode_pages2_range() with the help of a
> debugfs bool hack.  Previously the kernel would oops as fs/aio.c and bio
> completion both called aio_complete().  With this patch aio-stress sees -EIO.
> 
> Signed-off-by: Zach Brown <zach.brown@oracle.com>
> ---
> 
>  fs/aio.c            |   49 +++++++++++++++++++++---------------------
>  include/linux/aio.h |   30 +++++++++++++++++++++++++
>  mm/filemap.c        |    4 +--
>  3 files changed, 57 insertions(+), 26 deletions(-)
> 
> --- a/fs/aio.c	Mon Feb 19 13:12:20 2007 -0800
> +++ b/fs/aio.c	Mon Feb 19 13:16:00 2007 -0800
> @@ -193,8 +193,7 @@ static int aio_setup_ring(struct kioctx 
>  	kunmap_atomic((void *)((unsigned long)__event & PAGE_MASK), km); \
>  } while(0)
> 
> -static void aio_ring_insert_entry(struct kioctx *ctx, struct kiocb *iocb,
> -				  long res, long res2)
> +static void aio_ring_insert_entry(struct kioctx *ctx, struct kiocb *iocb)
>  {
>  	struct aio_ring_info	*info;
>  	struct aio_ring		*ring;
> @@ -213,12 +212,12 @@ static void aio_ring_insert_entry(struct
> 
>  	event->obj = (u64)(unsigned long)iocb->ki_obj.user;
>  	event->data = iocb->ki_user_data;
> -	event->res = res;
> -	event->res2 = res2;
> -
> -	dprintk("aio_complete: %p[%lu]: %p: %p %Lx %lx %lx\n",
> +	event->res = iocb->ki_pending_err ? iocb->ki_pending_err : iocb->ki_res;
> +	event->res2 = iocb->ki_res2;
> +
> +	dprintk("aio_complete: %p[%lu]: %p: %p %Lx %d %lx %lx\n",
>  		ctx, tail, iocb, iocb->ki_obj.user, iocb->ki_user_data,
> -		res, res2);
> +		iocb->ki_pending_err, iocb->ki_res, iocb->ki_res2);
> 
>  	/* after flagging the request as done, we
>  	 * must never even look at it again
> @@ -459,6 +458,7 @@ static struct kiocb fastcall *__aio_get_
>  	req->ki_cancel = NULL;
>  	req->ki_retry = NULL;
>  	req->ki_dtor = NULL;
> +	req->ki_pending_err = 0;
>  	req->private = NULL;
>  	req->ki_iovec = NULL;
>  	INIT_LIST_HEAD(&req->ki_run_list);
> @@ -548,10 +548,14 @@ static int __aio_put_req(struct kioctx *
> 
>  	assert_spin_locked(&ctx->ctx_lock);
> 
> -	req->ki_users --;
> +	req->ki_users--;
>  	BUG_ON(req->ki_users < 0);
>  	if (likely(req->ki_users))
>  		return 0;
> +
> +	if (kiocbIsInserted(req))
> +		aio_ring_insert_entry(ctx, req);
> +
>  	list_del(&req->ki_list);		/* remove from active_reqs */
>  	req->ki_cancel = NULL;
>  	req->ki_retry = NULL;
> @@ -983,27 +987,24 @@ int fastcall aio_complete(struct kiocb *
>  		return 1;
>  	}
> 
> -	/* add a completion event to the ring buffer.
> -	 * must be done holding ctx->ctx_lock to prevent
> -	 * other code from messing with the tail
> -	 * pointer since we might be called from irq
> -	 * context.
> -	 */
> +	/*
> +	 * We queue up the completion codes into the iocb.  They are combined
> +	 * with a potential error from the submission path and inserted into
> +	 * the ring once the last reference to the iocb is dropped.  Cancelled
> +	 * iocbs don't insert events on completion because userland was given
> +	 * an event directly as part of the cancelation interface.
> + 	 */
>  	spin_lock_irqsave(&ctx->ctx_lock, flags);
> 
>  	if (iocb->ki_run_list.prev && !list_empty(&iocb->ki_run_list))
>  		list_del_init(&iocb->ki_run_list);
> 
> -	/*
> -	 * cancelled requests don't get events, userland was given one
> -	 * when the event got cancelled.
> -	 */
> -	if (kiocbIsCancelled(iocb))
> -		goto put_rq;
> -
> -	aio_ring_insert_entry(ctx, iocb, res, res2);
> -
> -put_rq:
> +	if (!kiocbIsCancelled(iocb)) {
> +		iocb->ki_res = res;
> +		iocb->ki_res2 = res2;
> +		kiocbSetInserted(iocb);
> +	}
> +
>  	/* everything turned out well, dispose of the aiocb. */
>  	ret = __aio_put_req(ctx, iocb);
> 
> diff -r 8a740eb579d4 include/linux/aio.h
> --- a/include/linux/aio.h	Mon Feb 19 13:12:20 2007 -0800
> +++ b/include/linux/aio.h	Mon Feb 19 13:16:00 2007 -0800
> @@ -34,6 +34,7 @@ struct kioctx;
>  /* #define KIF_LOCKED		0 */
>  #define KIF_KICKED		1
>  #define KIF_CANCELLED		2
> +#define KIF_INSERTED		4
> 
>  #define kiocbTryLock(iocb)	test_and_set_bit(KIF_LOCKED, &(iocb)->ki_flags)
>  #define kiocbTryKick(iocb)	test_and_set_bit(KIF_KICKED, &(iocb)->ki_flags)
> @@ -41,6 +42,7 @@ struct kioctx;
>  #define kiocbSetLocked(iocb)	set_bit(KIF_LOCKED, &(iocb)->ki_flags)
>  #define kiocbSetKicked(iocb)	set_bit(KIF_KICKED, &(iocb)->ki_flags)
>  #define kiocbSetCancelled(iocb)	set_bit(KIF_CANCELLED, &(iocb)->ki_flags)
> +#define kiocbSetInserted(iocb)	set_bit(KIF_INSERTED, &(iocb)->ki_flags)
> 
>  #define kiocbClearLocked(iocb)	clear_bit(KIF_LOCKED, &(iocb)->ki_flags)
>  #define kiocbClearKicked(iocb)	clear_bit(KIF_KICKED, &(iocb)->ki_flags)
> @@ -49,6 +51,7 @@ struct kioctx;
>  #define kiocbIsLocked(iocb)	test_bit(KIF_LOCKED, &(iocb)->ki_flags)
>  #define kiocbIsKicked(iocb)	test_bit(KIF_KICKED, &(iocb)->ki_flags)
>  #define kiocbIsCancelled(iocb)	test_bit(KIF_CANCELLED, &(iocb)->ki_flags)
> +#define kiocbIsInserted(iocb)	test_bit(KIF_INSERTED, &(iocb)->ki_flags)
> 
>  /* is there a better place to document function pointer methods? */
>  /**
> @@ -119,6 +122,10 @@ struct kiocb {
> 
>  	struct list_head	ki_list;	/* the aio core uses this
>  						 * for cancellation */
> +	/* we store and combine return codes from submission and completion */ 
> +	int			ki_pending_err;
> +	long			ki_res;
> +	long			ki_res2;
>  };
> 
>  #define is_sync_kiocb(iocb)	((iocb)->ki_key == KIOCB_SYNC_KEY)
> @@ -246,6 +253,29 @@ static inline struct kiocb *list_kiocb(s
>  	return list_entry(h, struct kiocb, ki_list);
>  }
> 
> +/*
> + * This function is used to make sure that an error is communicated to
> + * userspace on iocb completion without stopping -EIOCBQUEUED from bubbling up
> + * to fs/aio.c from the place where it originated.
> + *
> + * If we have an existing -EIOCBQUEUED it must be returned all the way to
> + * fs/aio.c so that it doesn't double-complete the iocb along with whoever
> + * returned -EIOCBQUEUED..  In that case we put the new error in the iocb.  It
> + * will be returned to userspace *intead of* the first result code given to
> + * aio_complete().  Use this only for errors which must overwrite whatever the
> + * return code might have been.  The first non-zero new_err given to this
> + * function for a given iocb will be returned to userspace.
> + */
> +static inline int aio_propogate_error(struct kiocb *iocb, int existing_err,
> +				      int new_err)
> +{
> +	if (existing_err != -EIOCBQUEUED)
> +		return new_err;
> +	if (!iocb->ki_pending_err)
> +		iocb->ki_pending_err = new_err;
> +	return -EIOCBQUEUED;
> +}
> +
>  /* for sysctl: */
>  extern unsigned long aio_nr;
>  extern unsigned long aio_max_nr;
> diff -r 8a740eb579d4 mm/filemap.c
> --- a/mm/filemap.c	Mon Feb 19 13:12:20 2007 -0800
> +++ b/mm/filemap.c	Mon Feb 19 13:16:00 2007 -0800
> @@ -2031,7 +2031,7 @@ generic_file_direct_write(struct kiocb *
>  	    ((file->f_flags & O_SYNC) || IS_SYNC(inode))) {
>  		int err = generic_osync_inode(inode, mapping, OSYNC_METADATA);
>  		if (err < 0)
> -			written = err;
> +			written = aio_propogate_error(iocb, written, err);
>  	}
>  	return written;
>  }
> @@ -2396,7 +2396,7 @@ generic_file_direct_IO(int rw, struct ki
>  			int err = invalidate_inode_pages2_range(mapping,
>  					offset >> PAGE_CACHE_SHIFT, end);
>  			if (err)
> -				retval = err;
> +				retval = aio_propogate_error(iocb, retval, err);
>  		}
>  	}
>  	return retval;

-- 
Suparna Bhattacharya (suparna@in.ibm.com)
Linux Technology Center
IBM Software Lab, India


^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH 2/2] aio: propogate post-EIOCBQUEUED errors to completion event
  2007-02-21  8:35             ` Ken Chen
  2007-02-21  8:44               ` Ananiev, Leonid I
@ 2007-02-21 18:24               ` Zach Brown
  2007-02-21 19:15                 ` Ananiev, Leonid I
  1 sibling, 1 reply; 18+ messages in thread
From: Zach Brown @ 2007-02-21 18:24 UTC (permalink / raw)
  To: Ken Chen
  Cc: Ananiev, Leonid I, Chris Mason, linux-aio,
	Linux Kernel Mailing List, Benjamin LaHaise,
	Suparna bhattacharya, Andrew Morton, Badari Pulavarty


On Feb 21, 2007, at 12:35 AM, Ken Chen wrote:

> On 2/20/07, Ananiev, Leonid <leonid.i.ananiev@intel.com> wrote:
>> 1) mem=1G in kernel boot param if you have more
>> 2) unmount; mk2fs; mount
>> 3) dd if=/dev/zero of=<test_file> bs=1M count=1200
>> 4) aiostress -s 1200m -O -o 2 -i 1 -r 16k <test_file>
>> 5) if i++<50 goto 2).
>
> Would you please instrument the call chain of
> invalidate_complete_page2() and tell us exactly where it returns zero
> value in your failure case?
>
>   invalidate_complete_page2
>      try_to_release_page
>         ext3_releasepage
>            journal_try_to_free_buffers
>               ???

For what it's worth, Badari has explained this race in the past in a  
credible way.  I'll take the liberty of pasting a mail from him:

"
kjournald submited buffers for IO and waiting for them to finish.
Note that it has a ref. against the buffer.

journal_commit_transaction()
         ...
         submited buffers for IO
         /* Waiting for IO to complete */
         while (commit_transaction->t_locked_list) {
                 ...
                 get_bh(bh);
                 if (buffer_locked(bh)) {
                         spin_unlock(&journal->j_list_lock);
                         wait_on_buffer(bh);  <<<<<<
                         spin_lock(&journal->j_list_lock);
                 }

                 ..
                 put_bh(bh);
         }

Now, DIO process comes to frees the jh through  
journal_try_to_free_buffers()
but fails to drop_buffers() since kjournald() has a reference against  
it.
invalidate_inode_pages2_range()
         ..
         ext3_releasepage()
                 journal_try_to_free_buffers()
                         journal_put_journal_head()
                                 __journal_try_to_free_buffer()
                                         <--- freed jh

                         try_to_free_buffers()
                                 drop_buffers()
                                         if (buffer_busy(bh))
                                                 goto failed;
                                           <<--- returns EIO due to  
b_count

"

I don't mean to say that we shouldn't get traces to confirm the  
theory, just sharing.  And now we can point to this in the archives  
next time :).

- z

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH 2/2] aio: propogate post-EIOCBQUEUED errors to completion event
  2007-02-21 14:13   ` Suparna Bhattacharya
@ 2007-02-21 18:34     ` Zach Brown
  0 siblings, 0 replies; 18+ messages in thread
From: Zach Brown @ 2007-02-21 18:34 UTC (permalink / raw)
  To: suparna
  Cc: linux-aio, linux-kernel, Benjamin LaHaise, Andrew Morton, Leonid Ananiev

> This is an interesting trick, but I'd like to consider hard whether  
> the added
> complexity is worth it. Could you list the various other cases you  
> have in mind
> which would want to use it ?

I'm happy to report that the sync case and the  
invalidate_inode_pages2_range() case are the only two which try to  
rewrite 'ret'.  I audited all the call paths between aio_{read,write}  
and -EIOCBQUUEUED.

So, sure, maybe we can shuffle the house of cards a little in the  
direction away from having a fs/aio.c helper for the situation by  
simply removing the few current instances of the problem.  It won't  
scale as people add problems without knowing about the rule to not  
overwrite -EIOCBQUEUED, but hopefully that won't be a rule for much  
longer :) :).

For the O_SYNC case we could add another magical test to the is_async  
assignment which won't perform AIO on O_SYNC descriptors.

> For the invalidate_inode_pages2_range() case, I wonder if the right
> place to issue this is after the direct IO write has completed and  
> before
> aio_complete() is issued (somewhat like the way we do  
> bio_check_pages_dirty
> for DIO reads), rather than after submission when the IO may still  
> not have
> hit the disk. This would also make the behaviour uniform for  
> synchronous and
> async cases.

Hmm, I think I like that.  It solves the problem for the current sole  
user of -EIOCBQUEUED without too much disruption.

> BTW, am I right in interpreting that with your change aio_complete 
> () may
> trigger an io_getevents() wakeup, before the corresponding event is  
> placed
> on the ring buffer ?

Hmm, yeah, it looks like I goofed that.

I'll roll a patch which does the invalidation down in fs/direct-io.c.

- z

^ permalink raw reply	[flat|nested] 18+ messages in thread

* RE: [PATCH 2/2] aio: propogate post-EIOCBQUEUED errors to completion event
  2007-02-21 18:24               ` Zach Brown
@ 2007-02-21 19:15                 ` Ananiev, Leonid I
  0 siblings, 0 replies; 18+ messages in thread
From: Ananiev, Leonid I @ 2007-02-21 19:15 UTC (permalink / raw)
  To: Zach Brown, Ken Chen
  Cc: Chris Mason, linux-aio, Linux Kernel Mailing List,
	Benjamin LaHaise, Suparna bhattacharya, Andrew Morton,
	Badari Pulavarty

> kjournald submited buffers for IO and waiting for them to finish.

It is means that the patch incorrectly moves internal kernel
synchronization
problem into user space as EIO instead of fixing a root cause or perform
iterative
synchronization.
After patching users will be surprised a lot of EIO
(remember 47% EIO in aio-stress runs after patching) will buy new
disk...
This patch is harmful.

Leonid
>-----Original Message-----
>From: Zach Brown [mailto:zach.brown@oracle.com]
>Sent: Wednesday, February 21, 2007 9:25 PM
>To: Ken Chen
>Cc: Ananiev, Leonid I; Chris Mason; linux-aio; Linux Kernel Mailing
List; Benjamin LaHaise; Suparna
>bhattacharya; Andrew Morton; Badari Pulavarty
>Subject: Re: [PATCH 2/2] aio: propogate post-EIOCBQUEUED errors to
completion event
>
>
>On Feb 21, 2007, at 12:35 AM, Ken Chen wrote:
>
>> On 2/20/07, Ananiev, Leonid <leonid.i.ananiev@intel.com> wrote:
>>> 1) mem=1G in kernel boot param if you have more
>>> 2) unmount; mk2fs; mount
>>> 3) dd if=/dev/zero of=<test_file> bs=1M count=1200
>>> 4) aiostress -s 1200m -O -o 2 -i 1 -r 16k <test_file>
>>> 5) if i++<50 goto 2).
>>
>> Would you please instrument the call chain of
>> invalidate_complete_page2() and tell us exactly where it returns zero
>> value in your failure case?
>>
>>   invalidate_complete_page2
>>      try_to_release_page
>>         ext3_releasepage
>>            journal_try_to_free_buffers
>>               ???
>
>For what it's worth, Badari has explained this race in the past in a
>credible way.  I'll take the liberty of pasting a mail from him:
>
>"
>kjournald submited buffers for IO and waiting for them to finish.
>Note that it has a ref. against the buffer.
>
>journal_commit_transaction()
>         ...
>         submited buffers for IO
>         /* Waiting for IO to complete */
>         while (commit_transaction->t_locked_list) {
>                 ...
>                 get_bh(bh);
>                 if (buffer_locked(bh)) {
>                         spin_unlock(&journal->j_list_lock);
>                         wait_on_buffer(bh);  <<<<<<
>                         spin_lock(&journal->j_list_lock);
>                 }
>
>                 ..
>                 put_bh(bh);
>         }
>
>Now, DIO process comes to frees the jh through
>journal_try_to_free_buffers()
>but fails to drop_buffers() since kjournald() has a reference against
>it.
>invalidate_inode_pages2_range()
>         ..
>         ext3_releasepage()
>                 journal_try_to_free_buffers()
>                         journal_put_journal_head()
>                                 __journal_try_to_free_buffer()
>                                         <--- freed jh
>
>                         try_to_free_buffers()
>                                 drop_buffers()
>                                         if (buffer_busy(bh))
>                                                 goto failed;
>                                           <<--- returns EIO due to
>b_count
>
>"
>
>I don't mean to say that we shouldn't get traces to confirm the
>theory, just sharing.  And now we can point to this in the archives
>next time :).
>
>- z

^ permalink raw reply	[flat|nested] 18+ messages in thread

end of thread, other threads:[~2007-02-21 19:15 UTC | newest]

Thread overview: 18+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2007-02-19 21:38 [PATCH 1/2] aio: create aio_ring_insert_entry() function Zach Brown
2007-02-19 21:38 ` [PATCH 2/2] aio: propogate post-EIOCBQUEUED errors to completion event Zach Brown
2007-02-19 22:48   ` Jeff Moyer
2007-02-19 22:57     ` Zach Brown
2007-02-20 16:57   ` Ananiev, Leonid I
2007-02-20 17:03     ` Chris Mason
2007-02-20 17:17       ` Ananiev, Leonid I
2007-02-20 17:54         ` Chris Mason
2007-02-20 21:09           ` Ananiev, Leonid I
2007-02-21  8:35             ` Ken Chen
2007-02-21  8:44               ` Ananiev, Leonid I
2007-02-21 18:24               ` Zach Brown
2007-02-21 19:15                 ` Ananiev, Leonid I
2007-02-20 17:33   ` Ananiev, Leonid I
2007-02-20 18:26     ` Zach Brown
2007-02-21 14:13   ` Suparna Bhattacharya
2007-02-21 18:34     ` Zach Brown
2007-02-19 22:41 ` [PATCH 1/2] aio: create aio_ring_insert_entry() function Jeff Moyer

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.