linux-kernel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* [RFC][Patch] Retry based aio read for filesystems
@ 2003-03-05  9:17 Suparna Bhattacharya
  2003-03-05  9:26 ` [Patch 1/2] Retry based aio read - core aio changes Suparna Bhattacharya
                   ` (2 more replies)
  0 siblings, 3 replies; 15+ messages in thread
From: Suparna Bhattacharya @ 2003-03-05  9:17 UTC (permalink / raw)
  To: bcrl, akpm; +Cc: linux-aio, linux-kernel

For the last few days I've been playing with prototyping 
a particular flavour of a retry based implementation for 
filesystem aio read.

It is rather experimental at this point, with only 
limited bits of testing so far. [There are some potential 
races which seem to show up in the case of large reads 
requiring faulting in of the user space buffer. I've 
chosen to leave some printks on now ] 

Am posting these initial patches as responses to this
mail for early comments/feedback on the approach and 
to improve chances of unearthing any potential 
gotchas sooner rather than later :)

A few alternative variations are possible around the  
basic retry-driven theme that Ben LaHaise has designed - 
each with their pros and cons, so I am hoping that sharing 
experiences on what we're trying out may be a step 
towards some insights in determining what works best. 

Its in two parts:
aioretry.patch : Core aio infrastructure modifications
aioread.patch  : Modifications for aio read in 
particular (generic_file_aio_read)

I was working with 2.5.62, and haven't yet tried 
upgrading to 2.5.64.

The idea was to :
- Keep things as simple as possible with minimal changes 
  to the current filesystem read/write paths (which I 
  guess we don't want to disturb too much at this stage). 
  The way to do this is to make the base aio infrastructure 
  handle most of the async (retry) complexities.
- Take incremental steps towards making an operation async 
  - start with blocking->async conversions for the major
  blocking points.
- Keep in mind Dave Miller's concern about effect of aio
  additions (extra parameters) on sync i/o behaviour. So
  retaining sync i/o performance as far as possible gets 
  precedence over aio performance and latency right now.
- Take incremental steps towards tuning aio performance, 
  and optimizing specific paths for aio.

A retry-based model implies that each time we make as 
much progress as is possible in a non-blocking manner,
and then defer a restart of the operation from where we 
left off, at the next opportunity ... and so on, until
we finish. To make sure that a next opportunity does
indeed arise, each time we do _more_ than what we'd do 
for simple non-blocking i/o -- we initiate some steps
towards progress without actually waiting for it. Then
we try to mark ourselves to be notified when enough is
ready to try the next restart-from-where-we-left-off 
attempt.

3 decisions characterise this particular variation:

1. At what level to retry ? 

   This patch does it at the highest level (fops level),
   i.e. high level aio code retries the generic read 
   operation passing in the remaining parts of the buffer 
   to be read.

   No retries happen in the sync case at the moment,
   though the option exists.

2. What kind of state needs to be saved across retries
   and who maintains that ?

   The high level aio code keeps adjusting the parameters 
   to the read as retries progress (this is done by the
   routine calling the fops)

   There is, however, room for optimizations when 
   low-level paths can be modified to reuse state 
   across aio-retries.

3. When to trigger a retry ?
   
   This is the tricky part. This variation uses async 
   wait queue functions instead of the blocking wait, to
   trigger a retry (kick_iocb) when data becomes available. 
   So the synchronous lock_page, wait_on_page_bit have
   their async variations which do an async wait and
   return -EIOCBQUEUED (which is propogated up).

   (BTW, the races that I'm running into are mostly 
   related to this area - avoiding simoultaneous retries,
   dealing with completion while retry is going on
   etc. We need to audit the code and think about this 
   more closely. I wanted to keep the async wakeup  
   as simple as possible ... and deal with the races
   in some other way)

Ben, 

Is this anywhere close to what you had in mind, or
have played with, or were you aiming for retries
at a different (lower) level ?  How's your experience 
been with other variations that you've tried ? 

Would be interesting to take a look and compare notes,
if possible. 
Any chance you'll be putting something out ? 

I guess the retry state you maintain may be a little more 
specific to the target/type of aio, or is implemented in
a way lets state already updated by lower level functions 
to be reused/carried over directly, rather than recomputed 
by the aio code ... - does that work out to be more 
optimal ? Or were you working on a different technique 
altogether ?

There are a couple of fixes that I made in the aio code
as part of the patch. 
- The kiocbClearXXX were doing a set_bit instead of 
  clear_bit
- Sync iocbs were not woken up when iocb->ki_users = 1
  (dio takes a different path for sync and async iocbs,
  so maybe that's why we weren't seeing the problem yet)

Hope that's okay.

Regards
Suparna
-- 
Suparna Bhattacharya (suparna@in.ibm.com)
Linux Technology Center
IBM Software Labs, India


^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [Patch 1/2] Retry based aio read - core aio changes
  2003-03-05  9:17 [RFC][Patch] Retry based aio read for filesystems Suparna Bhattacharya
@ 2003-03-05  9:26 ` Suparna Bhattacharya
  2003-03-14 13:23   ` Suparna Bhattacharya
  2003-03-05  9:30 ` [Patch 2/2] Retry based aio read - filesystem read changes Suparna Bhattacharya
  2003-03-05 23:00 ` [RFC][Patch] Retry based aio read for filesystems Janet Morgan
  2 siblings, 1 reply; 15+ messages in thread
From: Suparna Bhattacharya @ 2003-03-05  9:26 UTC (permalink / raw)
  To: bcrl, akpm; +Cc: linux-aio, linux-kernel

On Wed, Mar 05, 2003 at 02:47:54PM +0530, Suparna Bhattacharya wrote:
> For the last few days I've been playing with prototyping 
> a particular flavour of a retry based implementation for 
> filesystem aio read.

 
# aioretry.patch : Core aio infrastructure modifications
		   for high-level retry based aio

Includes couple of fixes in the aio code
-The kiocbClearXXX were doing a set_bit instead of 
 clear_bit
-Sync iocbs were not woken up when iocb->ki_users = 1
 (dio takes a different path for sync and async iocbs,
 so maybe that's why we weren't seeing the problem yet)

Regards
Suparna

----------------------------------------------------

diff -ur linux-2.5.62/fs/aio.c linux-2.5.62-aio/fs/aio.c
--- linux-2.5.62/fs/aio.c	Tue Feb 18 04:26:14 2003
+++ linux-2.5.62-aio/fs/aio.c	Tue Mar  4 19:54:24 2003
@@ -314,14 +314,15 @@
  */
 ssize_t wait_on_sync_kiocb(struct kiocb *iocb)
 {
-	while (iocb->ki_users) {
+	printk("wait_on_sync_iocb\n");
+	while ((iocb->ki_users) && !kiocbIsKicked(iocb)) {
 		set_current_state(TASK_UNINTERRUPTIBLE);
-		if (!iocb->ki_users)
+		if (!iocb->ki_users || kiocbIsKicked(iocb))
 			break;
 		schedule();
 	}
 	__set_current_state(TASK_RUNNING);
-	return iocb->ki_user_data;
+	return iocb->ki_users ? -EIOCBQUEUED : iocb->ki_user_data;
 }
 
 /* exit_aio: called when the last user of mm goes away.  At this point, 
@@ -395,6 +396,7 @@
 	req->ki_cancel = NULL;
 	req->ki_retry = NULL;
 	req->ki_user_obj = NULL;
+	INIT_LIST_HEAD(&req->ki_run_list);
 
 	/* Check if the completion queue has enough free space to
 	 * accept an event from this io.
@@ -558,15 +560,20 @@
 	enter_lazy_tlb(mm, current, smp_processor_id());
 }
 
-/* Run on kevent's context.  FIXME: needs to be per-cpu and warn if an
- * operation blocks.
- */
-static void aio_kick_handler(void *data)
+static inline int __queue_kicked_iocb(struct kiocb *iocb)
 {
-	struct kioctx *ctx = data;
+	struct kioctx	*ctx = iocb->ki_ctx;
 
-	use_mm(ctx->mm);
+	if (list_empty(&iocb->ki_run_list)) {
+		list_add_tail(&iocb->ki_run_list, 
+			&ctx->run_list);
+		return 1;
+	}
+	return 0;
+}
 
+static void aio_run_iocbs(struct kioctx *ctx)
+{
 	spin_lock_irq(&ctx->ctx_lock);
 	while (!list_empty(&ctx->run_list)) {
 		struct kiocb *iocb;
@@ -574,30 +581,67 @@
 
 		iocb = list_entry(ctx->run_list.next, struct kiocb,
 				  ki_run_list);
-		list_del(&iocb->ki_run_list);
+		list_del_init(&iocb->ki_run_list);
 		iocb->ki_users ++;
-		spin_unlock_irq(&ctx->ctx_lock);
 
-		kiocbClearKicked(iocb);
-		ret = iocb->ki_retry(iocb);
-		if (-EIOCBQUEUED != ret) {
-			aio_complete(iocb, ret, 0);
-			iocb = NULL;
+		if (!kiocbTryStart(iocb)) {
+			kiocbClearKicked(iocb);
+			spin_unlock_irq(&ctx->ctx_lock);
+			ret = iocb->ki_retry(iocb);
+			if (-EIOCBQUEUED != ret) {
+				if (list_empty(&iocb->ki_wait.task_list))
+					aio_complete(iocb, ret, 0);
+				else
+					printk("can't delete iocb in use\n");
+			}
+			spin_lock_irq(&ctx->ctx_lock);
+			kiocbClearStarted(iocb);
+			if (kiocbIsKicked(iocb))
+				__queue_kicked_iocb(iocb);
+		} else {
+			printk("iocb already started\n");
 		}
-
-		spin_lock_irq(&ctx->ctx_lock);
 		if (NULL != iocb)
 			__aio_put_req(ctx, iocb);
 	}
 	spin_unlock_irq(&ctx->ctx_lock);
 
+}
+
+/* Run on aiod/kevent's context.  FIXME: needs to be per-cpu and warn if an
+ * operation blocks.
+ */
+static void aio_kick_handler(void *data)
+{
+	struct kioctx *ctx = data;
+
+	use_mm(ctx->mm);
+	aio_run_iocbs(ctx);
 	unuse_mm(ctx->mm);
 }
 
-void kick_iocb(struct kiocb *iocb)
+
+void queue_kicked_iocb(struct kiocb *iocb)
 {
 	struct kioctx	*ctx = iocb->ki_ctx;
+	unsigned long flags;
+	int run = 0;
 
+	WARN_ON((!list_empty(&iocb->ki_wait.task_list)));
+
+	spin_lock_irqsave(&ctx->ctx_lock, flags);
+	run = __queue_kicked_iocb(iocb);
+	spin_unlock_irqrestore(&ctx->ctx_lock, flags);
+	if (run) {
+		if (waitqueue_active(&ctx->wait))
+			wake_up(&ctx->wait);
+		else
+			queue_work(aio_wq, &ctx->wq);
+	}
+}
+
+void kick_iocb(struct kiocb *iocb)
+{
 	/* sync iocbs are easy: they can only ever be executing from a 
 	 * single context. */
 	if (is_sync_kiocb(iocb)) {
@@ -606,12 +650,11 @@
 		return;
 	}
 
-	if (!kiocbTryKick(iocb)) {
-		unsigned long flags;
-		spin_lock_irqsave(&ctx->ctx_lock, flags);
-		list_add_tail(&iocb->ki_run_list, &ctx->run_list);
-		spin_unlock_irqrestore(&ctx->ctx_lock, flags);
-		schedule_work(&ctx->wq);
+
+	if (!kiocbTryKick(iocb) && !kiocbIsStarted(iocb)) {
+		queue_kicked_iocb(iocb);
+	} else {
+		pr_debug("iocb already kicked or in progress\n");
 	}
 }
 
@@ -642,13 +685,13 @@
 		iocb->ki_user_data = res;
 		if (iocb->ki_users == 1) {
 			iocb->ki_users = 0;
-			return 1;
+			ret = 1;
+		} else {
+			spin_lock_irq(&ctx->ctx_lock);
+			iocb->ki_users--;
+			ret = (0 == iocb->ki_users);
+			spin_unlock_irq(&ctx->ctx_lock);
 		}
-		spin_lock_irq(&ctx->ctx_lock);
-		iocb->ki_users--;
-		ret = (0 == iocb->ki_users);
-		spin_unlock_irq(&ctx->ctx_lock);
-
 		/* sync iocbs put the task here for us */
 		wake_up_process(iocb->ki_user_obj);
 		return ret;
@@ -664,6 +707,9 @@
 	 */
 	spin_lock_irqsave(&ctx->ctx_lock, flags);
 
+	if (!list_empty(&iocb->ki_run_list))
+		list_del_init(&iocb->ki_run_list);
+
 	ring = kmap_atomic(info->ring_pages[0], KM_IRQ1);
 
 	tail = info->tail;
@@ -865,6 +911,8 @@
 			ret = 0;
 			if (to.timed_out)	/* Only check after read evt */
 				break;
+			/* accelerate kicked iocbs for this ctx */	
+			aio_run_iocbs(ctx);
 			schedule();
 			if (signal_pending(tsk)) {
 				ret = -EINTR;
@@ -984,6 +1032,114 @@
 	return -EINVAL;
 }
 
+/* Called during initial submission and subsequent retry operations */
+long aio_process_iocb(struct kiocb *iocb)
+{
+	struct file *file = iocb->ki_filp;
+	ssize_t ret = 0;
+	
+	if (iocb->ki_retried++ > 1024*1024) {
+		printk("Maximal retry count. Bytes done %d\n",
+			iocb->ki_nbytes - iocb->ki_left);
+		return -EAGAIN;
+	}
+
+	if (!(iocb->ki_retried & 0xff)) {
+		printk("%ld aio retries completed %d bytes of %d\n",
+			iocb->ki_retried, 
+			iocb->ki_nbytes - iocb->ki_left, iocb->ki_nbytes);
+	}
+
+	BUG_ON(current->iocb != NULL);
+			
+	current->iocb = iocb;
+
+	switch (iocb->ki_opcode) {
+	case IOCB_CMD_PREAD:
+		ret = -EBADF;
+		if (unlikely(!(file->f_mode & FMODE_READ)))
+			goto out;
+		ret = -EFAULT;
+		if (unlikely(!access_ok(VERIFY_WRITE, iocb->ki_buf, 
+			iocb->ki_left)))
+			goto out;
+		ret = -EINVAL;
+		if (file->f_op->aio_read)
+			ret = file->f_op->aio_read(iocb, iocb->ki_buf,
+				iocb->ki_left, iocb->ki_pos);
+		break;
+	case IOCB_CMD_PWRITE:
+		ret = -EBADF;
+		if (unlikely(!(file->f_mode & FMODE_WRITE)))
+			goto out;
+		ret = -EFAULT;
+		if (unlikely(!access_ok(VERIFY_READ, iocb->ki_buf, 
+			iocb->ki_left)))
+			goto out;
+		ret = -EINVAL;
+		if (file->f_op->aio_write)
+			ret = file->f_op->aio_write(iocb, iocb->ki_buf,
+				iocb->ki_left, iocb->ki_pos);
+		break;
+	case IOCB_CMD_FDSYNC:
+		ret = -EINVAL;
+		if (file->f_op->aio_fsync)
+			ret = file->f_op->aio_fsync(iocb, 1);
+		break;
+	case IOCB_CMD_FSYNC:
+		ret = -EINVAL;
+		if (file->f_op->aio_fsync)
+			ret = file->f_op->aio_fsync(iocb, 0);
+		break;
+	default:
+		dprintk("EINVAL: io_submit: no operation provided\n");
+		ret = -EINVAL;
+	}
+
+	pr_debug("aio_process_iocb: fop ret %d\n", ret);
+	if (likely(-EIOCBQUEUED == ret)) {
+		blk_run_queues();
+		goto out;
+	}
+	if (ret > 0) {
+		iocb->ki_buf += ret;
+		iocb->ki_left -= ret;
+
+		/* Not done yet or a short read, or.. */
+		if (iocb->ki_left 
+		/* may have copied out data but not completed writing */
+			|| ((iocb->ki_left == 0) &&
+			(iocb->ki_opcode = IOCB_CMD_PWRITE)) ){
+			/* FIXME:Can we find a better way to handle this ? */
+			/* Force an extra retry to determine if we're done */
+			ret = -EIOCBQUEUED;
+			goto out;
+		}
+		
+	}
+
+	if (ret >= 0) 
+		ret = iocb->ki_nbytes - iocb->ki_left;
+
+out:
+	pr_debug("ki_pos = %llu\n", iocb->ki_pos);
+	current->iocb = NULL;
+	if ((-EIOCBQUEUED == ret) && list_empty(&iocb->ki_wait.task_list)) {
+		kiocbSetKicked(iocb);
+	}
+
+	return ret;
+}
+
+int aio_wake_function(wait_queue_t *wait, unsigned mode, int sync)
+{
+	struct kiocb *iocb = container_of(wait, struct kiocb, ki_wait);
+
+	list_del_init(&wait->task_list);
+	kick_iocb(iocb);
+	return 1;
+}
+
 static int FASTCALL(io_submit_one(struct kioctx *ctx, struct iocb *user_iocb,
 				  struct iocb *iocb));
 static int io_submit_one(struct kioctx *ctx, struct iocb *user_iocb,
@@ -991,8 +1147,7 @@
 {
 	struct kiocb *req;
 	struct file *file;
-	ssize_t ret;
-	char *buf;
+	long ret;
 
 	/* enforce forwards compatibility on users */
 	if (unlikely(iocb->aio_reserved1 || iocb->aio_reserved2 ||
@@ -1033,50 +1188,34 @@
 	req->ki_user_data = iocb->aio_data;
 	req->ki_pos = iocb->aio_offset;
 
-	buf = (char *)(unsigned long)iocb->aio_buf;
+	req->ki_buf = (char *)(unsigned long)iocb->aio_buf;
+	req->ki_left = req->ki_nbytes = iocb->aio_nbytes;
+	req->ki_opcode = iocb->aio_lio_opcode;
+	req->ki_retry = aio_process_iocb;
+	init_waitqueue_func_entry(&req->ki_wait, aio_wake_function);
+	INIT_LIST_HEAD(&req->ki_wait.task_list);
+	req->ki_retried = 0;
+	kiocbSetStarted(req);
 
-	switch (iocb->aio_lio_opcode) {
-	case IOCB_CMD_PREAD:
-		ret = -EBADF;
-		if (unlikely(!(file->f_mode & FMODE_READ)))
-			goto out_put_req;
-		ret = -EFAULT;
-		if (unlikely(!access_ok(VERIFY_WRITE, buf, iocb->aio_nbytes)))
-			goto out_put_req;
-		ret = -EINVAL;
-		if (file->f_op->aio_read)
-			ret = file->f_op->aio_read(req, buf,
-					iocb->aio_nbytes, req->ki_pos);
-		break;
-	case IOCB_CMD_PWRITE:
-		ret = -EBADF;
-		if (unlikely(!(file->f_mode & FMODE_WRITE)))
-			goto out_put_req;
-		ret = -EFAULT;
-		if (unlikely(!access_ok(VERIFY_READ, buf, iocb->aio_nbytes)))
-			goto out_put_req;
-		ret = -EINVAL;
-		if (file->f_op->aio_write)
-			ret = file->f_op->aio_write(req, buf,
-					iocb->aio_nbytes, req->ki_pos);
-		break;
-	case IOCB_CMD_FDSYNC:
-		ret = -EINVAL;
-		if (file->f_op->aio_fsync)
-			ret = file->f_op->aio_fsync(req, 1);
-		break;
-	case IOCB_CMD_FSYNC:
-		ret = -EINVAL;
-		if (file->f_op->aio_fsync)
-			ret = file->f_op->aio_fsync(req, 0);
-		break;
-	default:
-		dprintk("EINVAL: io_submit: no operation provided\n");
-		ret = -EINVAL;
-	}
+	ret = aio_process_iocb(req);
+
+	if (likely(-EIOCBQUEUED == ret)) {
+		int run = 0;
+
+		spin_lock_irq(&ctx->ctx_lock);
+		kiocbClearStarted(req);
+		if (kiocbIsKicked(req))
+			run =__queue_kicked_iocb(req);
+		spin_unlock_irq(&ctx->ctx_lock);
+		if (run)
+			queue_work(aio_wq, &ctx->wq);
 
-	if (likely(-EIOCBQUEUED == ret))
 		return 0;
+	}
+
+	if ((-EBADF == ret) || (-EFAULT == ret))
+		goto out_put_req;
+
 	aio_complete(req, ret, 0);
 	return 0;
 
diff -ur linux-2.5.62/include/linux/aio.h linux-2.5.62-aio/include/linux/aio.h
--- linux-2.5.62/include/linux/aio.h	Tue Feb 18 04:25:50 2003
+++ linux-2.5.62-aio/include/linux/aio.h	Mon Mar  3 12:17:12 2003
@@ -29,21 +29,26 @@
 #define KIF_LOCKED		0
 #define KIF_KICKED		1
 #define KIF_CANCELLED		2
+#define KIF_STARTED		3
 
 #define kiocbTryLock(iocb)	test_and_set_bit(KIF_LOCKED, &(iocb)->ki_flags)
 #define kiocbTryKick(iocb)	test_and_set_bit(KIF_KICKED, &(iocb)->ki_flags)
+#define kiocbTryStart(iocb)	test_and_set_bit(KIF_STARTED, &(iocb)->ki_flags)
 
 #define kiocbSetLocked(iocb)	set_bit(KIF_LOCKED, &(iocb)->ki_flags)
 #define kiocbSetKicked(iocb)	set_bit(KIF_KICKED, &(iocb)->ki_flags)
 #define kiocbSetCancelled(iocb)	set_bit(KIF_CANCELLED, &(iocb)->ki_flags)
+#define kiocbSetStarted(iocb)	set_bit(KIF_STARTED, &(iocb)->ki_flags)
 
-#define kiocbClearLocked(iocb)	set_bit(KIF_LOCKED, &(iocb)->ki_flags)
-#define kiocbClearKicked(iocb)	set_bit(KIF_KICKED, &(iocb)->ki_flags)
-#define kiocbClearCancelled(iocb)	set_bit(KIF_CANCELLED, &(iocb)->ki_flags)
+#define kiocbClearLocked(iocb)	clear_bit(KIF_LOCKED, &(iocb)->ki_flags)
+#define kiocbClearKicked(iocb)	clear_bit(KIF_KICKED, &(iocb)->ki_flags)
+#define kiocbClearCancelled(iocb)	clear_bit(KIF_CANCELLED, &(iocb)->ki_flags)
+#define kiocbClearStarted(iocb)	clear_bit(KIF_STARTED, &(iocb)->ki_flags)
 
 #define kiocbIsLocked(iocb)	test_bit(0, &(iocb)->ki_flags)
 #define kiocbIsKicked(iocb)	test_bit(1, &(iocb)->ki_flags)
 #define kiocbIsCancelled(iocb)	test_bit(2, &(iocb)->ki_flags)
+#define kiocbIsStarted(iocb)	test_bit(3, &(iocb)->ki_flags)
 
 struct kiocb {
 	struct list_head	ki_run_list;
@@ -62,6 +67,14 @@
 	void			*ki_user_obj;	/* pointer to userland's iocb */
 	__u64			ki_user_data;	/* user's data for completion */
 	loff_t			ki_pos;
+	
+	/* State that we remember to be able to restart/retry  */
+	unsigned short		ki_opcode;
+	size_t			ki_nbytes; 	/* copy of iocb->aio_nbytes */
+	char 			*ki_buf;	/* remaining iocb->aio_buf */
+	size_t			ki_left; 	/* remaining bytes */
+	wait_queue_t		ki_wait;
+	long			ki_retried; 	/* just for testing */
 
 	char			private[KIOCB_PRIVATE_SIZE];
 };
@@ -77,6 +90,8 @@
 		(x)->ki_ctx = &tsk->active_mm->default_kioctx;	\
 		(x)->ki_cancel = NULL;			\
 		(x)->ki_user_obj = tsk;			\
+		(x)->ki_user_data = 0;			\
+		init_wait((&(x)->ki_wait));		\
 	} while (0)
 
 #define AIO_RING_MAGIC			0xa10a10a1
diff -ur linux-2.5.62/include/linux/init_task.h linux-2.5.62-aio/include/linux/init_task.h
--- linux-2.5.62/include/linux/init_task.h	Tue Feb 18 04:25:53 2003
+++ linux-2.5.62-aio/include/linux/init_task.h	Thu Feb 27 19:01:39 2003
@@ -101,6 +101,7 @@
 	.alloc_lock	= SPIN_LOCK_UNLOCKED,				\
 	.switch_lock	= SPIN_LOCK_UNLOCKED,				\
 	.journal_info	= NULL,						\
+	.iocb		= NULL,						\
 }
 
 
diff -ur linux-2.5.62/include/linux/sched.h linux-2.5.62-aio/include/linux/sched.h
--- linux-2.5.62/include/linux/sched.h	Tue Feb 18 04:25:53 2003
+++ linux-2.5.62-aio/include/linux/sched.h	Thu Feb 27 19:01:39 2003
@@ -418,6 +418,8 @@
 
 	unsigned long ptrace_message;
 	siginfo_t *last_siginfo; /* For ptrace use.  */
+/* current aio handle */
+	struct kiocb *iocb;
 };
 
 extern void __put_task_struct(struct task_struct *tsk);
diff -ur linux-2.5.62/kernel/fork.c linux-2.5.62-aio/kernel/fork.c
--- linux-2.5.62/kernel/fork.c	Tue Feb 18 04:25:53 2003
+++ linux-2.5.62-aio/kernel/fork.c	Tue Mar  4 14:58:44 2003
@@ -128,6 +128,10 @@
 	spin_lock_irqsave(&q->lock, flags);
 	if (list_empty(&wait->task_list))
 		__add_wait_queue(q, wait);
+	else {
+		if (current->iocb && (wait == &current->iocb->ki_wait))
+			printk("prepare_to_wait: iocb->ki_wait in use\n");
+	}
 	spin_unlock_irqrestore(&q->lock, flags);
 }
 
@@ -834,6 +838,7 @@
 	p->lock_depth = -1;		/* -1 = no lock */
 	p->start_time = get_jiffies_64();
 	p->security = NULL;
+	p->iocb = NULL;
 
 	retval = -ENOMEM;
 	if (security_task_alloc(p))

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [Patch 2/2] Retry based aio read - filesystem read changes
  2003-03-05  9:17 [RFC][Patch] Retry based aio read for filesystems Suparna Bhattacharya
  2003-03-05  9:26 ` [Patch 1/2] Retry based aio read - core aio changes Suparna Bhattacharya
@ 2003-03-05  9:30 ` Suparna Bhattacharya
  2003-03-05 10:42   ` Andrew Morton
  2003-03-05 23:00 ` [RFC][Patch] Retry based aio read for filesystems Janet Morgan
  2 siblings, 1 reply; 15+ messages in thread
From: Suparna Bhattacharya @ 2003-03-05  9:30 UTC (permalink / raw)
  To: bcrl, akpm; +Cc: linux-aio, linux-kernel

On Wed, Mar 05, 2003 at 02:47:54PM +0530, Suparna Bhattacharya wrote:
> For the last few days I've been playing with prototyping 
> a particular flavour of a retry based implementation for 
> filesystem aio read.

# aioread.patch  : Modifications for aio read in 
# particular (generic_file_aio_read)


diff -ur linux-2.5.62/include/linux/pagemap.h linux-2.5.62-aio/include/linux/pagemap.h
--- linux-2.5.62/include/linux/pagemap.h	Tue Feb 18 04:25:49 2003
+++ linux-2.5.62-aio/include/linux/pagemap.h	Thu Feb 27 19:01:39 2003
@@ -93,6 +93,16 @@
 	if (TestSetPageLocked(page))
 		__lock_page(page);
 }
+
+extern int FASTCALL(__aio_lock_page(struct page *page));
+static inline int aio_lock_page(struct page *page)
+{
+	if (TestSetPageLocked(page))
+		return __aio_lock_page(page);
+	else
+		return 0;
+}
+
 	
 /*
  * This is exported only for wait_on_page_locked/wait_on_page_writeback.
@@ -113,6 +123,15 @@
 		wait_on_page_bit(page, PG_locked);
 }
 
+extern int FASTCALL(aio_wait_on_page_bit(struct page *page, int bit_nr));
+static inline int aio_wait_on_page_locked(struct page *page)
+{
+	if (PageLocked(page))
+		return aio_wait_on_page_bit(page, PG_locked);
+	else
+		return 0;
+}
+
 /* 
  * Wait for a page to complete writeback
  */
diff -ur linux-2.5.62/mm/filemap.c linux-2.5.62-aio/mm/filemap.c
--- linux-2.5.62/mm/filemap.c	Tue Feb 18 04:26:11 2003
+++ linux-2.5.62-aio/mm/filemap.c	Mon Mar  3 19:32:40 2003
@@ -268,6 +268,43 @@
 }
 EXPORT_SYMBOL(wait_on_page_bit);
 
+int aio_schedule(void)
+{
+	if (!current->iocb) {
+		io_schedule();
+		return 0;
+	} else {
+		pr_debug("aio schedule");
+		return -EIOCBQUEUED;
+	}
+}
+
+int aio_wait_on_page_bit(struct page *page, int bit_nr)
+{
+	wait_queue_head_t *waitqueue = page_waitqueue(page);
+	DEFINE_WAIT(sync_wait);
+	wait_queue_t *wait = &sync_wait;
+	int state = TASK_UNINTERRUPTIBLE;
+		
+	if (current->iocb) {
+		wait = &current->iocb->ki_wait;
+		state = TASK_RUNNING;
+	}
+
+	do {
+		prepare_to_wait(waitqueue, wait, state);
+		if (test_bit(bit_nr, &page->flags)) {
+			sync_page(page);
+			if (-EIOCBQUEUED == aio_schedule())
+				return -EIOCBQUEUED;
+		}
+	} while (test_bit(bit_nr, &page->flags));
+	finish_wait(waitqueue, wait);
+
+	return 0;
+}
+EXPORT_SYMBOL(aio_wait_on_page_bit);
+
 /**
  * unlock_page() - unlock a locked page
  *
@@ -336,6 +373,31 @@
 }
 EXPORT_SYMBOL(__lock_page);
 
+int __aio_lock_page(struct page *page)
+{
+	wait_queue_head_t *wqh = page_waitqueue(page);
+	DEFINE_WAIT(sync_wait);
+	wait_queue_t *wait = &sync_wait;
+	int state = TASK_UNINTERRUPTIBLE;
+		
+	if (current->iocb) {
+		wait = &current->iocb->ki_wait;
+		state = TASK_RUNNING;
+	}
+
+	while (TestSetPageLocked(page)) {
+		prepare_to_wait(wqh, wait, state);
+		if (PageLocked(page)) {
+			sync_page(page);
+			if (-EIOCBQUEUED == aio_schedule())
+				return -EIOCBQUEUED;
+		}
+	}
+	finish_wait(wqh, wait);
+	return 0;
+}
+EXPORT_SYMBOL(__aio_lock_page);
+
 /*
  * a rather lightweight function, finding and getting a reference to a
  * hashed page atomically.
@@ -614,7 +676,13 @@
 			goto page_ok;
 
 		/* Get exclusive access to the page ... */
-		lock_page(page);
+		
+		if (aio_lock_page(page)) {
+			pr_debug("queued lock page \n");
+			error = -EIOCBQUEUED;
+			/* TBD: should we hold on to the cached page ? */
+			goto sync_error;
+		}
 
 		/* Did it get unhashed before we got the lock? */
 		if (!page->mapping) {
@@ -636,12 +704,18 @@
 		if (!error) {
 			if (PageUptodate(page))
 				goto page_ok;
-			wait_on_page_locked(page);
+			if (aio_wait_on_page_locked(page)) {
+				pr_debug("queued wait_on_page \n");
+				error = -EIOCBQUEUED;
+				/*TBD:should we hold on to the cached page ?*/
+				goto sync_error;
+			}
 			if (PageUptodate(page))
 				goto page_ok;
 			error = -EIO;
 		}
 
+sync_error:
 		/* UHHUH! A synchronous read error occurred. Report it */
 		desc->error = error;
 		page_cache_release(page);
@@ -813,6 +887,7 @@
 	ssize_t ret;
 
 	init_sync_kiocb(&kiocb, filp);
+	BUG_ON(current->iocb != NULL);
 	ret = __generic_file_aio_read(&kiocb, &local_iov, 1, ppos);
 	if (-EIOCBQUEUED == ret)
 		ret = wait_on_sync_kiocb(&kiocb);
@@ -844,6 +919,7 @@
 {
 	read_descriptor_t desc;
 
+	BUG_ON(current->iocb != NULL);
 	if (!count)
 		return 0;
 

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [Patch 2/2] Retry based aio read - filesystem read changes
  2003-03-05  9:30 ` [Patch 2/2] Retry based aio read - filesystem read changes Suparna Bhattacharya
@ 2003-03-05 10:42   ` Andrew Morton
  2003-03-05 12:14     ` Suparna Bhattacharya
  0 siblings, 1 reply; 15+ messages in thread
From: Andrew Morton @ 2003-03-05 10:42 UTC (permalink / raw)
  To: suparna; +Cc: bcrl, linux-aio, linux-kernel

Suparna Bhattacharya <suparna@in.ibm.com> wrote:
>
> +extern int FASTCALL(aio_wait_on_page_bit(struct page *page, int bit_nr));
> +static inline int aio_wait_on_page_locked(struct page *page)

Oh boy.

There are soooo many more places where we can block:

- write() -> down(&inode->i_sem)

- read()/write() -> read indirect block -> wait_on_buffer()

- read()/write() -> read bitmap block -> wait_on_buffer()

- write() -> allocate block -> mark_buffer_dirty() ->
	balance_dirty_pages() -> writer throttling

- write() -> allocate block -> journal_get_write_access()

- read()/write() -> update_a/b/ctime() -> journal_get_write_access()

- ditto for other journalling filesystems

- read()/write() -> anything -> get_request_wait()
  (This one can be avoided by polling the backing_dev_info congestion
   flags)

- read()/write() -> page allocator -> blk_congestion_wait()

- write() -> balance_dirty_pages() -> writer throttling

- probably others.

Now, I assume that what you're looking for here is an 80% solution, but it
seems that a lot more changes will be needed to get even that far.

And given that a single kernel thread per spindle can easily keep that
spindle saturated all the time, one does wonder "why try to do it this way at
all"?

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [Patch 2/2] Retry based aio read - filesystem read changes
  2003-03-05 10:42   ` Andrew Morton
@ 2003-03-05 12:14     ` Suparna Bhattacharya
  2003-03-31 18:32       ` Janet Morgan
  0 siblings, 1 reply; 15+ messages in thread
From: Suparna Bhattacharya @ 2003-03-05 12:14 UTC (permalink / raw)
  To: Andrew Morton; +Cc: bcrl, linux-aio, linux-kernel

On Wed, Mar 05, 2003 at 02:42:54AM -0800, Andrew Morton wrote:
> Suparna Bhattacharya <suparna@in.ibm.com> wrote:
> >
> > +extern int FASTCALL(aio_wait_on_page_bit(struct page *page, int bit_nr));
> > +static inline int aio_wait_on_page_locked(struct page *page)
> 
> Oh boy.
> 
> There are soooo many more places where we can block:

Oh yes, there are lots (I'm simply shutting my eyes 
to them till we've conquered at least a few :))

What I'm trying to do, is to look at them one at a time, 
from the ones that have the greatest potential of benefits/
improvement based on impact and complexity.

Otherwise it'd be just too hard to get anywhere !

Actually that was one reason why I decided to post
this early. Want to catch the most important ones
and make sure we can deal with them at least.

Read is the easier case, which is why I started with
it. Will come back to write case sometime later after
I've played with a bit.

Even with read there is one more case besides your
list. The copy_to_user or fault_in_writable_pages can 
block too ;)  (that's what I'm running into) ..

However, the general idea is that any point of blocking 
where it is possible for things to just work if we return 
instead of waiting, and issue a retry that would
continue from the point where it left off, can be
handled within the basic framework. What we need to do
is to look at these on a case by case basis and see
if that is doable. My hunch is that for congestion and
throttling points it should be possible. And we already
have a lot of pipelining and restartability built
into the VFS now.  

Mostly we just need to be able to make sure the 
-EIOCBQUEUED returns can be propagated all the way up,
without breaking anything.

Its really the meta-data related waits that are a
black box for me, and wasn't planning on tackling yet ... 
More so as I guess it could mean getting into very 
filesystem specific territory so doing it consistently
may not be that easy.


> 
> - write() -> down(&inode->i_sem)
> 
> - read()/write() -> read indirect block -> wait_on_buffer()
> 
> - read()/write() -> read bitmap block -> wait_on_buffer()
> 
> - write() -> allocate block -> mark_buffer_dirty() ->
> 	balance_dirty_pages() -> writer throttling
> 
> - write() -> allocate block -> journal_get_write_access()
> 
> - read()/write() -> update_a/b/ctime() -> journal_get_write_access()
> 
> - ditto for other journalling filesystems
> 
> - read()/write() -> anything -> get_request_wait()
>   (This one can be avoided by polling the backing_dev_info congestion
>    flags)

I was thinking of a get_request_async_wait() that unplugs
the queue, and returns -EIOCBQUEUED after queueing the 
async waiter to just retry the operation. 

Of course this would work if the caller is able to push 
back -EIOCBQUEUED without breaking anything in a 
non-startable way.

> 
> - read()/write() -> page allocator -> blk_congestion_wait()
> 
> - write() -> balance_dirty_pages() -> writer throttling
> 
> - probably others.
>  
> Now, I assume that what you're looking for here is an 80% solution, but it
> seems that a lot more changes will be needed to get even that far.

If we don't bother about meta-data and indirect blocks 
just yet, wouldn't the gains we get otherwise not be
worth it ?

> 
> And given that a single kernel thread per spindle can easily keep that
> spindle saturated all the time, one does wonder "why try to do it this way at
> all"?

Need some more explanation on how/where you really break up
a generic_file_aio_read operation (with the page_cache_read, 
readpage, copy_to_user) on a per-spindle basis. Aren't we at
a higher level of abstraction compared to disk at this stage ?
I can visualize delegating all readpage calls to a worker
thread per-backing device or something like that (forgetting
LVM/RAID for a while), but what about the rest of the parts ?

Are you suggesting restructuring the generic_file_aio_read
code to separate out these stages, so it can identify where
it is and handoff itself accordingly to the right worker ?

Regards
Suparna

-- 
Suparna Bhattacharya (suparna@in.ibm.com)
Linux Technology Center
IBM Software Labs, India


^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [RFC][Patch] Retry based aio read for filesystems
  2003-03-05  9:17 [RFC][Patch] Retry based aio read for filesystems Suparna Bhattacharya
  2003-03-05  9:26 ` [Patch 1/2] Retry based aio read - core aio changes Suparna Bhattacharya
  2003-03-05  9:30 ` [Patch 2/2] Retry based aio read - filesystem read changes Suparna Bhattacharya
@ 2003-03-05 23:00 ` Janet Morgan
  2 siblings, 0 replies; 15+ messages in thread
From: Janet Morgan @ 2003-03-05 23:00 UTC (permalink / raw)
  To: suparna; +Cc: bcrl, akpm, linux-aio, linux-kernel, lse-tech

Suparna Bhattacharya wrote:

> For the last few days I've been playing with prototyping
> a particular flavour of a retry based implementation for
> filesystem aio read.

Hi Suparna,

I ran an aio-enabled version of fsx on your patches and no errors were
reported.  I plan on using gcov to determine the test coverage I
obtained.

I also did a quick performance test using wall time to compare sync and
async read operations with your patches applied.  For the sync case I
ran
10 processes in parallel, each reading 1GB in 1MB chunks from a
per-process
dedicated device.  I compared that to a single aio process iteratively
calling
io_submit/io_getevents for up to 10 iocbs/events where each iocb
specified a
1MB read to its dedicated device until 1GB was read.  Whew!

The result was that wall time for the sync and async testcases were
consistently identical, i.e., 1m30s:

# sync_test
start time:  Wed Mar  5 14:02:05 PST 2003
end time:    Wed Mar  5 14:03:35 PST 2003

# aio_test
start time:  Wed Mar  5 13:52:04 PST 2003
end time:    Wed Mar  5 13:53:34 PST 2003

syncio vmstat:
  procs                      memory      swap          io
system      cpu
 r  b  w   swpd   free   buff  cache   si   so    bi    bo   in    cs us
sy id
 4  6  2   3324   3888  11356 3668848   0    0 151296    0 2216  2395  0
100  0
 4  6  1   3324   3888  11368 3668820   0    0 151744    5 2216  2368  0
100  0
 1  9  1   3324   3952  11368 3668860   0    0 151209    1 2213  2356  0
100  0
 5  5  1   3324   3940  11344 3668864   0    2 148948    3 2215  2387  0
100  0
 1  9  1   3348   3936  11264 3668968   0    6 150767    7 2209  2345  0
100  0
 6  4  1   3480   3920  11192 3669364   0   33 151456   33 2218  2340  0
100  0
 4  6  2   3568   3896  11316 3669352   0   21 151887   21 2218  2385  0
100  0
 7  3  1   3704   3820  11364 3669428  31   34 148687   35 2222  2344
1  99  0

aio vmstat:
  procs                      memory      swap          io
system      cpu
 r  b  w   swpd   free   buff  cache   si   so    bi    bo   in    cs us
sy id
 2  0  1   4016   4040  11192 3669644   0   17 133152    25 2073   502
0 40 60
 1  0  1   4016   4104  11196 3669716   0    0 132288     1 2073   537
0 40 60
 2  0  2   4016   4104  11196 3669764   0    0 132416     0 2067   511
0 40 60
 1  0  1   4016   4104  11200 3669788   0    0 133088     1 2075   523
0 41 59
 1  0  1   4016   5576  11200 3668240   0    0 132384     0 2066   526
0 40 60
 1  0  1   4036   4092  11200 3669756   0    5 135116     5 2094   492
0 46 54
 1  0  1   4180   4016  11192 3669944   0   36 135968    40 2111   499
0 46 54
 2  0  2   4180   4060  11176 3669832   0    0 137152     0 2119   463
1 46 53
 1  0  1   4180   7060  11180 3666980   0    0 136384     2 2107   498
0 44 56


-Janet


^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [Patch 1/2] Retry based aio read - core aio changes
  2003-03-05  9:26 ` [Patch 1/2] Retry based aio read - core aio changes Suparna Bhattacharya
@ 2003-03-14 13:23   ` Suparna Bhattacharya
  0 siblings, 0 replies; 15+ messages in thread
From: Suparna Bhattacharya @ 2003-03-14 13:23 UTC (permalink / raw)
  To: bcrl, akpm; +Cc: linux-aio, linux-kernel, lse-tech

On Wed, Mar 05, 2003 at 02:56:33PM +0530, Suparna Bhattacharya wrote:
> On Wed, Mar 05, 2003 at 02:47:54PM +0530, Suparna Bhattacharya wrote:
> > For the last few days I've been playing with prototyping 
> > a particular flavour of a retry based implementation for 
> > filesystem aio read.
> 
>  
> # aioretry.patch : Core aio infrastructure modifications
> 		   for high-level retry based aio

Ben pointed out that at least, we shouldn't be duplicating 
the switch on every retry. So here's another take which 
cleans things up a bit as well. 

It is still very high-level (would like to experiment with 
it a little further to find out how it behaves/performs in 
practice).

(I've checked that that patch applies to 2.5.64-bk8)

Earlier today I posted separate patches for the 
fixes listed below, so they are no longer part of this 
patch.

Comments/feedback welcome.

> 
> Includes couple of fixes in the aio code
> -The kiocbClearXXX were doing a set_bit instead of 
>  clear_bit
> -Sync iocbs were not woken up when iocb->ki_users = 1
>  (dio takes a different path for sync and async iocbs,
>  so maybe that's why we weren't seeing the problem yet)
> 

Regards
Suparna


diff -ur linux-2.5.62/fs/aio.c linux-2.5.62-aio/fs/aio.c
--- linux-2.5.62/fs/aio.c	Tue Feb 18 04:26:14 2003
+++ linux-2.5.62-aio/fs/aio.c	Tue Mar 11 21:07:35 2003
@@ -395,6 +396,7 @@
 	req->ki_cancel = NULL;
 	req->ki_retry = NULL;
 	req->ki_user_obj = NULL;
+	INIT_LIST_HEAD(&req->ki_run_list);
 
 	/* Check if the completion queue has enough free space to
 	 * accept an event from this io.
@@ -558,46 +560,121 @@
 	enter_lazy_tlb(mm, current, smp_processor_id());
 }
 
-/* Run on kevent's context.  FIXME: needs to be per-cpu and warn if an
- * operation blocks.
- */
-static void aio_kick_handler(void *data)
+static inline int __queue_kicked_iocb(struct kiocb *iocb)
 {
-	struct kioctx *ctx = data;
+	struct kioctx	*ctx = iocb->ki_ctx;
 
-	use_mm(ctx->mm);
+	if (list_empty(&iocb->ki_run_list)) {
+		list_add_tail(&iocb->ki_run_list, 
+			&ctx->run_list);
+		return 1;
+	}
+	return 0;
+}
 
-	spin_lock_irq(&ctx->ctx_lock);
-	while (!list_empty(&ctx->run_list)) {
-		struct kiocb *iocb;
-		long ret;
+/* Expects to be called with iocb->ki_ctx->lock held */
+static ssize_t aio_run_iocb(struct kiocb *iocb)
+{
+	struct kioctx	*ctx = iocb->ki_ctx;
+	ssize_t (*retry)(struct kiocb *);
+	ssize_t ret;
 
-		iocb = list_entry(ctx->run_list.next, struct kiocb,
-				  ki_run_list);
-		list_del(&iocb->ki_run_list);
-		iocb->ki_users ++;
-		spin_unlock_irq(&ctx->ctx_lock);
+	if (iocb->ki_retried++ > 1024*1024) {
+		printk("Maximal retry count. Bytes done %d\n",
+			iocb->ki_nbytes - iocb->ki_left);
+		return -EAGAIN;
+	}
+
+	if (!(iocb->ki_retried & 0xff)) {
+		printk("%ld aio retries completed %d bytes of %d\n",
+			iocb->ki_retried, 
+			iocb->ki_nbytes - iocb->ki_left, iocb->ki_nbytes);
+	}
+
+	if (!(retry = iocb->ki_retry))
+		return 0;
+
+	iocb->ki_users ++;
+	kiocbClearKicked(iocb);
+	iocb->ki_retry = NULL;	
+	spin_unlock_irq(&ctx->ctx_lock);
+	
+	BUG_ON(current->iocb != NULL);
+	
+	current->iocb = iocb;
+	ret = retry(iocb);
+	current->iocb = NULL;
 
-		kiocbClearKicked(iocb);
-		ret = iocb->ki_retry(iocb);
-		if (-EIOCBQUEUED != ret) {
+	if (-EIOCBQUEUED != ret) {
+		if (list_empty(&iocb->ki_wait.task_list)) 
 			aio_complete(iocb, ret, 0);
-			iocb = NULL;
-		}
+		else
+			printk("can't delete iocb in use\n");
+	} else {
+		if (list_empty(&iocb->ki_wait.task_list)) 
+			kiocbSetKicked(iocb);
+	}
+	spin_lock_irq(&ctx->ctx_lock);
 
-		spin_lock_irq(&ctx->ctx_lock);
-		if (NULL != iocb)
-			__aio_put_req(ctx, iocb);
+	iocb->ki_retry = retry;
+	INIT_LIST_HEAD(&iocb->ki_run_list);
+	if (kiocbIsKicked(iocb)) {
+		BUG_ON(ret != -EIOCBQUEUED);
+		__queue_kicked_iocb(iocb);
+	} 
+	__aio_put_req(ctx, iocb);
+	return ret;
+}
+
+static void aio_run_iocbs(struct kioctx *ctx)
+{
+	struct kiocb *iocb;
+	ssize_t ret;
+
+	spin_lock_irq(&ctx->ctx_lock);
+	while (!list_empty(&ctx->run_list)) {
+		iocb = list_entry(ctx->run_list.next, struct kiocb,
+			ki_run_list);
+		list_del(&iocb->ki_run_list);
+		ret = aio_run_iocb(iocb);
 	}
 	spin_unlock_irq(&ctx->ctx_lock);
+}
+
+/* Run on aiod/kevent's context.  FIXME: needs to be per-cpu and warn if an
+ * operation blocks.
+ */
+static void aio_kick_handler(void *data)
+{
+	struct kioctx *ctx = data;
 
+	use_mm(ctx->mm);
+	aio_run_iocbs(ctx);
 	unuse_mm(ctx->mm);
 }
 
-void kick_iocb(struct kiocb *iocb)
+
+void queue_kicked_iocb(struct kiocb *iocb)
 {
 	struct kioctx	*ctx = iocb->ki_ctx;
+	unsigned long flags;
+	int run = 0;
+
+	WARN_ON((!list_empty(&iocb->ki_wait.task_list)));
+
+	spin_lock_irqsave(&ctx->ctx_lock, flags);
+	run = __queue_kicked_iocb(iocb);
+	spin_unlock_irqrestore(&ctx->ctx_lock, flags);
+	if (run) {
+		if (waitqueue_active(&ctx->wait))
+			wake_up(&ctx->wait);
+		else
+			queue_work(aio_wq, &ctx->wq);
+	}
+}
 
+void kick_iocb(struct kiocb *iocb)
+{
 	/* sync iocbs are easy: they can only ever be executing from a 
 	 * single context. */
 	if (is_sync_kiocb(iocb)) {
@@ -607,11 +684,9 @@
 	}
 
 	if (!kiocbTryKick(iocb)) {
-		unsigned long flags;
-		spin_lock_irqsave(&ctx->ctx_lock, flags);
-		list_add_tail(&iocb->ki_run_list, &ctx->run_list);
-		spin_unlock_irqrestore(&ctx->ctx_lock, flags);
-		schedule_work(&ctx->wq);
+		queue_kicked_iocb(iocb);
+	} else {
+		pr_debug("iocb already kicked or in progress\n");
 	}
 }
 
@@ -664,6 +739,9 @@
 	 */
 	spin_lock_irqsave(&ctx->ctx_lock, flags);
 
+	if (!list_empty(&iocb->ki_run_list))
+		list_del_init(&iocb->ki_run_list);
+
 	ring = kmap_atomic(info->ring_pages[0], KM_IRQ1);
 
 	tail = info->tail;
@@ -865,6 +943,8 @@
 			ret = 0;
 			if (to.timed_out)	/* Only check after read evt */
 				break;
+			/* accelerate kicked iocbs for this ctx */	
+			aio_run_iocbs(ctx);
 			schedule();
 			if (signal_pending(tsk)) {
 				ret = -EINTR;
@@ -984,6 +1064,149 @@
 	return -EINVAL;
 }
 
+ssize_t aio_pread(struct kiocb *iocb)
+{
+	struct file *file = iocb->ki_filp;
+	ssize_t ret = 0;
+
+	ret = file->f_op->aio_read(iocb, iocb->ki_buf,
+		iocb->ki_left, iocb->ki_pos);
+
+	pr_debug("aio_pread: fop ret %d\n", ret);
+
+	/*
+	 * Can't just depend on iocb->ki_left to determine 
+	 * whether we are done. This may have been a short read.
+	 */
+	if (ret > 0) {
+		iocb->ki_buf += ret;
+		iocb->ki_left -= ret;
+
+		ret = -EIOCBQUEUED;
+	}
+
+	/* This means we must have transferred all that we could */
+	/* No need to retry anymore */
+	if (ret == 0) 
+		ret = iocb->ki_nbytes - iocb->ki_left;
+
+	return ret;
+}
+
+ssize_t aio_pwrite(struct kiocb *iocb)
+{
+	struct file *file = iocb->ki_filp;
+	ssize_t ret = 0;
+
+	ret = file->f_op->aio_write(iocb, iocb->ki_buf,
+		iocb->ki_left, iocb->ki_pos);
+
+	pr_debug("aio_pread: fop ret %d\n", ret);
+
+	/* 
+	 * TBD: Even if iocb->ki_left = 0, could we need to 
+	 * wait for data to be sync'd ? Or can we assume
+	 * that aio_fdsync/aio_fsync would be called explicitly
+	 * as required.
+	 */
+	if (ret > 0) {
+		iocb->ki_buf += ret;
+		iocb->ki_left -= ret;
+
+		ret = -EIOCBQUEUED;
+	}
+
+	/* This means we must have transferred all that we could */
+	/* No need to retry anymore */
+	if (ret == 0) 
+		ret = iocb->ki_nbytes - iocb->ki_left;
+
+	return ret;
+}
+
+ssize_t aio_fdsync(struct kiocb *iocb)
+{
+	struct file *file = iocb->ki_filp;
+	ssize_t ret = -EINVAL;
+
+	if (file->f_op->aio_fsync)
+		ret = file->f_op->aio_fsync(iocb, 1);
+	return ret;
+}
+	
+ssize_t aio_fsync(struct kiocb *iocb)
+{
+	struct file *file = iocb->ki_filp;
+	ssize_t ret = -EINVAL;
+
+	if (file->f_op->aio_fsync)
+		ret = file->f_op->aio_fsync(iocb, 0);
+	return ret;
+}
+	
+/* Called during initial submission and subsequent retry operations */
+ssize_t aio_setup_iocb(struct kiocb *iocb)
+{
+	struct file *file = iocb->ki_filp;
+	ssize_t ret = 0;
+	
+	switch (iocb->ki_opcode) {
+	case IOCB_CMD_PREAD:
+		ret = -EBADF;
+		if (unlikely(!(file->f_mode & FMODE_READ)))
+			break;
+		ret = -EFAULT;
+		if (unlikely(!access_ok(VERIFY_WRITE, iocb->ki_buf, 
+			iocb->ki_left)))
+			break;
+		ret = -EINVAL;
+		if (file->f_op->aio_read)
+			iocb->ki_retry = aio_pread;
+		break;
+	case IOCB_CMD_PWRITE:
+		ret = -EBADF;
+		if (unlikely(!(file->f_mode & FMODE_WRITE)))
+			break;
+		ret = -EFAULT;
+		if (unlikely(!access_ok(VERIFY_READ, iocb->ki_buf, 
+			iocb->ki_left)))
+			break;
+		ret = -EINVAL;
+		if (file->f_op->aio_write)
+			iocb->ki_retry = aio_pwrite;
+		break;
+	case IOCB_CMD_FDSYNC:
+		ret = -EINVAL;
+		if (file->f_op->aio_fsync)
+			iocb->ki_retry = aio_fdsync;
+		break;
+	case IOCB_CMD_FSYNC:
+		ret = -EINVAL;
+		if (file->f_op->aio_fsync)
+			iocb->ki_retry = aio_fsync;
+		break;
+	default:
+		dprintk("EINVAL: io_submit: no operation provided\n");
+		ret = -EINVAL;
+	}
+
+	if (!iocb->ki_retry)
+		return ret;
+
+	pr_debug("ki_pos = %llu\n", iocb->ki_pos);
+
+	return 0;
+}
+
+int aio_wake_function(wait_queue_t *wait, unsigned mode, int sync)
+{
+	struct kiocb *iocb = container_of(wait, struct kiocb, ki_wait);
+
+	list_del_init(&wait->task_list);
+	kick_iocb(iocb);
+	return 1;
+}
+
 static int FASTCALL(io_submit_one(struct kioctx *ctx, struct iocb *user_iocb,
 				  struct iocb *iocb));
 static int io_submit_one(struct kioctx *ctx, struct iocb *user_iocb,
@@ -992,7 +1215,6 @@
 	struct kiocb *req;
 	struct file *file;
 	ssize_t ret;
-	char *buf;
 
 	/* enforce forwards compatibility on users */
 	if (unlikely(iocb->aio_reserved1 || iocb->aio_reserved2 ||
@@ -1033,51 +1255,27 @@
 	req->ki_user_data = iocb->aio_data;
 	req->ki_pos = iocb->aio_offset;
 
-	buf = (char *)(unsigned long)iocb->aio_buf;
+	req->ki_buf = (char *)(unsigned long)iocb->aio_buf;
+	req->ki_left = req->ki_nbytes = iocb->aio_nbytes;
+	req->ki_opcode = iocb->aio_lio_opcode;
+	init_waitqueue_func_entry(&req->ki_wait, aio_wake_function);
+	INIT_LIST_HEAD(&req->ki_wait.task_list);
+	req->ki_run_list.next = req->ki_run_list.prev = NULL;
+	req->ki_retry = NULL;
+	req->ki_retried = 0;
 
-	switch (iocb->aio_lio_opcode) {
-	case IOCB_CMD_PREAD:
-		ret = -EBADF;
-		if (unlikely(!(file->f_mode & FMODE_READ)))
-			goto out_put_req;
-		ret = -EFAULT;
-		if (unlikely(!access_ok(VERIFY_WRITE, buf, iocb->aio_nbytes)))
-			goto out_put_req;
-		ret = -EINVAL;
-		if (file->f_op->aio_read)
-			ret = file->f_op->aio_read(req, buf,
-					iocb->aio_nbytes, req->ki_pos);
-		break;
-	case IOCB_CMD_PWRITE:
-		ret = -EBADF;
-		if (unlikely(!(file->f_mode & FMODE_WRITE)))
-			goto out_put_req;
-		ret = -EFAULT;
-		if (unlikely(!access_ok(VERIFY_READ, buf, iocb->aio_nbytes)))
-			goto out_put_req;
-		ret = -EINVAL;
-		if (file->f_op->aio_write)
-			ret = file->f_op->aio_write(req, buf,
-					iocb->aio_nbytes, req->ki_pos);
-		break;
-	case IOCB_CMD_FDSYNC:
-		ret = -EINVAL;
-		if (file->f_op->aio_fsync)
-			ret = file->f_op->aio_fsync(req, 1);
-		break;
-	case IOCB_CMD_FSYNC:
-		ret = -EINVAL;
-		if (file->f_op->aio_fsync)
-			ret = file->f_op->aio_fsync(req, 0);
-		break;
-	default:
-		dprintk("EINVAL: io_submit: no operation provided\n");
-		ret = -EINVAL;
-	}
+	ret = aio_setup_iocb(req);
+
+	if ((-EBADF == ret) || (-EFAULT == ret))
+		goto out_put_req;
+
+	spin_lock_irq(&ctx->ctx_lock);
+	ret = aio_run_iocb(req);
+	spin_unlock_irq(&ctx->ctx_lock);
+
+	if (-EIOCBQUEUED == ret)
+		queue_work(aio_wq, &ctx->wq);
 
-	if (likely(-EIOCBQUEUED == ret))
-		return 0;
-	aio_complete(req, ret, 0);
 	return 0;
 
 out_put_req:
diff -ur linux-2.5.62/include/linux/aio.h linux-2.5.62-aio/include/linux/aio.h
--- linux-2.5.62/include/linux/aio.h	Tue Feb 18 04:25:50 2003
+++ linux-2.5.62-aio/include/linux/aio.h	Tue Mar 11 21:31:22 2003
@@ -54,7 +54,7 @@
 	struct file		*ki_filp;
 	struct kioctx		*ki_ctx;	/* may be NULL for sync ops */
 	int			(*ki_cancel)(struct kiocb *, struct io_event *);
-	long			(*ki_retry)(struct kiocb *);
+	ssize_t			(*ki_retry)(struct kiocb *);
 
 	struct list_head	ki_list;	/* the aio core uses this
 						 * for cancellation */
@@ -62,6 +62,14 @@
 	void			*ki_user_obj;	/* pointer to userland's iocb */
 	__u64			ki_user_data;	/* user's data for completion */
 	loff_t			ki_pos;
+	
+	/* State that we remember to be able to restart/retry  */
+	unsigned short		ki_opcode;
+	size_t			ki_nbytes; 	/* copy of iocb->aio_nbytes */
+	char 			*ki_buf;	/* remaining iocb->aio_buf */
+	size_t			ki_left; 	/* remaining bytes */
+	wait_queue_t		ki_wait;
+	long			ki_retried; 	/* just for testing */
 
 	char			private[KIOCB_PRIVATE_SIZE];
 };
@@ -77,6 +85,8 @@
 		(x)->ki_ctx = &tsk->active_mm->default_kioctx;	\
 		(x)->ki_cancel = NULL;			\
 		(x)->ki_user_obj = tsk;			\
+		(x)->ki_user_data = 0;			\
+		init_wait((&(x)->ki_wait));		\
 	} while (0)
 
 #define AIO_RING_MAGIC			0xa10a10a1
diff -ur linux-2.5.62/include/linux/init_task.h linux-2.5.62-aio/include/linux/init_task.h
--- linux-2.5.62/include/linux/init_task.h	Tue Feb 18 04:25:53 2003
+++ linux-2.5.62-aio/include/linux/init_task.h	Thu Feb 27 19:01:39 2003
@@ -101,6 +101,7 @@
 	.alloc_lock	= SPIN_LOCK_UNLOCKED,				\
 	.switch_lock	= SPIN_LOCK_UNLOCKED,				\
 	.journal_info	= NULL,						\
+	.iocb		= NULL,						\
 }
 
 
diff -ur linux-2.5.62/include/linux/sched.h linux-2.5.62-aio/include/linux/sched.h
--- linux-2.5.62/include/linux/sched.h	Tue Feb 18 04:25:53 2003
+++ linux-2.5.62-aio/include/linux/sched.h	Thu Feb 27 19:01:39 2003
@@ -418,6 +418,8 @@
 
 	unsigned long ptrace_message;
 	siginfo_t *last_siginfo; /* For ptrace use.  */
+/* current aio handle */
+	struct kiocb *iocb;
 };
 
 extern void __put_task_struct(struct task_struct *tsk);
diff -ur linux-2.5.62/kernel/fork.c linux-2.5.62-aio/kernel/fork.c
--- linux-2.5.62/kernel/fork.c	Tue Feb 18 04:25:53 2003
+++ linux-2.5.62-aio/kernel/fork.c	Tue Mar  4 14:58:44 2003
@@ -834,6 +838,7 @@
 	p->lock_depth = -1;		/* -1 = no lock */
 	p->start_time = get_jiffies_64();
 	p->security = NULL;
+	p->iocb = NULL;
 
 	retval = -ENOMEM;
 	if (security_task_alloc(p))

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [Patch 2/2] Retry based aio read - filesystem read changes
  2003-03-05 12:14     ` Suparna Bhattacharya
@ 2003-03-31 18:32       ` Janet Morgan
  2003-03-31 19:11         ` William Lee Irwin III
  0 siblings, 1 reply; 15+ messages in thread
From: Janet Morgan @ 2003-03-31 18:32 UTC (permalink / raw)
  To: akpm; +Cc: suparna, bcrl, linux-aio, linux-kernel

> On Wed, Mar 05, 2003 at 02:42:54AM -0800, Andrew Morton wrote:
>  > Suparna Bhattacharya <suparna@in.ibm.com> wrote:
>  >
>  > +extern int FASTCALL(aio_wait_on_page_bit(struct page *page, int bit_nr));
>  > +static inline int aio_wait_on_page_locked(struct page *page)
>
>  Oh boy.
>
>  There are soooo many more places where we can block:
>
>  - write() -> down(&inode->i_sem)
>
>  - read()/write() -> read indirect block -> wait_on_buffer()
>
>  - read()/write() -> read bitmap block -> wait_on_buffer()
>
>  - write() -> allocate block -> mark_buffer_dirty() ->
>     balance_dirty_pages() -> writer throttling
>
>  - write() -> allocate block -> journal_get_write_access()
>
>  - read()/write() -> update_a/b/ctime() -> journal_get_write_access()
>
>  - ditto for other journalling filesystems
>
>  - read()/write() -> anything -> get_request_wait()
>    (This one can be avoided by polling the backing_dev_info congestion
>     flags)
>
>  - read()/write() -> page allocator -> blk_congestion_wait()
>
>  - write() -> balance_dirty_pages() -> writer throttling
>
>  - probably others.
>
>  Now, I assume that what you're looking for here is an 80% solution, but it
>  seems that a lot more changes will be needed to get even that far.

I'm trying to identify significant blocking points in the filesystem read
path.  I'm not sure what the best approach is, so figured I'd just track
callers of schedule() under a heavy dd workload.

I collected data while running 1000 processes, where each process was
using dd to sequentially read from a 1GB file.  I used 10 target files
in all, so 100 processes read from file1, 100 processes read from file2,
etc.  All these numbers were pretty much arbitrary.  I used stock 2.5.62
to test.

The top 3 callers of schedule based on my dd read workload are listed
below.  Together they accounted for 92% of all calls to schedule.
Suparna's filesystem aio patch already modifies 2 of these 3 callpoints
to be "retryable".  So the remaining candidate is the call to cond_resched
from do_generic_mapping_read.  Question is whether this qualifies as the
sort of thing that should cause a retry, i.e., is cond_resched a kind of
voluntary yield/preemption point which may not even result in a context
switch if there is nothing more preferable to run?   And even if the call to
cond_resched is not cause for retry, the profile data seems to indicate that
Suparna's patch is roughly an 80% solution (unless I'm missing something here,
which is entirely possible ;-).

So here are the call statistics:

  70% of all calls to schedule were from __lock_page:
        Based on the profile data, almost all calls to _lock_page were
        from do_generic_mapping_read (see filemap.c/Line 101 below).
        Suparna's patch already retries here.

  15% of all calls to schedule were from __cond_resched:
        do_generic_mapping_read -> cond_resched -> __cond_resched
        (see filemap.c/Line 41 below).
        Should this be made retryable ???

   7% of all calls to schedule were from wait_on_page_bit:
        Based on the profile data, almost all calls to wait_on_page_bit were
        from do_generic_mapping_read->wait_on_page_locked -> wait_on_page_bit
        (see filemap.c/Line 123 below).  Suparna's patch covers this
        callpoint, too.


Here's some detail....

Callers of schedule()

     7768210  Total calls to schedule
   ------------
     5440195  __lock_page+176
     1143862  do_generic_mapping_read+d6
      569139  wait_on_page_bit+17e
      488273  work_resched+5
       43414  __wait_on_buffer+14a
       23594  blk_congestion_wait+99
       21171  worker_thread+14f
        9838  cpu_idle+6d
        6308  do_select+29b
        5178  schedule_timeout+b8
        4136  sys_sched_yield+d9
        2096  kswapd+122
        1796  sleep_on+55
        1767  sys_nanosleep+f8
        1655  do_exit+4eb
        1643  wait_for_completion+d2
        1365  interruptible_sleep_on+55
        1034  pipe_wait+98
        1004  balanced_irq+6d
         975  ksoftirqd+108
         971  journal_stop+107
         884  sys_wait4+2ff
         780
         658  unix_wait_for_peer+d6
         576  read_chan+52c
         580  __pdflush+d2
         105  do_poll+ec
         102  __down+9e
          71  tty_wait_until_sent+f2
          59  ksoftirqd+b5
          55  __down_interruptible+d8
          41
          16  migration_thread+cc
          10  uart_wait_until_sent+d6
           9  unix_stream_data_wait+109
           8  tcp_data_wait+16c
           4  wait_for_packet+140
           3  wait_til_done+110
           2  write_chan+280
           2  generic_file_aio_write_nolock+bf4
           2  journal_commit_transaction+97a
           1  usb_hub_thread+c0
           1  jfs_lazycommit+1c6
           1  serio_thread+e0
           1  jfs_sync+259
           1  jfsIOWait+15d
           1  _lock_fdc+12f
           1  acpi_ex_system_do_suspend+3d
           1  init_pcmcia_ds+25e


readprofile:

10058989 total                                  2.9520
5982060 __copy_to_user_ll                       41542.0833
928000 do_generic_mapping_read                  698.7952
447968 poll_idle                                4666.3333
301263 file_read_actor                          1176.8086
265660 radix_tree_lookup                        1844.8611
209991 page_cache_readahead                     410.1387
129655 vfs_read                                 311.6707
117666 kmap_atomic                              919.2656
111874 fget                                     1165.3542
110996 __generic_file_aio_read                  182.5592
109402 kunmap_atomic                            3418.8125
 95092 vfs_write                                228.5865
 76119 mark_page_accessed                       679.6339
 69564 current_kernel_time                      724.6250
 52569 fput                                     1095.1875
 51824 update_atime                             231.3571
 51375 activate_page                            267.5781
 46577 scsi_request_fn                           61.9375
 41731 generic_file_read                        237.1080
 39300 shrink_cache                              38.9881
 37658 schedule                                  26.1514
 37501 refill_inactive_zone                      22.3220
 37325 scsi_dispatch_cmd                         70.6913
 32762 unlock_page                              292.5179
 31924 buffered_rmqueue                          86.7500
 30454 __make_request                            21.3862
 28532 do_page_cache_readahead                   54.0379
 27783 delay_tsc                                434.1094
 27723 ext2_get_branch                           69.3075
 25726 shrink_list                               13.1793
 25246 __find_get_block                          63.1150
 24443 mpage_end_io_read                        169.7431
 23800 add_to_page_cache                         87.5000
 20669 do_softirq                                71.7674
 19895 system_call                              452.1591
 18906 do_mpage_readpage                         14.4101
 18783 page_referenced                           73.3711
 18137 free_hot_cold_page                        59.6612
 18015 __wake_up                                225.1875
 13790 radix_tree_insert                         53.8672
 11227 radix_tree_delete                         31.8949
 11188 __alloc_pages                             11.4631
 10885 __brelse                                 136.0625
 10699 write_null                               668.6875
 10299 __pagevec_lru_add                         35.7604
 10100 page_address                              42.0833
  9968 ext2_get_block                             7.5976


2.5.62 filemap.c/do_generic_mapping_read:
     1  /*
     2   * This is a generic file read routine, and uses the
     3   * inode->i_op->readpage() function for the actual low-level
     4   * stuff.
     5   *
     6   * This is really ugly. But the goto's actually try to clarify some
     7   * of the logic when it comes to error handling etc.
     8   * - note the struct file * is only passed for the use of readpage
     9   */
    10  void do_generic_mapping_read(struct address_space *mapping,
    11                               struct file_ra_state *ra,
    12                               struct file * filp,
    13                               loff_t *ppos,
    14                               read_descriptor_t * desc,
    15                               read_actor_t actor)
    16  {
    17          struct inode *inode = mapping->host;
    18          unsigned long index, offset;
    19          struct page *cached_page;
    20          int error;
    21
    22          cached_page = NULL;
    23          index = *ppos >> PAGE_CACHE_SHIFT;
    24          offset = *ppos & ~PAGE_CACHE_MASK;
    25
    26          for (;;) {
    27                  struct page *page;
    28                  unsigned long end_index, nr, ret;
    29
    30                  end_index = inode->i_size >> PAGE_CACHE_SHIFT;
    31
    32                  if (index > end_index)
    33                          break;
    34                  nr = PAGE_CACHE_SIZE;
    35                  if (index == end_index) {
    36                          nr = inode->i_size & ~PAGE_CACHE_MASK;
    37                          if (nr <= offset)
    38                                  break;
    39                  }
    40
    41                  cond_resched();
    42                  page_cache_readahead(mapping, ra, filp, index);
    43
    44                  nr = nr - offset;
    45
    46                  /*
    47                   * Try to find the data in the page cache..
    48                   */
    49  find_page:
    50                  read_lock(&mapping->page_lock);
    51                  page = radix_tree_lookup(&mapping->page_tree, index);
    52                  if (!page) {
    53                          read_unlock(&mapping->page_lock);
    54                          handle_ra_miss(mapping,ra);
    55                          goto no_cached_page;
    56                  }
    57                  page_cache_get(page);
    58                  read_unlock(&mapping->page_lock);
    59
    60                  if (!PageUptodate(page))
    61                          goto page_not_up_to_date;
    62  page_ok:
    63                  /* If users can be writing to this page using arbitrary
    64                   * virtual addresses, take care about potential aliasing
    65                   * before reading the page on the kernel side.
    66                   */
    67                  if (!list_empty(&mapping->i_mmap_shared))
    68                          flush_dcache_page(page);
    69
    70                  /*
    71                   * Mark the page accessed if we read the beginning.
    72                   */
    73                  if (!offset)
    74                          mark_page_accessed(page);
    75
    76                  /*
    77                   * Ok, we have the page, and it's up-to-date, so
    78                   * now we can copy it to user space...
    79                   *
    80                   * The actor routine returns how many bytes were actually
used..
    81                   * NOTE! This may not be the same as how much of a user
buffer
    82                   * we filled up (we may be padding etc), so we can only
update
    83                   * "pos" here (the actor routine has to update the user
buffer
    84                   * pointers and the remaining count).
    85                   */  86                  ret = actor(desc, page, offset,
nr);
    87                  offset += ret;
    88                  index += offset >> PAGE_CACHE_SHIFT;
    89                  offset &= ~PAGE_CACHE_MASK;
    90
    91                  page_cache_release(page);
    92                  if (ret == nr && desc->count)
    93                          continue;
    94                  break;
    95
    96  page_not_up_to_date:
    97                  if (PageUptodate(page))
    98                          goto page_ok;
    99
   100                  /* Get exclusive access to the page ... */
   101                  lock_page(page);
   102
   103                  /* Did it get unhashed before we got the lock? */
   104                  if (!page->mapping) {
   105                          unlock_page(page);
   106                          page_cache_release(page);
   107                          continue;
   108                  }
   109
   110                  /* Did somebody else fill it already? */
   111                  if (PageUptodate(page)) {
   112                          unlock_page(page);
   113                          goto page_ok;
   114                  }
   115
   116  readpage:
   117                  /* ... and start the actual read. The read will unlock the
page. */
   118                  error = mapping->a_ops->readpage(filp, page);
   119
   120                  if (!error) {
   121                          if (PageUptodate(page))
   122                                  goto page_ok;
   123                          wait_on_page_locked(page);
   124                          if (PageUptodate(page))
   125                                  goto page_ok;
   126                          error = -EIO;
   127                  }
   128
   129                  /* UHHUH! A synchronous read error occurred. Report it */
   130                  desc->error = error;
   131                  page_cache_release(page);
   132                  break;
   133
   134  no_cached_page:
   135                  /*
   136                   * Ok, it wasn't cached, so we need to create a new
   137                   * page..
   138                   */
   139                  if (!cached_page) {
   140                          cached_page = page_cache_alloc_cold(mapping);
   141                          if (!cached_page) {
   142                                  desc->error = -ENOMEM;
   143                                  break;
   144                          }
   145                  }
   146                  error = add_to_page_cache_lru(cached_page, mapping,
   147                                                  index, GFP_KERNEL);
   148                  if (error) {
   149                          if (error == -EEXIST)
   150                                  goto find_page;
   151                          desc->error = error;
   152                          break;
   153                  }
   154                  page = cached_page;
   155                  cached_page = NULL;
   156                  goto readpage;
   157          }
   158
   159          *ppos = ((loff_t) index << PAGE_CACHE_SHIFT) + offset;
   160          if (cached_page)
   161                  page_cache_release(cached_page);
   162          UPDATE_ATIME(inode);
   163  }








^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [Patch 2/2] Retry based aio read - filesystem read changes
  2003-03-31 19:16           ` Benjamin LaHaise
@ 2003-03-31 19:07             ` Janet Morgan
  2003-04-01 20:24               ` Benjamin LaHaise
  2003-03-31 19:17             ` William Lee Irwin III
  2003-04-07  3:51             ` Suparna Bhattacharya
  2 siblings, 1 reply; 15+ messages in thread
From: Janet Morgan @ 2003-03-31 19:07 UTC (permalink / raw)
  To: Benjamin LaHaise
  Cc: William Lee Irwin III, akpm, suparna, linux-aio, linux-kernel

Benjamin LaHaise wrote:

> On Mon, Mar 31, 2003 at 11:11:23AM -0800, William Lee Irwin III wrote:
> > Can you tell whether these are due to hash collisions or contention on
> > the same page?
>
> No, they're most likely waiting for io to complete.
>
> To clean this up I've got a patch to move from aio_read/write with all the
> parameters to a single parameter based rw-specific iocb.  That makes the
> retry for read and write more ameniable to sharing common logic akin to the
> wtd_ ops, which we need at the very least for the semaphore operations.
>
>                 -ben
>

Can you post the patch you're referring to?

Thanks,
-Janet



^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [Patch 2/2] Retry based aio read - filesystem read changes
  2003-03-31 18:32       ` Janet Morgan
@ 2003-03-31 19:11         ` William Lee Irwin III
  2003-03-31 19:16           ` Benjamin LaHaise
  0 siblings, 1 reply; 15+ messages in thread
From: William Lee Irwin III @ 2003-03-31 19:11 UTC (permalink / raw)
  To: Janet Morgan; +Cc: akpm, suparna, bcrl, linux-aio, linux-kernel

On Mon, Mar 31, 2003 at 10:32:20AM -0800, Janet Morgan wrote:
>   70% of all calls to schedule were from __lock_page:
>         Based on the profile data, almost all calls to _lock_page were
>         from do_generic_mapping_read (see filemap.c/Line 101 below).
>         Suparna's patch already retries here.

Can you tell whether these are due to hash collisions or contention on
the same page?

If they're due to hash collisions, things could easily be done to help
(though they wouldn't guarantee not sleeping entirely they'd be good
for general performance).


-- wli

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [Patch 2/2] Retry based aio read - filesystem read changes
  2003-03-31 19:11         ` William Lee Irwin III
@ 2003-03-31 19:16           ` Benjamin LaHaise
  2003-03-31 19:07             ` Janet Morgan
                               ` (2 more replies)
  0 siblings, 3 replies; 15+ messages in thread
From: Benjamin LaHaise @ 2003-03-31 19:16 UTC (permalink / raw)
  To: William Lee Irwin III, Janet Morgan, akpm, suparna, linux-aio,
	linux-kernel

On Mon, Mar 31, 2003 at 11:11:23AM -0800, William Lee Irwin III wrote:
> Can you tell whether these are due to hash collisions or contention on
> the same page?

No, they're most likely waiting for io to complete.

To clean this up I've got a patch to move from aio_read/write with all the 
parameters to a single parameter based rw-specific iocb.  That makes the 
retry for read and write more ameniable to sharing common logic akin to the 
wtd_ ops, which we need at the very least for the semaphore operations.

		-ben
-- 
Junk email?  <a href="mailto:aart@kvack.org">aart@kvack.org</a>

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [Patch 2/2] Retry based aio read - filesystem read changes
  2003-03-31 19:16           ` Benjamin LaHaise
  2003-03-31 19:07             ` Janet Morgan
@ 2003-03-31 19:17             ` William Lee Irwin III
  2003-03-31 19:25               ` Benjamin LaHaise
  2003-04-07  3:51             ` Suparna Bhattacharya
  2 siblings, 1 reply; 15+ messages in thread
From: William Lee Irwin III @ 2003-03-31 19:17 UTC (permalink / raw)
  To: Benjamin LaHaise; +Cc: Janet Morgan, akpm, suparna, linux-aio, linux-kernel

On Mon, Mar 31, 2003 at 11:11:23AM -0800, William Lee Irwin III wrote:
>> Can you tell whether these are due to hash collisions or contention on
>> the same page?

On Mon, Mar 31, 2003 at 02:16:29PM -0500, Benjamin LaHaise wrote:
> No, they're most likely waiting for io to complete.
> To clean this up I've got a patch to move from aio_read/write with all the 
> parameters to a single parameter based rw-specific iocb.  That makes the 
> retry for read and write more ameniable to sharing common logic akin to the 
> wtd_ ops, which we need at the very least for the semaphore operations.

I won't get in the way then. I just watch for things related to what I've
touched to make sure it isn't going wrong for anyone.


-- wli

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [Patch 2/2] Retry based aio read - filesystem read changes
  2003-03-31 19:17             ` William Lee Irwin III
@ 2003-03-31 19:25               ` Benjamin LaHaise
  0 siblings, 0 replies; 15+ messages in thread
From: Benjamin LaHaise @ 2003-03-31 19:25 UTC (permalink / raw)
  To: William Lee Irwin III, Janet Morgan, akpm, suparna, linux-aio,
	linux-kernel

On Mon, Mar 31, 2003 at 11:17:35AM -0800, William Lee Irwin III wrote:
> I won't get in the way then. I just watch for things related to what I've
> touched to make sure it isn't going wrong for anyone.

Longer term, I think you've got the right idea: we need to keep more 
statistics on io waits, as right now from a profiling point of view, any 
process that is blocked on io doesn't provide meaningful data to the 
profiler.

		-ben
-- 
Junk email?  <a href="mailto:aart@kvack.org">aart@kvack.org</a>

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [Patch 2/2] Retry based aio read - filesystem read changes
  2003-03-31 19:07             ` Janet Morgan
@ 2003-04-01 20:24               ` Benjamin LaHaise
  0 siblings, 0 replies; 15+ messages in thread
From: Benjamin LaHaise @ 2003-04-01 20:24 UTC (permalink / raw)
  To: Janet Morgan
  Cc: William Lee Irwin III, akpm, suparna, linux-aio, linux-kernel

On Mon, Mar 31, 2003 at 11:07:55AM -0800, Janet Morgan wrote:
> Can you post the patch you're referring to?

Something like this...  It also converts the async rw ops into vectored  
form.

		-ben


 drivers/char/raw.c       |   23 ------
 fs/adfs/file.c           |    4 -
 fs/affs/file.c           |   19 -----
 fs/afs/file.c            |    2 
 fs/aio.c                 |  127 ++++++++++++++++++++++++-----------
 fs/befs/linuxvfs.c       |    4 -
 fs/bfs/file.c            |    4 -
 fs/block_dev.c           |   18 +----
 fs/cifs/cifsfs.c         |    4 -
 fs/direct-io.c           |   10 +-
 fs/ext2/file.c           |    4 -
 fs/ext2/inode.c          |    4 -
 fs/ext3/file.c           |   14 +--
 fs/ext3/inode.c          |    4 -
 fs/fat/file.c            |   15 +---
 fs/freevxfs/vxfs_inode.c |    2 
 fs/hpfs/file.c           |   14 ---
 fs/hpfs/inode.c          |    2 
 fs/jffs/inode-v23.c      |    4 -
 fs/jffs2/file.c          |    4 -
 fs/jfs/file.c            |    4 -
 fs/minix/file.c          |    4 -
 fs/nfs/file.c            |   31 +++-----
 fs/nfs/write.c           |    4 -
 fs/ntfs/aops.c           |    2 
 fs/ntfs/file.c           |    4 -
 fs/qnx4/file.c           |    4 -
 fs/ramfs/inode.c         |    4 -
 fs/read_write.c          |  166 +++++++++++++++++++++++------------------------
 fs/reiserfs/file.c       |    4 -
 fs/smbfs/file.c          |   20 ++---
 fs/sysv/file.c           |    4 -
 fs/udf/file.c            |   18 ++---
 fs/ufs/file.c            |    4 -
 include/linux/aio.h      |   67 ++++++++++++++++--
 include/linux/fs.h       |   32 +++------
 include/net/sock.h       |   14 ++-
 kernel/ksyms.c           |    5 -
 mm/filemap.c             |  145 ++++++++---------------------------------
 net/socket.c             |   67 ++++++++----------
 40 files changed, 400 insertions(+), 485 deletions(-)
diff -purN linux-2.5/drivers/char/raw.c aio-2.5/drivers/char/raw.c
--- linux-2.5/drivers/char/raw.c	Tue Apr  1 15:17:26 2003
+++ aio-2.5/drivers/char/raw.c	Mon Mar 24 15:39:48 2003
@@ -220,33 +220,12 @@ out:
 	return err;
 }
 
-static ssize_t raw_file_write(struct file *file, const char *buf,
-				   size_t count, loff_t *ppos)
-{
-	struct iovec local_iov = { .iov_base = (void *)buf, .iov_len = count };
-
-	return generic_file_write_nolock(file, &local_iov, 1, ppos);
-}
-
-static ssize_t raw_file_aio_write(struct kiocb *iocb, const char *buf,
-					size_t count, loff_t pos)
-{
-	struct iovec local_iov = { .iov_base = (void *)buf, .iov_len = count };
-
-	return generic_file_aio_write_nolock(iocb, &local_iov, 1, &iocb->ki_pos);
-}
-
-
 static struct file_operations raw_fops = {
-	.read	=	generic_file_read,
 	.aio_read = 	generic_file_aio_read,
-	.write	=	raw_file_write,
-	.aio_write = 	raw_file_aio_write,
+	.aio_write = 	generic_file_aio_write_nolock,
 	.open	=	raw_open,
 	.release=	raw_release,
 	.ioctl	=	raw_ioctl,
-	.readv	= 	generic_file_readv,
-	.writev	= 	generic_file_writev,
 	.owner	=	THIS_MODULE,
 };
 
diff -purN linux-2.5/fs/adfs/file.c aio-2.5/fs/adfs/file.c
--- linux-2.5/fs/adfs/file.c	Tue Apr  1 15:17:53 2003
+++ aio-2.5/fs/adfs/file.c	Mon Mar 24 11:25:00 2003
@@ -31,11 +31,11 @@
 #include "adfs.h"
 
 struct file_operations adfs_file_operations = {
+	.aio_read	= generic_file_aio_read,
+	.aio_write	= generic_file_aio_write,
 	.llseek		= generic_file_llseek,
-	.read		= generic_file_read,
 	.mmap		= generic_file_mmap,
 	.fsync		= file_fsync,
-	.write		= generic_file_write,
 	.sendfile	= generic_file_sendfile,
 };
 
diff -purN linux-2.5/fs/affs/file.c aio-2.5/fs/affs/file.c
--- linux-2.5/fs/affs/file.c	Tue Apr  1 15:17:53 2003
+++ aio-2.5/fs/affs/file.c	Mon Mar 24 11:25:00 2003
@@ -39,14 +39,13 @@ static int affs_grow_extcache(struct ino
 static struct buffer_head *affs_alloc_extblock(struct inode *inode, struct buffer_head *bh, u32 ext);
 static inline struct buffer_head *affs_get_extblock(struct inode *inode, u32 ext);
 static struct buffer_head *affs_get_extblock_slow(struct inode *inode, u32 ext);
-static ssize_t affs_file_write(struct file *filp, const char *buf, size_t count, loff_t *ppos);
 static int affs_file_open(struct inode *inode, struct file *filp);
 static int affs_file_release(struct inode *inode, struct file *filp);
 
 struct file_operations affs_file_operations = {
 	.llseek		= generic_file_llseek,
-	.read		= generic_file_read,
-	.write		= affs_file_write,
+	.aio_read	= generic_file_aio_read,
+	.aio_write	= generic_file_aio_write,
 	.mmap		= generic_file_mmap,
 	.open		= affs_file_open,
 	.release	= affs_file_release,
@@ -490,20 +489,6 @@ affs_getemptyblk_ino(struct inode *inode
 	return ERR_PTR(err);
 }
 
-static ssize_t
-affs_file_write(struct file *file, const char *buf, size_t count, loff_t *ppos)
-{
-	ssize_t retval;
-
-	retval = generic_file_write (file, buf, count, ppos);
-	if (retval >0) {
-		struct inode *inode = file->f_dentry->d_inode;
-		inode->i_ctime = inode->i_mtime = CURRENT_TIME;
-		mark_inode_dirty(inode);
-	}
-	return retval;
-}
-
 static int
 affs_do_readpage_ofs(struct file *file, struct page *page, unsigned from, unsigned to)
 {
diff -purN linux-2.5/fs/afs/file.c aio-2.5/fs/afs/file.c
--- linux-2.5/fs/afs/file.c	Tue Apr  1 15:17:53 2003
+++ aio-2.5/fs/afs/file.c	Mon Mar 24 11:25:00 2003
@@ -35,7 +35,7 @@ struct inode_operations afs_file_inode_o
 };
 
 struct file_operations afs_file_file_operations = {
-	.read		= generic_file_read,
+	.aio_read	= generic_file_aio_read,
 	.write		= afs_file_write,
 	.mmap		= generic_file_readonly_mmap,
 #if 0
diff -purN linux-2.5/fs/aio.c aio-2.5/fs/aio.c
--- linux-2.5/fs/aio.c	Tue Mar 25 16:50:38 2003
+++ aio-2.5/fs/aio.c	Mon Mar 31 14:41:41 2003
@@ -64,7 +64,7 @@ static void aio_kick_handler(void *);
  */
 static int __init aio_setup(void)
 {
-	kiocb_cachep = kmem_cache_create("kiocb", sizeof(struct kiocb),
+	kiocb_cachep = kmem_cache_create("kiocb", sizeof(struct sync_iocb),
 				0, SLAB_HWCACHE_ALIGN, NULL, NULL);
 	if (!kiocb_cachep)
 		panic("unable to create kiocb cache\n");
@@ -148,7 +148,7 @@ static int aio_setup_ring(struct kioctx 
 
 	dprintk("mmap address: 0x%08lx\n", info->mmap_base);
 	info->nr_pages = get_user_pages(current, ctx->mm,
-					info->mmap_base, info->mmap_size, 
+					info->mmap_base, nr_pages, 
 					1, 0, info->ring_pages, NULL);
 	up_write(&ctx->mm->mmap_sem);
 
@@ -790,6 +790,20 @@ static inline void set_timeout(long star
 	add_timer(&to->timer);
 }
 
+static inline void update_ts(struct timespec *ts, long jiffies)
+{
+	struct timespec tmp;
+	jiffies_to_timespec(jiffies, &tmp);
+	ts->tv_sec -= tmp.tv_sec;
+	ts->tv_nsec -= tmp.tv_nsec;
+	if (ts->tv_nsec < 0) {
+		ts->tv_nsec += 1000000000;
+		ts->tv_sec -= 1;
+	}
+	if (ts->tv_sec < 0)
+		ts->tv_sec = ts->tv_nsec = 0;
+}
+
 static inline void clear_timeout(struct timeout *to)
 {
 	del_timer_sync(&to->timer);
@@ -807,6 +821,7 @@ static int read_events(struct kioctx *ct
 	int			i = 0;
 	struct io_event		ent;
 	struct timeout		to;
+	struct timespec		ts;
 
 	/* needed to zero any padding within an entry (there shouldn't be 
 	 * any, but C is fun!
@@ -844,7 +859,6 @@ static int read_events(struct kioctx *ct
 
 	init_timeout(&to);
 	if (timeout) {
-		struct timespec	ts;
 		ret = -EFAULT;
 		if (unlikely(copy_from_user(&ts, timeout, sizeof(ts))))
 			goto out;
@@ -890,8 +904,12 @@ static int read_events(struct kioctx *ct
 		i ++;
 	}
 
-	if (timeout)
+	if (timeout) {
 		clear_timeout(&to);
+		update_ts(&ts, jiffies - start_jiffies);
+		if (copy_to_user(timeout, &ts, sizeof(ts)))
+			ret = -EFAULT;
+	}
 out:
 	return i ? i : ret;
 }
@@ -984,6 +1002,63 @@ asmlinkage long sys_io_destroy(aio_conte
 	return -EINVAL;
 }
 
+static ssize_t rw_issue(struct rw_iocb *rw, struct file *file,
+		        struct iocb *iocb, ssize_t (*op)(struct rw_iocb *))
+{
+	ssize_t ret;
+
+	if (unlikely(NULL == op))
+		return -EINVAL;
+	if (unlikely(!(file->f_mode &
+			(rw->rw == WRITE ? FMODE_WRITE : FMODE_READ))))
+		return -EBADF;
+	if (unlikely(!access_ok((rw->rw == WRITE ? VERIFY_READ : VERIFY_WRITE),
+				rw->rw_local_iov.iov_base,
+				rw->rw_local_iov.iov_len)))
+		return -EFAULT;
+
+	rw->rw_local_iov.iov_base = (char *)(unsigned long)iocb->aio_buf;
+	rw->rw_local_iov.iov_len = iocb->aio_nbytes;
+	rw->rw_nsegs = 1;
+	rw->rw_iov = &rw->rw_local_iov;
+	rw->rw_pos = iocb->aio_offset;
+
+	ret = op(rw);
+	if (-EIOCBQUEUED != ret)
+		aio_complete(&rw->kiocb, ret, 0);
+	return 0;
+}
+
+static ssize_t io_submit_pread(struct kiocb *kiocb, struct file *file,
+			       struct iocb *iocb)
+{
+	struct rw_iocb *rw = kiocb_to_rw_iocb(kiocb);
+	rw->rw = READ;
+	return rw_issue(rw, file, iocb, file->f_op->aio_read);
+}
+
+static ssize_t io_submit_pwrite(struct kiocb *kiocb, struct file *file,
+				struct iocb *iocb)
+{
+	struct rw_iocb *rw = (struct rw_iocb *)kiocb;
+	rw->rw = WRITE;
+	return rw_issue(rw, file, iocb, file->f_op->aio_write);
+}
+static ssize_t io_submit_fsync(struct kiocb *kiocb, struct file *file,
+			       int dsync)
+{
+	struct fsync_iocb *fsync_iocb = kiocb_to_fsync_iocb(kiocb);
+	ssize_t ret = -EINVAL;
+	fsync_iocb->dsync = dsync;
+	if (NULL != file->f_op->aio_fsync) {
+		ret = file->f_op->aio_fsync(fsync_iocb);
+		if (-EIOCBQUEUED != ret)
+			aio_complete(&fsync_iocb->kiocb, ret, 0);
+		ret = 0;
+	}
+	return ret;
+}
+
 static int FASTCALL(io_submit_one(struct kioctx *ctx, struct iocb *user_iocb,
 				  struct iocb *iocb));
 static int io_submit_one(struct kioctx *ctx, struct iocb *user_iocb,
@@ -992,7 +1067,6 @@ static int io_submit_one(struct kioctx *
 	struct kiocb *req;
 	struct file *file;
 	ssize_t ret;
-	char *buf;
 
 	/* enforce forwards compatibility on users */
 	if (unlikely(iocb->aio_reserved1 || iocb->aio_reserved2 ||
@@ -1031,57 +1105,30 @@ static int io_submit_one(struct kioctx *
 
 	req->ki_user_obj = user_iocb;
 	req->ki_user_data = iocb->aio_data;
-	req->ki_pos = iocb->aio_offset;
-
-	buf = (char *)(unsigned long)iocb->aio_buf;
 
 	switch (iocb->aio_lio_opcode) {
 	case IOCB_CMD_PREAD:
-		ret = -EBADF;
-		if (unlikely(!(file->f_mode & FMODE_READ)))
-			goto out_put_req;
-		ret = -EFAULT;
-		if (unlikely(!access_ok(VERIFY_WRITE, buf, iocb->aio_nbytes)))
-			goto out_put_req;
-		ret = -EINVAL;
-		if (file->f_op->aio_read)
-			ret = file->f_op->aio_read(req, buf,
-					iocb->aio_nbytes, req->ki_pos);
+		ret = io_submit_pread(req, file, iocb);
 		break;
 	case IOCB_CMD_PWRITE:
-		ret = -EBADF;
-		if (unlikely(!(file->f_mode & FMODE_WRITE)))
-			goto out_put_req;
-		ret = -EFAULT;
-		if (unlikely(!access_ok(VERIFY_READ, buf, iocb->aio_nbytes)))
-			goto out_put_req;
-		ret = -EINVAL;
-		if (file->f_op->aio_write)
-			ret = file->f_op->aio_write(req, buf,
-					iocb->aio_nbytes, req->ki_pos);
+		ret = io_submit_pwrite(req, file, iocb);
 		break;
 	case IOCB_CMD_FDSYNC:
-		ret = -EINVAL;
-		if (file->f_op->aio_fsync)
-			ret = file->f_op->aio_fsync(req, 1);
+		ret = io_submit_fsync(req, file, 1);
 		break;
 	case IOCB_CMD_FSYNC:
-		ret = -EINVAL;
-		if (file->f_op->aio_fsync)
-			ret = file->f_op->aio_fsync(req, 0);
+		ret = io_submit_fsync(req, file, 0);
 		break;
 	default:
 		dprintk("EINVAL: io_submit: no operation provided\n");
 		ret = -EINVAL;
 	}
 
-	if (likely(-EIOCBQUEUED == ret))
-		return 0;
-	aio_complete(req, ret, 0);
-	return 0;
-
+	if (-EIOCBQUEUED == ret)
+		ret = 0;
 out_put_req:
-	aio_put_req(req);
+	if (ret)
+		aio_put_req(req);
 	return ret;
 }
 
diff -purN linux-2.5/fs/befs/linuxvfs.c aio-2.5/fs/befs/linuxvfs.c
--- linux-2.5/fs/befs/linuxvfs.c	Tue Apr  1 15:17:53 2003
+++ aio-2.5/fs/befs/linuxvfs.c	Mon Mar 24 11:25:00 2003
@@ -73,7 +73,7 @@ struct inode_operations befs_dir_inode_o
 
 struct file_operations befs_file_operations = {
 	.llseek		= default_llseek,
-	.read		= generic_file_read,
+	.aio_read	= generic_file_aio_read,
 	.mmap		= generic_file_readonly_mmap,
 };
 
@@ -89,7 +89,7 @@ static struct inode_operations befs_syml
 };
 
 /* 
- * Called by generic_file_read() to read a page of data
+ * Called by generic_file_aio_read() to read a page of data
  * 
  * In turn, simply calls a generic block read function and
  * passes it the address of befs_get_block, for mapping file
diff -purN linux-2.5/fs/bfs/file.c aio-2.5/fs/bfs/file.c
--- linux-2.5/fs/bfs/file.c	Tue Apr  1 15:17:53 2003
+++ aio-2.5/fs/bfs/file.c	Mon Mar 24 11:25:00 2003
@@ -19,8 +19,8 @@
 
 struct file_operations bfs_file_operations = {
 	.llseek 	= generic_file_llseek,
-	.read		= generic_file_read,
-	.write		= generic_file_write,
+	.aio_read	= generic_file_aio_read,
+	.aio_write	= generic_file_aio_write,
 	.mmap		= generic_file_mmap,
 	.sendfile	= generic_file_sendfile,
 };
diff -purN linux-2.5/fs/block_dev.c aio-2.5/fs/block_dev.c
--- linux-2.5/fs/block_dev.c	Tue Apr  1 15:17:52 2003
+++ aio-2.5/fs/block_dev.c	Mon Mar 24 17:25:12 2003
@@ -118,10 +118,10 @@ blkdev_get_blocks(struct inode *inode, s
 }
 
 static int
-blkdev_direct_IO(int rw, struct kiocb *iocb, const struct iovec *iov,
+blkdev_direct_IO(int rw, struct rw_iocb *iocb, const struct iovec *iov,
 			loff_t offset, unsigned long nr_segs)
 {
-	struct file *file = iocb->ki_filp;
+	struct file *file = iocb->kiocb.ki_filp;
 	struct inode *inode = file->f_dentry->d_inode->i_mapping->host;
 
 	return blockdev_direct_IO(rw, iocb, inode, inode->i_bdev, iov, offset,
@@ -694,14 +694,6 @@ int blkdev_close(struct inode * inode, s
 	return blkdev_put(inode->i_bdev, BDEV_FILE);
 }
 
-static ssize_t blkdev_file_write(struct file *file, const char *buf,
-				   size_t count, loff_t *ppos)
-{
-	struct iovec local_iov = { .iov_base = (void *)buf, .iov_len = count };
-
-	return generic_file_write_nolock(file, &local_iov, 1, ppos);
-}
-
 struct address_space_operations def_blk_aops = {
 	.readpage	= blkdev_readpage,
 	.writepage	= blkdev_writepage,
@@ -716,13 +708,11 @@ struct file_operations def_blk_fops = {
 	.open		= blkdev_open,
 	.release	= blkdev_close,
 	.llseek		= block_llseek,
-	.read		= generic_file_read,
-	.write		= blkdev_file_write,
+	.aio_read	= generic_file_aio_read,
+	.aio_write	= generic_file_aio_write_nolock,
 	.mmap		= generic_file_mmap,
 	.fsync		= block_fsync,
 	.ioctl		= blkdev_ioctl,
-	.readv		= generic_file_readv,
-	.writev		= generic_file_writev,
 	.sendfile	= generic_file_sendfile,
 };
 
diff -purN linux-2.5/fs/cifs/cifsfs.c aio-2.5/fs/cifs/cifsfs.c
--- linux-2.5/fs/cifs/cifsfs.c	Tue Apr  1 15:17:53 2003
+++ aio-2.5/fs/cifs/cifsfs.c	Mon Mar 24 11:25:00 2003
@@ -316,8 +316,8 @@ struct inode_operations cifs_symlink_ino
 };
 
 struct file_operations cifs_file_ops = {
-	.read = generic_file_read,
-	.write = generic_file_write, 
+	.aio_read = generic_file_aio_read,
+	.aio_write = generic_file_aio_write, 
 	.open = cifs_open,
 	.release = cifs_close,
 	.lock = cifs_lock,
diff -purN linux-2.5/fs/direct-io.c aio-2.5/fs/direct-io.c
--- linux-2.5/fs/direct-io.c	Tue Apr  1 15:17:52 2003
+++ aio-2.5/fs/direct-io.c	Mon Mar 24 17:26:02 2003
@@ -113,7 +113,7 @@ struct dio {
 	struct task_struct *waiter;	/* waiting task (NULL if none) */
 
 	/* AIO related stuff */
-	struct kiocb *iocb;		/* kiocb */
+	struct rw_iocb *iocb;		/* kiocb */
 	int is_async;			/* is IO async ? */
 	int result;			/* IO result */
 };
@@ -200,7 +200,7 @@ static void finished_one_bio(struct dio 
 {
 	if (atomic_dec_and_test(&dio->bio_count)) {
 		if(dio->is_async) {
-			aio_complete(dio->iocb, dio->result, 0);
+			aio_complete(&dio->iocb->kiocb, dio->result, 0);
 			kfree(dio);
 		}
 	}
@@ -822,7 +822,7 @@ out:
 }
 
 static int
-direct_io_worker(int rw, struct kiocb *iocb, struct inode *inode, 
+direct_io_worker(int rw, struct rw_iocb *iocb, struct inode *inode, 
 	const struct iovec *iov, loff_t offset, unsigned long nr_segs, 
 	unsigned blkbits, get_blocks_t get_blocks)
 {
@@ -836,7 +836,7 @@ direct_io_worker(int rw, struct kiocb *i
 	dio = kmalloc(sizeof(*dio), GFP_KERNEL);
 	if (!dio)
 		return -ENOMEM;
-	dio->is_async = !is_sync_kiocb(iocb);
+	dio->is_async = !is_sync_kiocb(&iocb->kiocb);
 
 	dio->bio = NULL;
 	dio->inode = inode;
@@ -960,7 +960,7 @@ direct_io_worker(int rw, struct kiocb *i
  * This is a library function for use by filesystem drivers.
  */
 int
-blockdev_direct_IO(int rw, struct kiocb *iocb, struct inode *inode, 
+blockdev_direct_IO(int rw, struct rw_iocb *iocb, struct inode *inode, 
 	struct block_device *bdev, const struct iovec *iov, loff_t offset, 
 	unsigned long nr_segs, get_blocks_t get_blocks)
 {
diff -purN linux-2.5/fs/ext2/file.c aio-2.5/fs/ext2/file.c
--- linux-2.5/fs/ext2/file.c	Tue Apr  1 15:17:53 2003
+++ aio-2.5/fs/ext2/file.c	Mon Mar 24 15:39:52 2003
@@ -41,8 +41,6 @@ static int ext2_release_file (struct ino
  */
 struct file_operations ext2_file_operations = {
 	.llseek		= generic_file_llseek,
-	.read		= generic_file_read,
-	.write		= generic_file_write,
 	.aio_read	= generic_file_aio_read,
 	.aio_write	= generic_file_aio_write,
 	.ioctl		= ext2_ioctl,
@@ -50,8 +48,6 @@ struct file_operations ext2_file_operati
 	.open		= generic_file_open,
 	.release	= ext2_release_file,
 	.fsync		= ext2_sync_file,
-	.readv		= generic_file_readv,
-	.writev		= generic_file_writev,
 	.sendfile	= generic_file_sendfile,
 };
 
diff -purN linux-2.5/fs/ext2/inode.c aio-2.5/fs/ext2/inode.c
--- linux-2.5/fs/ext2/inode.c	Tue Apr  1 15:17:53 2003
+++ aio-2.5/fs/ext2/inode.c	Mon Mar 24 17:26:06 2003
@@ -650,10 +650,10 @@ ext2_get_blocks(struct inode *inode, sec
 }
 
 static int
-ext2_direct_IO(int rw, struct kiocb *iocb, const struct iovec *iov,
+ext2_direct_IO(int rw, struct rw_iocb *iocb, const struct iovec *iov,
 			loff_t offset, unsigned long nr_segs)
 {
-	struct file *file = iocb->ki_filp;
+	struct file *file = iocb->kiocb.ki_filp;
 	struct inode *inode = file->f_dentry->d_inode->i_mapping->host;
 
 	return blockdev_direct_IO(rw, iocb, inode, inode->i_sb->s_bdev, iov,
diff -purN linux-2.5/fs/ext3/file.c aio-2.5/fs/ext3/file.c
--- linux-2.5/fs/ext3/file.c	Tue Apr  1 15:17:53 2003
+++ aio-2.5/fs/ext3/file.c	Mon Mar 24 17:26:09 2003
@@ -56,13 +56,13 @@ static int ext3_open_file (struct inode 
 }
 
 static ssize_t
-ext3_file_write(struct kiocb *iocb, const char *buf, size_t count, loff_t pos)
+ext3_file_write(struct rw_iocb *iocb)
 {
-	struct file *file = iocb->ki_filp;
+	struct file *file = iocb->kiocb.ki_filp;
 	struct inode *inode = file->f_dentry->d_inode;
 	int ret, err;
 
-	ret = generic_file_aio_write(iocb, buf, count, pos);
+	ret = generic_file_aio_write(iocb);
 
 	/*
 	 * Skip flushing if there was an error, or if nothing was written.
@@ -114,12 +114,8 @@ force_commit:
 
 struct file_operations ext3_file_operations = {
 	.llseek		= generic_file_llseek,
-	.read		= do_sync_read,
-	.write		= do_sync_write,
-	.aio_read		= generic_file_aio_read,
-	.aio_write		= ext3_file_write,
-	.readv		= generic_file_readv,
-	.writev		= generic_file_writev,
+	.aio_read	= generic_file_aio_read,
+	.aio_write	= ext3_file_write,
 	.ioctl		= ext3_ioctl,
 	.mmap		= generic_file_mmap,
 	.open		= ext3_open_file,
diff -purN linux-2.5/fs/ext3/inode.c aio-2.5/fs/ext3/inode.c
--- linux-2.5/fs/ext3/inode.c	Tue Apr  1 15:17:53 2003
+++ aio-2.5/fs/ext3/inode.c	Mon Mar 24 17:26:13 2003
@@ -1426,11 +1426,11 @@ static int ext3_releasepage(struct page 
  * If the O_DIRECT write is intantiating holes inside i_size and the machine
  * crashes then stale disk data _may_ be exposed inside the file.
  */
-static int ext3_direct_IO(int rw, struct kiocb *iocb,
+static int ext3_direct_IO(int rw, struct rw_iocb *iocb,
 			const struct iovec *iov, loff_t offset,
 			unsigned long nr_segs)
 {
-	struct file *file = iocb->ki_filp;
+	struct file *file = iocb->kiocb.ki_filp;
 	struct inode *inode = file->f_dentry->d_inode->i_mapping->host;
 	struct ext3_inode_info *ei = EXT3_I(inode);
 	handle_t *handle = NULL;
diff -purN linux-2.5/fs/fat/file.c aio-2.5/fs/fat/file.c
--- linux-2.5/fs/fat/file.c	Tue Apr  1 15:17:54 2003
+++ aio-2.5/fs/fat/file.c	Mon Mar 24 17:54:55 2003
@@ -11,13 +11,12 @@
 #include <linux/smp_lock.h>
 #include <linux/buffer_head.h>
 
-static ssize_t fat_file_write(struct file *filp, const char *buf, size_t count,
-			      loff_t *ppos);
+static ssize_t fat_file_write(struct rw_iocb *iocb);
 
 struct file_operations fat_file_operations = {
 	.llseek		= generic_file_llseek,
-	.read		= generic_file_read,
-	.write		= fat_file_write,
+	.aio_read	= generic_file_aio_read,
+	.aio_write	= fat_file_write,
 	.mmap		= generic_file_mmap,
 	.fsync		= file_fsync,
 	.sendfile	= generic_file_sendfile,
@@ -65,14 +64,14 @@ int fat_get_block(struct inode *inode, s
 	return 0;
 }
 
-static ssize_t fat_file_write(struct file *filp, const char *buf, size_t count,
-			      loff_t *ppos)
+static ssize_t fat_file_write(struct rw_iocb *iocb)
 {
-	struct inode *inode = filp->f_dentry->d_inode;
+	struct file *filp = iocb->kiocb.ki_filp;
 	int retval;
 
-	retval = generic_file_write(filp, buf, count, ppos);
+	retval = generic_file_aio_write(iocb);
 	if (retval > 0) {
+		struct inode *inode = filp->f_dentry->d_inode;
 		inode->i_mtime = inode->i_ctime = CURRENT_TIME;
 		MSDOS_I(inode)->i_attrs |= ATTR_ARCH;
 		mark_inode_dirty(inode);
diff -purN linux-2.5/fs/freevxfs/vxfs_inode.c aio-2.5/fs/freevxfs/vxfs_inode.c
--- linux-2.5/fs/freevxfs/vxfs_inode.c	Tue Apr  1 15:17:55 2003
+++ aio-2.5/fs/freevxfs/vxfs_inode.c	Mon Mar 24 11:25:00 2003
@@ -51,7 +51,7 @@ extern struct inode_operations vxfs_imme
 static struct file_operations vxfs_file_operations = {
 	.open =			generic_file_open,
 	.llseek =		generic_file_llseek,
-	.read =			generic_file_read,
+	.aio_read =		generic_file_aio_read,
 	.mmap =			generic_file_mmap,
 	.sendfile =		generic_file_sendfile,
 };
diff -purN linux-2.5/fs/hpfs/file.c aio-2.5/fs/hpfs/file.c
--- linux-2.5/fs/hpfs/file.c	Tue Apr  1 15:17:55 2003
+++ aio-2.5/fs/hpfs/file.c	Mon Mar 24 11:25:00 2003
@@ -123,17 +123,3 @@ struct address_space_operations hpfs_aop
 	.commit_write = generic_commit_write,
 	.bmap = _hpfs_bmap
 };
-
-ssize_t hpfs_file_write(struct file *file, const char *buf, size_t count, loff_t *ppos)
-{
-	ssize_t retval;
-
-	retval = generic_file_write(file, buf, count, ppos);
-	if (retval > 0) {
-		struct inode *inode = file->f_dentry->d_inode;
-		inode->i_mtime = CURRENT_TIME;
-		hpfs_i(inode)->i_dirty = 1;
-	}
-	return retval;
-}
-
diff -purN linux-2.5/fs/hpfs/inode.c aio-2.5/fs/hpfs/inode.c
--- linux-2.5/fs/hpfs/inode.c	Tue Apr  1 15:17:55 2003
+++ aio-2.5/fs/hpfs/inode.c	Mon Mar 24 11:25:00 2003
@@ -15,7 +15,7 @@
 static struct file_operations hpfs_file_ops =
 {
 	.llseek		= generic_file_llseek,
-	.read		= generic_file_read,
+	.aio_read	= generic_file_aio_read,
 	.write		= hpfs_file_write,
 	.mmap		= generic_file_mmap,
 	.open		= hpfs_open,
diff -purN linux-2.5/fs/jffs/inode-v23.c aio-2.5/fs/jffs/inode-v23.c
--- linux-2.5/fs/jffs/inode-v23.c	Tue Apr  1 15:17:55 2003
+++ aio-2.5/fs/jffs/inode-v23.c	Mon Mar 24 11:25:00 2003
@@ -1639,8 +1639,8 @@ static struct file_operations jffs_file_
 {
 	.open		= generic_file_open,
 	.llseek		= generic_file_llseek,
-	.read		= generic_file_read,
-	.write		= generic_file_write,
+	.aio_read	= generic_file_aio_read,
+	.aio_write	= generic_file_aio_write,
 	.ioctl		= jffs_ioctl,
 	.mmap		= generic_file_readonly_mmap,
 	.fsync		= jffs_fsync,
diff -purN linux-2.5/fs/jffs2/file.c aio-2.5/fs/jffs2/file.c
--- linux-2.5/fs/jffs2/file.c	Tue Apr  1 15:17:55 2003
+++ aio-2.5/fs/jffs2/file.c	Mon Mar 24 11:46:18 2003
@@ -55,8 +55,8 @@ struct file_operations jffs2_file_operat
 {
 	.llseek =	generic_file_llseek,
 	.open =		generic_file_open,
-	.read =		generic_file_read,
-	.write =	generic_file_write,
+	.aio_read =	generic_file_aio_read,
+	.aio_write =	generic_file_aio_write,
 	.ioctl =	jffs2_ioctl,
 	.mmap =		generic_file_readonly_mmap,
 	.fsync =	jffs2_fsync,
diff -purN linux-2.5/fs/jfs/file.c aio-2.5/fs/jfs/file.c
--- linux-2.5/fs/jfs/file.c	Tue Apr  1 15:17:55 2003
+++ aio-2.5/fs/jfs/file.c	Mon Mar 24 15:40:01 2003
@@ -100,13 +100,9 @@ struct inode_operations jfs_file_inode_o
 struct file_operations jfs_file_operations = {
 	.open		= jfs_open,
 	.llseek		= generic_file_llseek,
-	.write		= generic_file_write,
-	.read		= generic_file_read,
 	.aio_read	= generic_file_aio_read,
 	.aio_write	= generic_file_aio_write,
 	.mmap		= generic_file_mmap,
-	.readv		= generic_file_readv,
-	.writev		= generic_file_writev,
  	.sendfile	= generic_file_sendfile,
 	.fsync		= jfs_fsync,
 	.release	= jfs_release,
diff -purN linux-2.5/fs/minix/file.c aio-2.5/fs/minix/file.c
--- linux-2.5/fs/minix/file.c	Tue Apr  1 15:17:55 2003
+++ aio-2.5/fs/minix/file.c	Mon Mar 24 11:25:00 2003
@@ -17,8 +17,8 @@ int minix_sync_file(struct file *, struc
 
 struct file_operations minix_file_operations = {
 	.llseek		= generic_file_llseek,
-	.read		= generic_file_read,
-	.write		= generic_file_write,
+	.aio_read	= generic_file_aio_read,
+	.aio_write	= generic_file_aio_write,
 	.mmap		= generic_file_mmap,
 	.fsync		= minix_sync_file,
 	.sendfile	= generic_file_sendfile,
diff -purN linux-2.5/fs/nfs/file.c aio-2.5/fs/nfs/file.c
--- linux-2.5/fs/nfs/file.c	Tue Apr  1 15:17:56 2003
+++ aio-2.5/fs/nfs/file.c	Mon Mar 24 17:26:18 2003
@@ -36,17 +36,15 @@
 
 static int  nfs_file_mmap(struct file *, struct vm_area_struct *);
 static ssize_t nfs_file_sendfile(struct file *, loff_t *, size_t, read_actor_t, void *);
-static ssize_t nfs_file_read(struct kiocb *, char *, size_t, loff_t);
-static ssize_t nfs_file_write(struct kiocb *, const char *, size_t, loff_t);
+static ssize_t nfs_file_read(struct rw_iocb *);
+static ssize_t nfs_file_write(struct rw_iocb *);
 static int  nfs_file_flush(struct file *);
 static int  nfs_fsync(struct file *, struct dentry *dentry, int datasync);
 
 struct file_operations nfs_file_operations = {
 	.llseek		= remote_llseek,
-	.read		= do_sync_read,
-	.write		= do_sync_write,
-	.aio_read		= nfs_file_read,
-	.aio_write		= nfs_file_write,
+	.aio_read	= nfs_file_read,
+	.aio_write	= nfs_file_write,
 	.mmap		= nfs_file_mmap,
 	.open		= nfs_open,
 	.flush		= nfs_file_flush,
@@ -88,19 +86,19 @@ nfs_file_flush(struct file *file)
 }
 
 static ssize_t
-nfs_file_read(struct kiocb *iocb, char * buf, size_t count, loff_t pos)
+nfs_file_read(struct rw_iocb *iocb)
 {
-	struct dentry * dentry = iocb->ki_filp->f_dentry;
+	struct dentry * dentry = iocb->kiocb.ki_filp->f_dentry;
 	struct inode * inode = dentry->d_inode;
 	ssize_t result;
 
 	dfprintk(VFS, "nfs: read(%s/%s, %lu@%lu)\n",
 		dentry->d_parent->d_name.name, dentry->d_name.name,
-		(unsigned long) count, (unsigned long) pos);
+		(unsigned long)iocb->rw_iov->iov_len, (unsigned long)iocb->rw_pos);
 
 	result = nfs_revalidate_inode(NFS_SERVER(inode), inode);
 	if (!result)
-		result = generic_file_aio_read(iocb, buf, count, pos);
+		result = generic_file_aio_read(iocb);
 	return result;
 }
 
@@ -202,15 +200,16 @@ struct address_space_operations nfs_file
  * Write to a file (through the page cache).
  */
 static ssize_t
-nfs_file_write(struct kiocb *iocb, const char *buf, size_t count, loff_t pos)
+nfs_file_write(struct rw_iocb *iocb)
 {
-	struct dentry * dentry = iocb->ki_filp->f_dentry;
+	struct dentry * dentry = iocb->kiocb.ki_filp->f_dentry;
 	struct inode * inode = dentry->d_inode;
 	ssize_t result;
 
 	dfprintk(VFS, "nfs: write(%s/%s(%ld), %lu@%lu)\n",
 		dentry->d_parent->d_name.name, dentry->d_name.name,
-		inode->i_ino, (unsigned long) count, (unsigned long) pos);
+		inode->i_ino, (unsigned long)iocb->rw_iov->iov_len,
+		(unsigned long)iocb->rw_pos);
 
 	result = -EBUSY;
 	if (IS_SWAPFILE(inode))
@@ -219,11 +218,7 @@ nfs_file_write(struct kiocb *iocb, const
 	if (result)
 		goto out;
 
-	result = count;
-	if (!count)
-		goto out;
-
-	result = generic_file_aio_write(iocb, buf, count, pos);
+	result = generic_file_aio_write(iocb);
 out:
 	return result;
 
diff -purN linux-2.5/fs/nfs/write.c aio-2.5/fs/nfs/write.c
--- linux-2.5/fs/nfs/write.c	Tue Apr  1 15:17:56 2003
+++ aio-2.5/fs/nfs/write.c	Mon Mar 24 11:25:01 2003
@@ -28,7 +28,7 @@
  *
  *  -	A write request is in progress.
  *  -	A user process is in generic_file_write/nfs_update_page
- *  -	A user process is in generic_file_read
+ *  -	A user process is in generic_file_aio_read
  *
  * Also note that because of the way pages are invalidated in
  * nfs_revalidate_inode, the following assertions hold:
@@ -645,7 +645,7 @@ nfs_flush_incompatible(struct file *file
 /*
  * Update and possibly write a cached page of an NFS file.
  *
- * XXX: Keep an eye on generic_file_read to make sure it doesn't do bad
+ * XXX: Keep an eye on generic_file_aio_read to make sure it doesn't do bad
  * things with a page scheduled for an RPC call (e.g. invalidate it).
  */
 int
diff -purN linux-2.5/fs/ntfs/aops.c aio-2.5/fs/ntfs/aops.c
--- linux-2.5/fs/ntfs/aops.c	Tue Apr  1 15:17:56 2003
+++ aio-2.5/fs/ntfs/aops.c	Mon Mar 24 11:25:01 2003
@@ -157,7 +157,7 @@ still_busy:
  * unlocking it.
  *
  * We only enforce allocated_size limit because i_size is checked for in
- * generic_file_read().
+ * generic_file_aio_read().
  *
  * Return 0 on success and -errno on error.
  *
diff -purN linux-2.5/fs/ntfs/file.c aio-2.5/fs/ntfs/file.c
--- linux-2.5/fs/ntfs/file.c	Tue Apr  1 15:17:56 2003
+++ aio-2.5/fs/ntfs/file.c	Mon Mar 24 11:46:18 2003
@@ -50,9 +50,9 @@ static int ntfs_file_open(struct inode *
 
 struct file_operations ntfs_file_ops = {
 	.llseek		= generic_file_llseek,	/* Seek inside file. */
-	.read		= generic_file_read,	/* Read from file. */
+	.aio_read	= generic_file_aio_read,/* Read from file. */
 #ifdef NTFS_RW
-	.write		= generic_file_write,	/* Write to a file. */
+	.aio_write	= generic_file_aio_write,/* Write to a file. */
 #endif
 	.mmap		= generic_file_mmap,	/* Mmap file. */
 	.sendfile	= generic_file_sendfile,/* Zero-copy data send with the
diff -purN linux-2.5/fs/qnx4/file.c aio-2.5/fs/qnx4/file.c
--- linux-2.5/fs/qnx4/file.c	Tue Apr  1 15:17:56 2003
+++ aio-2.5/fs/qnx4/file.c	Mon Mar 24 11:25:01 2003
@@ -25,11 +25,11 @@
 struct file_operations qnx4_file_operations =
 {
 	.llseek		= generic_file_llseek,
-	.read		= generic_file_read,
+	.aio_read	= generic_file_aio_read,
 	.mmap		= generic_file_mmap,
 	.sendfile	= generic_file_sendfile,
 #ifdef CONFIG_QNX4FS_RW
-	.write		= generic_file_write,
+	.aio_write	= generic_file_aio_write,
 	.fsync		= qnx4_sync_file,
 #endif
 };
diff -purN linux-2.5/fs/ramfs/inode.c aio-2.5/fs/ramfs/inode.c
--- linux-2.5/fs/ramfs/inode.c	Tue Apr  1 15:17:56 2003
+++ aio-2.5/fs/ramfs/inode.c	Mon Mar 24 11:25:01 2003
@@ -141,8 +141,8 @@ static struct address_space_operations r
 };
 
 static struct file_operations ramfs_file_operations = {
-	.read		= generic_file_read,
-	.write		= generic_file_write,
+	.aio_read	= generic_file_aio_read,
+	.aio_write	= generic_file_aio_write,
 	.mmap		= generic_file_mmap,
 	.fsync		= simple_sync_file,
 	.sendfile	= generic_file_sendfile,
diff -purN linux-2.5/fs/read_write.c aio-2.5/fs/read_write.c
--- linux-2.5/fs/read_write.c	Tue Mar 25 16:50:38 2003
+++ aio-2.5/fs/read_write.c	Tue Mar 25 16:17:41 2003
@@ -18,7 +18,7 @@
 
 struct file_operations generic_ro_fops = {
 	.llseek		= generic_file_llseek,
-	.read		= generic_file_read,
+	.aio_read	= generic_file_aio_read,
 	.mmap		= generic_file_readonly_mmap,
 	.sendfile	= generic_file_sendfile,
 };
@@ -167,28 +167,44 @@ bad:
 }
 #endif
 
-ssize_t do_sync_read(struct file *filp, char *buf, size_t len, loff_t *ppos)
+static ssize_t do_sync_rwv(struct file *filp, char *buf, size_t tot_len,
+			   struct iovec *iov, unsigned nr_segs, loff_t *ppos,
+			   ssize_t (*op)(struct rw_iocb *), int rw)
 {
-	struct kiocb kiocb;
+	struct sync_iocb sync_iocb;
+	struct rw_iocb *iocb = kiocb_to_rw_iocb(&sync_iocb.kiocb);
 	ssize_t ret;
 
-	init_sync_kiocb(&kiocb, filp);
-	kiocb.ki_pos = *ppos;
-	ret = filp->f_op->aio_read(&kiocb, buf, len, kiocb.ki_pos);
+	init_sync_kiocb(&iocb->kiocb, filp);
+	iocb->rw = rw;
+	iocb->rw_pos = *ppos;
+	iocb->rw_nsegs = nr_segs;
+	iocb->rw_iov = (NULL == iov) ? &iocb->rw_local_iov : iov;
+	iocb->rw_local_iov.iov_base = buf;
+	iocb->rw_local_iov.iov_len = tot_len;
+	
+	ret = op(iocb);
 	if (-EIOCBQUEUED == ret)
-		ret = wait_on_sync_kiocb(&kiocb);
-	*ppos = kiocb.ki_pos;
+		ret = wait_on_sync_kiocb(&iocb->kiocb);
+	*ppos = iocb->rw_pos;
 	return ret;
 }
 
+ssize_t do_sync_rw(struct file *filp, char *buf, size_t count, loff_t *ppos,
+		   ssize_t (*op)(struct rw_iocb *), int rw)
+{
+	return do_sync_rwv(filp, buf, count, NULL, 1, ppos, op, rw);
+}
+
 ssize_t vfs_read(struct file *file, char *buf, size_t count, loff_t *pos)
 {
 	struct inode *inode = file->f_dentry->d_inode;
 	ssize_t ret;
 
-	if (!(file->f_mode & FMODE_READ))
+	if (unlikely(!(file->f_mode & FMODE_READ)))
 		return -EBADF;
-	if (!file->f_op || (!file->f_op->read && !file->f_op->aio_read))
+	if (unlikely(!file->f_op ||
+		     (!file->f_op->read && !file->f_op->aio_read)))
 		return -EINVAL;
 
 	ret = locks_verify_area(FLOCK_VERIFY_READ, inode, file, *pos, count);
@@ -198,7 +214,8 @@ ssize_t vfs_read(struct file *file, char
 			if (file->f_op->read)
 				ret = file->f_op->read(file, buf, count, pos);
 			else
-				ret = do_sync_read(file, buf, count, pos);
+				ret = do_sync_rw(file, buf, count, pos,
+						 file->f_op->aio_read, READ);
 			if (ret > 0)
 				dnotify_parent(file->f_dentry, DN_ACCESS);
 		}
@@ -207,28 +224,15 @@ ssize_t vfs_read(struct file *file, char
 	return ret;
 }
 
-ssize_t do_sync_write(struct file *filp, const char *buf, size_t len, loff_t *ppos)
-{
-	struct kiocb kiocb;
-	ssize_t ret;
-
-	init_sync_kiocb(&kiocb, filp);
-	kiocb.ki_pos = *ppos;
-	ret = filp->f_op->aio_write(&kiocb, buf, len, kiocb.ki_pos);
-	if (-EIOCBQUEUED == ret)
-		ret = wait_on_sync_kiocb(&kiocb);
-	*ppos = kiocb.ki_pos;
-	return ret;
-}
-
 ssize_t vfs_write(struct file *file, const char *buf, size_t count, loff_t *pos)
 {
 	struct inode *inode = file->f_dentry->d_inode;
 	ssize_t ret;
 
-	if (!(file->f_mode & FMODE_WRITE))
+	if (unlikely(!(file->f_mode & FMODE_WRITE)))
 		return -EBADF;
-	if (!file->f_op || (!file->f_op->write && !file->f_op->aio_write))
+	if (unlikely(!file->f_op ||
+		     (!file->f_op->write && !file->f_op->aio_write)))
 		return -EINVAL;
 
 	ret = locks_verify_area(FLOCK_VERIFY_WRITE, inode, file, *pos, count);
@@ -238,7 +242,8 @@ ssize_t vfs_write(struct file *file, con
 			if (file->f_op->write)
 				ret = file->f_op->write(file, buf, count, pos);
 			else
-				ret = do_sync_write(file, buf, count, pos);
+				ret = do_sync_rw(file, (char *)buf, count, pos,
+						 file->f_op->aio_write, WRITE);
 			if (ret > 0)
 				dnotify_parent(file->f_dentry, DN_MODIFY);
 		}
@@ -331,29 +336,58 @@ unsigned long iov_shorten(struct iovec *
 	return seg;
 }
 
-static ssize_t do_readv_writev(int type, struct file *file,
+ssize_t compat_rwv(struct file *file, const struct iovec *vector,
+		   unsigned long nr_segs, loff_t *ppos,
+		   ssize_t (*fn)(struct file *, char *, size_t, loff_t *))
+{
+	ssize_t ret = 0;
+
+	if (NULL == fn)
+		return -EINVAL;
+
+	/* Do it by hand, with file-ops */
+	for (; nr_segs > 0; vector++, nr_segs--) {
+		void *base = vector->iov_base;
+		size_t len = vector->iov_len;
+		ssize_t nr = fn(file, base, len, ppos);
+
+		if (nr < 0) {
+			if (!ret)
+				ret = nr;
+			break;
+		}
+		ret += nr;
+		if (nr != len)
+			break;
+	}
+	return ret;
+}
+
+static ssize_t do_readv_writev(int fmode, int type, struct file *file,
 			       const struct iovec * vector,
 			       unsigned long nr_segs, loff_t *pos)
 {
 	typedef ssize_t (*io_fn_t)(struct file *, char *, size_t, loff_t *);
 	typedef ssize_t (*iov_fn_t)(struct file *, const struct iovec *, unsigned long, loff_t *);
+	typedef ssize_t (*ioa_fn_t)(struct rw_iocb *);
 
-	size_t tot_len;
+	size_t tot_len = 0;
 	struct iovec iovstack[UIO_FASTIOV];
-	struct iovec *iov=iovstack;
-	ssize_t ret;
+	struct iovec *iov = iovstack;
+	ssize_t ret = 0;
 	int seg;
 	io_fn_t fn;
 	iov_fn_t fnv;
-	struct inode *inode;
+	ioa_fn_t fna;
 
+	if (!(file->f_mode & fmode))
+		return -EBADF;
 	/*
 	 * SuS says "The readv() function *may* fail if the iovcnt argument
 	 * was less than or equal to 0, or greater than {IOV_MAX}.  Linux has
 	 * traditionally returned zero for zero segments, so...
 	 */
-	ret = 0;
-	if (nr_segs == 0)
+	if (unlikely(nr_segs == 0))
 		goto out;
 
 	/*
@@ -365,6 +399,7 @@ static ssize_t do_readv_writev(int type,
 		goto out;
 	if (!file->f_op)
 		goto out;
+
 	if (nr_segs > UIO_FASTIOV) {
 		ret = -ENOMEM;
 		iov = kmalloc(nr_segs*sizeof(struct iovec), GFP_KERNEL);
@@ -382,7 +417,6 @@ static ssize_t do_readv_writev(int type,
 	 *
 	 * Be careful here because iov_len is a size_t not an ssize_t
 	 */
-	tot_len = 0;
 	ret = -EINVAL;
 	for (seg = 0 ; seg < nr_segs; seg++) {
 		ssize_t tmp = tot_len;
@@ -393,55 +427,32 @@ static ssize_t do_readv_writev(int type,
 		if (tot_len < tmp) /* maths overflow on the ssize_t */
 			goto out;
 	}
-	if (tot_len == 0) {
-		ret = 0;
+	ret = 0;
+	if (tot_len == 0)
 		goto out;
-	}
 
-	inode = file->f_dentry->d_inode;
 	/* VERIFY_WRITE actually means a read, as we write to user space */
 	ret = locks_verify_area((type == READ 
 				 ? FLOCK_VERIFY_READ : FLOCK_VERIFY_WRITE),
-				inode, file, *pos, tot_len);
+				file->f_dentry->d_inode, file, *pos, tot_len);
 	if (ret)
 		goto out;
 
-	fnv = NULL;
 	if (type == READ) {
 		fn = file->f_op->read;
 		fnv = file->f_op->readv;
+		fna = file->f_op->aio_read;
 	} else {
 		fn = (io_fn_t)file->f_op->write;
 		fnv = file->f_op->writev;
+		fna = file->f_op->aio_write;
 	}
-	if (fnv) {
+	if (fnv)
 		ret = fnv(file, iov, nr_segs, pos);
-		goto out;
-	}
-
-	/* Do it by hand, with file-ops */
-	ret = 0;
-	vector = iov;
-	while (nr_segs > 0) {
-		void * base;
-		size_t len;
-		ssize_t nr;
-
-		base = vector->iov_base;
-		len = vector->iov_len;
-		vector++;
-		nr_segs--;
-
-		nr = fn(file, base, len, pos);
-
-		if (nr < 0) {
-			if (!ret) ret = nr;
-			break;
-		}
-		ret += nr;
-		if (nr != len)
-			break;
-	}
+	else if (fna)
+		ret = do_sync_rwv(file, NULL, tot_len, iov, nr_segs, pos, fna, type);
+	else
+		ret = compat_rwv(file, iov, nr_segs, pos, fn);
 out:
 	if (iov != iovstack)
 		kfree(iov);
@@ -454,23 +465,13 @@ out:
 ssize_t vfs_readv(struct file *file, const struct iovec *vec,
 		  unsigned long vlen, loff_t *pos)
 {
-	if (!(file->f_mode & FMODE_READ))
-		return -EBADF;
-	if (!file->f_op || (!file->f_op->readv && !file->f_op->read))
-		return -EINVAL;
-
-	return do_readv_writev(READ, file, vec, vlen, pos);
+	return do_readv_writev(FMODE_READ, READ, file, vec, vlen, pos);
 }
 
 ssize_t vfs_writev(struct file *file, const struct iovec *vec,
 		   unsigned long vlen, loff_t *pos)
 {
-	if (!(file->f_mode & FMODE_WRITE))
-		return -EBADF;
-	if (!file->f_op || (!file->f_op->writev && !file->f_op->write))
-		return -EINVAL;
-
-	return do_readv_writev(WRITE, file, vec, vlen, pos);
+	return do_readv_writev(FMODE_WRITE, WRITE, file, vec, vlen, pos);
 }
 
 
@@ -622,5 +623,4 @@ asmlinkage ssize_t sys_sendfile64(int ou
 	return do_sendfile(out_fd, in_fd, NULL, count, 0);
 }
 
-EXPORT_SYMBOL(do_sync_read);
-EXPORT_SYMBOL(do_sync_write);
+EXPORT_SYMBOL(do_sync_rw);
diff -purN linux-2.5/fs/reiserfs/file.c aio-2.5/fs/reiserfs/file.c
--- linux-2.5/fs/reiserfs/file.c	Tue Apr  1 15:17:56 2003
+++ aio-2.5/fs/reiserfs/file.c	Mon Mar 24 11:25:01 2003
@@ -141,8 +141,8 @@ out:
 }
 
 struct file_operations reiserfs_file_operations = {
-    .read	= generic_file_read,
-    .write	= generic_file_write,
+    .aio_read	= generic_file_aio_read,
+    .aio_write	= generic_file_aio_write,
     .ioctl	= reiserfs_ioctl,
     .mmap	= generic_file_mmap,
     .release	= reiserfs_file_release,
diff -purN linux-2.5/fs/smbfs/file.c aio-2.5/fs/smbfs/file.c
--- linux-2.5/fs/smbfs/file.c	Tue Apr  1 15:17:56 2003
+++ aio-2.5/fs/smbfs/file.c	Mon Mar 24 11:46:18 2003
@@ -216,13 +216,13 @@ smb_updatepage(struct file *file, struct
 }
 
 static ssize_t
-smb_file_read(struct file * file, char * buf, size_t count, loff_t *ppos)
+smb_file_read(struct kiocb *iocb, char *buf, size_t count, loff_t pos)
 {
-	struct dentry * dentry = file->f_dentry;
+	struct dentry * dentry = iocb->ki_filp->f_dentry;
 	ssize_t	status;
 
 	VERBOSE("file %s/%s, count=%lu@%lu\n", DENTRY_PATH(dentry),
-		(unsigned long) count, (unsigned long) *ppos);
+		(unsigned long) count, (unsigned long) pos);
 
 	status = smb_revalidate_inode(dentry);
 	if (status) {
@@ -235,7 +235,7 @@ smb_file_read(struct file * file, char *
 		(long)dentry->d_inode->i_size,
 		dentry->d_inode->i_flags, dentry->d_inode->i_atime);
 
-	status = generic_file_read(file, buf, count, ppos);
+	status = generic_file_aio_read(iocb, buf, count, pos);
 out:
 	return status;
 }
@@ -298,14 +298,14 @@ struct address_space_operations smb_file
  * Write to a file (through the page cache).
  */
 static ssize_t
-smb_file_write(struct file *file, const char *buf, size_t count, loff_t *ppos)
+smb_file_write(struct kiocb *iocb, const char *buf, size_t count, loff_t pos)
 {
-	struct dentry * dentry = file->f_dentry;
+	struct dentry * dentry = iocb->ki_filp->f_dentry;
 	ssize_t	result;
 
 	VERBOSE("file %s/%s, count=%lu@%lu\n",
 		DENTRY_PATH(dentry),
-		(unsigned long) count, (unsigned long) *ppos);
+		(unsigned long) count, (unsigned long) pos);
 
 	result = smb_revalidate_inode(dentry);
 	if (result) {
@@ -319,7 +319,7 @@ smb_file_write(struct file *file, const 
 		goto out;
 
 	if (count > 0) {
-		result = generic_file_write(file, buf, count, ppos);
+		result = generic_file_aio_write(iocb, buf, count, pos);
 		VERBOSE("pos=%ld, size=%ld, mtime=%ld, atime=%ld\n",
 			(long) file->f_pos, (long) dentry->d_inode->i_size,
 			dentry->d_inode->i_mtime, dentry->d_inode->i_atime);
@@ -384,8 +384,8 @@ smb_file_permission(struct inode *inode,
 struct file_operations smb_file_operations =
 {
 	.llseek		= remote_llseek,
-	.read		= smb_file_read,
-	.write		= smb_file_write,
+	.aio_read	= smb_file_read,
+	.aio_write	= smb_file_write,
 	.ioctl		= smb_ioctl,
 	.mmap		= smb_file_mmap,
 	.open		= smb_file_open,
diff -purN linux-2.5/fs/sysv/file.c aio-2.5/fs/sysv/file.c
--- linux-2.5/fs/sysv/file.c	Tue Apr  1 15:17:57 2003
+++ aio-2.5/fs/sysv/file.c	Mon Mar 24 11:25:05 2003
@@ -21,8 +21,8 @@
  */
 struct file_operations sysv_file_operations = {
 	.llseek		= generic_file_llseek,
-	.read		= generic_file_read,
-	.write		= generic_file_write,
+	.aio_read	= generic_file_aio_read,
+	.aio_write	= generic_file_aio_write,
 	.mmap		= generic_file_mmap,
 	.fsync		= sysv_sync_file,
 	.sendfile	= generic_file_sendfile,
diff -purN linux-2.5/fs/udf/file.c aio-2.5/fs/udf/file.c
--- linux-2.5/fs/udf/file.c	Tue Apr  1 15:17:57 2003
+++ aio-2.5/fs/udf/file.c	Mon Mar 24 11:25:05 2003
@@ -109,19 +109,17 @@ struct address_space_operations udf_adin
 	.commit_write		= udf_adinicb_commit_write,
 };
 
-static ssize_t udf_file_write(struct file * file, const char * buf,
-	size_t count, loff_t *ppos)
+static ssize_t udf_file_write(struct kiocb *iocb, const char * buf,
+			      size_t count, loff_t pos)
 {
 	ssize_t retval;
-	struct inode *inode = file->f_dentry->d_inode;
-	int err, pos;
+	struct inode *inode = iocb->ki_filp->f_dentry->d_inode;
+	int err;
 
 	if (UDF_I_ALLOCTYPE(inode) == ICBTAG_FLAG_AD_IN_ICB)
 	{
-		if (file->f_flags & O_APPEND)
+		if (iocb->ki_filp->f_flags & O_APPEND)
 			pos = inode->i_size;
-		else
-			pos = *ppos;
 
 		if (inode->i_sb->s_blocksize < (udf_file_entry_alloc_offset(inode) +
 			pos + count))
@@ -142,7 +140,7 @@ static ssize_t udf_file_write(struct fil
 		}
 	}
 
-	retval = generic_file_write(file, buf, count, ppos);
+	retval = generic_file_aio_write(iocb, buf, count, pos);
 
 	if (retval > 0)
 		mark_inode_dirty(inode);
@@ -275,11 +273,11 @@ static int udf_open_file(struct inode * 
 }
 
 struct file_operations udf_file_operations = {
-	.read			= generic_file_read,
+	.aio_read		= generic_file_aio_read,
+	.aio_write		= udf_file_write,
 	.ioctl			= udf_ioctl,
 	.open			= udf_open_file,
 	.mmap			= generic_file_mmap,
-	.write			= udf_file_write,
 	.release		= udf_release_file,
 	.fsync			= udf_fsync_file,
 	.sendfile		= generic_file_sendfile,
diff -purN linux-2.5/fs/ufs/file.c aio-2.5/fs/ufs/file.c
--- linux-2.5/fs/ufs/file.c	Tue Apr  1 15:17:57 2003
+++ aio-2.5/fs/ufs/file.c	Mon Mar 24 11:25:05 2003
@@ -43,8 +43,8 @@
  
 struct file_operations ufs_file_operations = {
 	.llseek		= generic_file_llseek,
-	.read		= generic_file_read,
-	.write		= generic_file_write,
+	.aio_read	= generic_file_aio_read,
+	.aio_write	= generic_file_aio_write,
 	.mmap		= generic_file_mmap,
 	.open           = generic_file_open,
 	.sendfile	= generic_file_sendfile,
diff -purN linux-2.5/include/linux/aio.h aio-2.5/include/linux/aio.h
--- linux-2.5/include/linux/aio.h	Tue Apr  1 15:18:08 2003
+++ aio-2.5/include/linux/aio.h	Tue Mar 25 15:35:56 2003
@@ -4,6 +4,7 @@
 #include <linux/list.h>
 #include <linux/workqueue.h>
 #include <linux/aio_abi.h>
+#include <linux/uio.h>
 
 #include <asm/atomic.h>
 
@@ -61,22 +62,68 @@ struct kiocb {
 
 	void			*ki_user_obj;	/* pointer to userland's iocb */
 	__u64			ki_user_data;	/* user's data for completion */
-	loff_t			ki_pos;
 
-	char			private[KIOCB_PRIVATE_SIZE];
+	long			private[0];	/* KIOCB_PRIVATE_SIZE alloc'd */
 };
 
-#define is_sync_kiocb(iocb)	((iocb)->ki_key == KIOCB_SYNC_KEY)
+struct rw_iocb {
+	struct kiocb	kiocb;
+	loff_t		rw_pos;
+	unsigned long	rw_nsegs;
+
+	struct iovec	*rw_iov;
+	struct iovec	rw_local_iov;
+
+	unsigned	rw : 2,			/* READ or WRITE */
+			rw_have_i_sem : 1;	/* true if we hold i_sem */
+};
+
+struct fsync_iocb {
+	struct kiocb	kiocb;
+	unsigned	dsync;
+};
+
+struct sync_iocb {
+	union {
+	struct kiocb		kiocb;
+	struct rw_iocb		rw_iocb;
+	struct fsync_iocb	fsync_iocb;
+	};
+	char		private[KIOCB_PRIVATE_SIZE];
+};
+
+typedef union {
+	struct kiocb		kiocb;
+	struct rw_iocb		rw_iocb;
+	struct sync_iocb	sync_iocb;
+} iocb_t;
+
+static inline struct rw_iocb *kiocb_to_rw_iocb(struct kiocb *iocb)
+{
+	return (struct rw_iocb *)iocb;
+}
+
+static inline struct fsync_iocb *kiocb_to_fsync_iocb(struct kiocb *iocb)
+{
+	return (struct fsync_iocb *)iocb;
+}
+
+static inline int is_sync_kiocb(struct kiocb *iocb)
+{
+	return iocb->ki_key == KIOCB_SYNC_KEY;
+}
+
 #define init_sync_kiocb(x, filp)			\
 	do {						\
+		struct kiocb *__iocb = (x);		\
 		struct task_struct *tsk = current;	\
-		(x)->ki_flags = 0;			\
-		(x)->ki_users = 1;			\
-		(x)->ki_key = KIOCB_SYNC_KEY;		\
-		(x)->ki_filp = (filp);			\
-		(x)->ki_ctx = &tsk->active_mm->default_kioctx;	\
-		(x)->ki_cancel = NULL;			\
-		(x)->ki_user_obj = tsk;			\
+		__iocb->ki_flags = 0;			\
+		__iocb->ki_users = 1;			\
+		__iocb->ki_key = KIOCB_SYNC_KEY;	\
+		__iocb->ki_filp = (filp);		\
+		__iocb->ki_ctx = &tsk->active_mm->default_kioctx;\
+		__iocb->ki_cancel = NULL;		\
+		__iocb->ki_user_obj = tsk;		\
 	} while (0)
 
 #define AIO_RING_MAGIC			0xa10a10a1
diff -purN linux-2.5/include/linux/fs.h aio-2.5/include/linux/fs.h
--- linux-2.5/include/linux/fs.h	Tue Mar 25 16:38:50 2003
+++ aio-2.5/include/linux/fs.h	Tue Mar 25 15:29:13 2003
@@ -21,6 +21,8 @@
 #include <linux/kobject.h>
 #include <asm/atomic.h>
 
+struct rw_iocb;
+struct fsync_iocb;
 struct iovec;
 struct nameidata;
 struct pipe_inode_info;
@@ -305,7 +307,7 @@ struct address_space_operations {
 	sector_t (*bmap)(struct address_space *, sector_t);
 	int (*invalidatepage) (struct page *, unsigned long);
 	int (*releasepage) (struct page *, int);
-	int (*direct_IO)(int, struct kiocb *, const struct iovec *iov,
+	int (*direct_IO)(int, struct rw_iocb *, const struct iovec *iov,
 			loff_t offset, unsigned long nr_segs);
 };
 
@@ -706,9 +708,9 @@ struct file_operations {
 	struct module *owner;
 	loff_t (*llseek) (struct file *, loff_t, int);
 	ssize_t (*read) (struct file *, char *, size_t, loff_t *);
-	ssize_t (*aio_read) (struct kiocb *, char *, size_t, loff_t);
+	ssize_t (*aio_read) (struct rw_iocb *);
 	ssize_t (*write) (struct file *, const char *, size_t, loff_t *);
-	ssize_t (*aio_write) (struct kiocb *, const char *, size_t, loff_t);
+	ssize_t (*aio_write) (struct rw_iocb *);
 	int (*readdir) (struct file *, void *, filldir_t);
 	unsigned int (*poll) (struct file *, struct poll_table_struct *);
 	int (*ioctl) (struct inode *, struct file *, unsigned int, unsigned long);
@@ -717,7 +719,7 @@ struct file_operations {
 	int (*flush) (struct file *);
 	int (*release) (struct inode *, struct file *);
 	int (*fsync) (struct file *, struct dentry *, int datasync);
-	int (*aio_fsync) (struct kiocb *, int datasync);
+	int (*aio_fsync) (struct fsync_iocb *);
 	int (*fasync) (int, struct file *, int);
 	int (*lock) (struct file *, int, struct file_lock *);
 	ssize_t (*readv) (struct file *, const struct iovec *, unsigned long, loff_t *);
@@ -1200,32 +1202,22 @@ extern int generic_file_mmap(struct file
 extern int generic_file_readonly_mmap(struct file *, struct vm_area_struct *);
 extern int file_read_actor(read_descriptor_t * desc, struct page *page, unsigned long offset, unsigned long size);
 extern int file_send_actor(read_descriptor_t * desc, struct page *page, unsigned long offset, unsigned long size);
-extern ssize_t generic_file_read(struct file *, char *, size_t, loff_t *);
 int generic_write_checks(struct inode *inode, struct file *file,
 			loff_t *pos, size_t *count, int isblk);
-extern ssize_t generic_file_write(struct file *, const char *, size_t, loff_t *);
-extern ssize_t generic_file_aio_read(struct kiocb *, char *, size_t, loff_t);
-extern ssize_t generic_file_aio_write(struct kiocb *, const char *, size_t, loff_t);
-extern ssize_t generic_file_aio_write_nolock(struct kiocb *, const struct iovec *,
-				unsigned long, loff_t *);
-extern ssize_t do_sync_read(struct file *filp, char *buf, size_t len, loff_t *ppos);
-extern ssize_t do_sync_write(struct file *filp, const char *buf, size_t len, loff_t *ppos);
-ssize_t generic_file_write_nolock(struct file *file, const struct iovec *iov,
-				unsigned long nr_segs, loff_t *ppos);
+extern ssize_t generic_file_aio_read(struct rw_iocb *);
+extern ssize_t generic_file_aio_write(struct rw_iocb *);
+extern ssize_t generic_file_aio_write_nolock(struct rw_iocb *);
+extern ssize_t FASTCALL(do_sync_rw(struct file *filp, char *buf, size_t len, loff_t *ppos, ssize_t (*op)(struct rw_iocb *), int type));
 extern ssize_t generic_file_sendfile(struct file *, loff_t *, size_t, read_actor_t, void *);
 extern void do_generic_mapping_read(struct address_space *, struct file_ra_state *, struct file *,
 				    loff_t *, read_descriptor_t *, read_actor_t);
 extern void
 file_ra_state_init(struct file_ra_state *ra, struct address_space *mapping);
-extern ssize_t generic_file_direct_IO(int rw, struct kiocb *iocb,
+extern ssize_t generic_file_direct_IO(int rw, struct rw_iocb *iocb,
 	const struct iovec *iov, loff_t offset, unsigned long nr_segs);
-extern int blockdev_direct_IO(int rw, struct kiocb *iocb, struct inode *inode, 
+extern int blockdev_direct_IO(int rw, struct rw_iocb *iocb, struct inode *inode, 
 	struct block_device *bdev, const struct iovec *iov, loff_t offset, 
 	unsigned long nr_segs, get_blocks_t *get_blocks);
-extern ssize_t generic_file_readv(struct file *filp, const struct iovec *iov, 
-	unsigned long nr_segs, loff_t *ppos);
-ssize_t generic_file_writev(struct file *filp, const struct iovec *iov, 
-			unsigned long nr_segs, loff_t *ppos);
 extern loff_t no_llseek(struct file *file, loff_t offset, int origin);
 extern loff_t generic_file_llseek(struct file *file, loff_t offset, int origin);
 extern loff_t remote_llseek(struct file *file, loff_t offset, int origin);
diff -purN linux-2.5/include/net/sock.h aio-2.5/include/net/sock.h
--- linux-2.5/include/net/sock.h	Tue Apr  1 15:18:12 2003
+++ aio-2.5/include/net/sock.h	Mon Mar 24 17:26:21 2003
@@ -302,6 +302,7 @@ static __inline__ void sock_prot_dec_use
 
 /* sock_iocb: used to kick off async processing of socket ios */
 struct sock_iocb {
+	struct rw_iocb		iocb;
 	struct list_head	list;
 
 	int			flags;
@@ -310,18 +311,23 @@ struct sock_iocb {
 	struct sock		*sk;
 	struct scm_cookie	*scm;
 	struct msghdr		*msg, async_msg;
-	struct iovec		async_iov;
 };
 
 static inline struct sock_iocb *kiocb_to_siocb(struct kiocb *iocb)
 {
-	BUG_ON(sizeof(struct sock_iocb) > KIOCB_PRIVATE_SIZE);
-	return (struct sock_iocb *)iocb->private;
+	BUG_ON(sizeof(struct sock_iocb) > sizeof(iocb_t));
+	return (struct sock_iocb *)iocb;
+}
+
+static inline struct sock_iocb *rw_iocb_to_siocb(struct rw_iocb *iocb)
+{
+	BUG_ON(sizeof(struct sock_iocb) > sizeof(iocb_t));
+	return (struct sock_iocb *)iocb;
 }
 
 static inline struct kiocb *siocb_to_kiocb(struct sock_iocb *si)
 {
-	return container_of((void *)si, struct kiocb, private);
+	return &si->iocb.kiocb;
 }
 
 struct socket_alloc {
diff -purN linux-2.5/kernel/ksyms.c aio-2.5/kernel/ksyms.c
--- linux-2.5/kernel/ksyms.c	Tue Apr  1 15:18:13 2003
+++ aio-2.5/kernel/ksyms.c	Mon Mar 24 12:13:45 2003
@@ -226,12 +226,9 @@ EXPORT_SYMBOL(cont_prepare_write);
 EXPORT_SYMBOL(generic_commit_write);
 EXPORT_SYMBOL(block_truncate_page);
 EXPORT_SYMBOL(generic_block_bmap);
-EXPORT_SYMBOL(generic_file_read);
 EXPORT_SYMBOL(generic_file_sendfile);
 EXPORT_SYMBOL(do_generic_mapping_read);
 EXPORT_SYMBOL(file_ra_state_init);
-EXPORT_SYMBOL(generic_file_write);
-EXPORT_SYMBOL(generic_file_write_nolock);
 EXPORT_SYMBOL(generic_file_mmap);
 EXPORT_SYMBOL(generic_file_readonly_mmap);
 EXPORT_SYMBOL(generic_ro_fops);
@@ -356,8 +353,6 @@ EXPORT_SYMBOL(ioctl_by_bdev);
 EXPORT_SYMBOL(read_dev_sector);
 EXPORT_SYMBOL(init_buffer);
 EXPORT_SYMBOL_GPL(generic_file_direct_IO);
-EXPORT_SYMBOL(generic_file_readv);
-EXPORT_SYMBOL(generic_file_writev);
 EXPORT_SYMBOL(iov_shorten);
 EXPORT_SYMBOL_GPL(default_backing_dev_info);
 
diff -purN linux-2.5/mm/filemap.c aio-2.5/mm/filemap.c
--- linux-2.5/mm/filemap.c	Tue Apr  1 15:18:14 2003
+++ aio-2.5/mm/filemap.c	Mon Mar 24 17:24:19 2003
@@ -711,18 +711,16 @@ success:
  * This is the "read()" routine for all filesystems
  * that can use the page cache directly.
  */
-static ssize_t
-__generic_file_aio_read(struct kiocb *iocb, const struct iovec *iov,
-		unsigned long nr_segs, loff_t *ppos)
+ssize_t generic_file_aio_read(struct rw_iocb *iocb)
 {
-	struct file *filp = iocb->ki_filp;
+	struct file *filp = iocb->kiocb.ki_filp;
 	ssize_t retval;
-	unsigned long seg;
+	unsigned long seg, nr_segs = iocb->rw_nsegs;
 	size_t count;
 
 	count = 0;
 	for (seg = 0; seg < nr_segs; seg++) {
-		const struct iovec *iv = &iov[seg];
+		const struct iovec *iv = &iocb->rw_iov[seg];
 
 		/*
 		 * If any segment has a negative length, or the cumulative
@@ -742,7 +740,7 @@ __generic_file_aio_read(struct kiocb *io
 
 	/* coalesce the iovecs and go direct-to-BIO for O_DIRECT */
 	if (filp->f_flags & O_DIRECT) {
-		loff_t pos = *ppos, size;
+		loff_t pos = iocb->rw_pos, size;
 		struct address_space *mapping;
 		struct inode *inode;
 
@@ -754,11 +752,11 @@ __generic_file_aio_read(struct kiocb *io
 		size = inode->i_size;
 		if (pos < size) {
 			retval = generic_file_direct_IO(READ, iocb,
-						iov, pos, nr_segs);
-			if (retval >= 0 && !is_sync_kiocb(iocb))
+						iocb->rw_iov, pos, nr_segs);
+			if (retval >= 0 && !is_sync_kiocb(&iocb->kiocb))
 				retval = -EIOCBQUEUED;
 			if (retval > 0)
-				*ppos = pos + retval;
+				iocb->rw_pos = pos + retval;
 		}
 		UPDATE_ATIME(filp->f_dentry->d_inode);
 		goto out;
@@ -770,12 +768,12 @@ __generic_file_aio_read(struct kiocb *io
 			read_descriptor_t desc;
 
 			desc.written = 0;
-			desc.buf = iov[seg].iov_base;
-			desc.count = iov[seg].iov_len;
+			desc.buf = iocb->rw_iov[seg].iov_base;
+			desc.count = iocb->rw_iov[seg].iov_len;
 			if (desc.count == 0)
 				continue;
 			desc.error = 0;
-			do_generic_file_read(filp,ppos,&desc,file_read_actor);
+			do_generic_file_read(filp,&iocb->rw_pos,&desc,file_read_actor);
 			retval += desc.written;
 			if (!retval) {
 				retval = desc.error;
@@ -787,30 +785,8 @@ out:
 	return retval;
 }
 
-ssize_t
-generic_file_aio_read(struct kiocb *iocb, char *buf, size_t count, loff_t pos)
-{
-	struct iovec local_iov = { .iov_base = buf, .iov_len = count };
-
-	BUG_ON(iocb->ki_pos != pos);
-	return __generic_file_aio_read(iocb, &local_iov, 1, &iocb->ki_pos);
-}
 EXPORT_SYMBOL(generic_file_aio_read);
 
-ssize_t
-generic_file_read(struct file *filp, char *buf, size_t count, loff_t *ppos)
-{
-	struct iovec local_iov = { .iov_base = buf, .iov_len = count };
-	struct kiocb kiocb;
-	ssize_t ret;
-
-	init_sync_kiocb(&kiocb, filp);
-	ret = __generic_file_aio_read(&kiocb, &local_iov, 1, ppos);
-	if (-EIOCBQUEUED == ret)
-		ret = wait_on_sync_kiocb(&kiocb);
-	return ret;
-}
-
 int file_send_actor(read_descriptor_t * desc, struct page *page, unsigned long offset, unsigned long size)
 {
 	ssize_t written;
@@ -1572,11 +1548,9 @@ EXPORT_SYMBOL(generic_write_checks);
  * it for writing by marking it dirty.
  *							okir@monad.swb.de
  */
-ssize_t
-generic_file_aio_write_nolock(struct kiocb *iocb, const struct iovec *iov,
-				unsigned long nr_segs, loff_t *ppos)
+ssize_t generic_file_aio_write_nolock(struct rw_iocb *iocb)
 {
-	struct file *file = iocb->ki_filp;
+	struct file *file = iocb->kiocb.ki_filp;
 	struct address_space * mapping = file->f_dentry->d_inode->i_mapping;
 	struct address_space_operations *a_ops = mapping->a_ops;
 	size_t ocount;		/* original count */
@@ -1591,14 +1565,14 @@ generic_file_aio_write_nolock(struct kio
 	ssize_t		err;
 	size_t		bytes;
 	struct pagevec	lru_pvec;
-	const struct iovec *cur_iov = iov; /* current iovec */
+	const struct iovec *cur_iov = iocb->rw_iov; /* current iovec */
 	size_t		iov_base = 0;	   /* offset in the current iovec */
-	unsigned long	seg;
+	unsigned long	seg, nr_segs = iocb->rw_nsegs;
 	char		*buf;
 
 	ocount = 0;
 	for (seg = 0; seg < nr_segs; seg++) {
-		const struct iovec *iv = &iov[seg];
+		const struct iovec *iv = &iocb->rw_iov[seg];
 
 		/*
 		 * If any segment has a negative length, or the cumulative
@@ -1617,7 +1591,7 @@ generic_file_aio_write_nolock(struct kio
 	}
 
 	count = ocount;
-	pos = *ppos;
+	pos = iocb->rw_pos;
 	pagevec_init(&lru_pvec, 0);
 
 	/* We can write back this queue in page reclaim */
@@ -1638,17 +1612,17 @@ generic_file_aio_write_nolock(struct kio
 	/* coalesce the iovecs and go direct-to-BIO for O_DIRECT */
 	if (unlikely(file->f_flags & O_DIRECT)) {
 		if (count != ocount)
-			nr_segs = iov_shorten((struct iovec *)iov,
+			nr_segs = iov_shorten(iocb->rw_iov,
 						nr_segs, count);
 		written = generic_file_direct_IO(WRITE, iocb,
-					iov, pos, nr_segs);
+					iocb->rw_iov, pos, nr_segs);
 		if (written > 0) {
 			loff_t end = pos + written;
 			if (end > inode->i_size && !isblk) {
 				inode->i_size = end;
 				mark_inode_dirty(inode);
 			}
-			*ppos = end;
+			iocb->rw_pos = end;
 		}
 		/*
 		 * Sync the fs metadata but not the minor inode changes and
@@ -1656,12 +1630,12 @@ generic_file_aio_write_nolock(struct kio
 		 */
 		if (written >= 0 && file->f_flags & O_SYNC)
 			status = generic_osync_inode(inode, OSYNC_METADATA);
-		if (written >= 0 && !is_sync_kiocb(iocb))
+		if (written >= 0 && !is_sync_kiocb(&iocb->kiocb))
 			written = -EIOCBQUEUED;
 		goto out_status;
 	}
 
-	buf = iov->iov_base;
+	buf = iocb->rw_iov->iov_base;
 	do {
 		unsigned long index;
 		unsigned long offset;
@@ -1732,7 +1706,7 @@ generic_file_aio_write_nolock(struct kio
 		balance_dirty_pages_ratelimited(mapping);
 		cond_resched();
 	} while (count);
-	*ppos = pos;
+	iocb->rw_pos = pos;
 
 	if (cached_page)
 		page_cache_release(cached_page);
@@ -1754,84 +1728,23 @@ out:
 	return err;
 }
 
-ssize_t
-generic_file_write_nolock(struct file *file, const struct iovec *iov,
-				unsigned long nr_segs, loff_t *ppos)
-{
-	struct kiocb kiocb;
-	ssize_t ret;
-
-	init_sync_kiocb(&kiocb, file);
-	ret = generic_file_aio_write_nolock(&kiocb, iov, nr_segs, ppos);
-	if (-EIOCBQUEUED == ret)
-		ret = wait_on_sync_kiocb(&kiocb);
-	return ret;
-}
-
-ssize_t generic_file_aio_write(struct kiocb *iocb, const char *buf,
-			       size_t count, loff_t pos)
+ssize_t generic_file_aio_write(struct rw_iocb *iocb)
 {
-	struct file *file = iocb->ki_filp;
-	struct inode *inode = file->f_dentry->d_inode->i_mapping->host;
 	ssize_t err;
-	struct iovec local_iov = { .iov_base = (void *)buf, .iov_len = count };
-
-	BUG_ON(iocb->ki_pos != pos);
-
-	down(&inode->i_sem);
-	err = generic_file_aio_write_nolock(iocb, &local_iov, 1, 
-						&iocb->ki_pos);
-	up(&inode->i_sem);
 
+	down(&iocb->kiocb.ki_filp->f_dentry->d_inode->i_sem);
+	err = generic_file_aio_write_nolock(iocb);
+	up(&iocb->kiocb.ki_filp->f_dentry->d_inode->i_sem);
 	return err;
 }
 EXPORT_SYMBOL(generic_file_aio_write);
 EXPORT_SYMBOL(generic_file_aio_write_nolock);
 
-ssize_t generic_file_write(struct file *file, const char *buf,
-			   size_t count, loff_t *ppos)
-{
-	struct inode	*inode = file->f_dentry->d_inode->i_mapping->host;
-	ssize_t		err;
-	struct iovec local_iov = { .iov_base = (void *)buf, .iov_len = count };
-
-	down(&inode->i_sem);
-	err = generic_file_write_nolock(file, &local_iov, 1, ppos);
-	up(&inode->i_sem);
-
-	return err;
-}
-
-ssize_t generic_file_readv(struct file *filp, const struct iovec *iov,
-			unsigned long nr_segs, loff_t *ppos)
-{
-	struct kiocb kiocb;
-	ssize_t ret;
-
-	init_sync_kiocb(&kiocb, filp);
-	ret = __generic_file_aio_read(&kiocb, iov, nr_segs, ppos);
-	if (-EIOCBQUEUED == ret)
-		ret = wait_on_sync_kiocb(&kiocb);
-	return ret;
-}
-
-ssize_t generic_file_writev(struct file *file, const struct iovec *iov,
-			unsigned long nr_segs, loff_t * ppos) 
-{
-	struct inode *inode = file->f_dentry->d_inode;
-	ssize_t ret;
-
-	down(&inode->i_sem);
-	ret = generic_file_write_nolock(file, iov, nr_segs, ppos);
-	up(&inode->i_sem);
-	return ret;
-}
-
 ssize_t
-generic_file_direct_IO(int rw, struct kiocb *iocb, const struct iovec *iov,
+generic_file_direct_IO(int rw, struct rw_iocb *iocb, const struct iovec *iov,
 	loff_t offset, unsigned long nr_segs)
 {
-	struct file *file = iocb->ki_filp;
+	struct file *file = iocb->kiocb.ki_filp;
 	struct address_space *mapping = file->f_dentry->d_inode->i_mapping;
 	ssize_t retval;
 
diff -purN linux-2.5/net/socket.c aio-2.5/net/socket.c
--- linux-2.5/net/socket.c	Tue Apr  1 15:18:19 2003
+++ aio-2.5/net/socket.c	Tue Mar 25 14:29:33 2003
@@ -95,10 +95,8 @@
 #include <linux/netfilter.h>
 
 static int sock_no_open(struct inode *irrelevant, struct file *dontcare);
-static ssize_t sock_aio_read(struct kiocb *iocb, char *buf,
-			 size_t size, loff_t pos);
-static ssize_t sock_aio_write(struct kiocb *iocb, const char *buf,
-			  size_t size, loff_t pos);
+static ssize_t sock_aio_read(struct rw_iocb *iocb);
+static ssize_t sock_aio_write(struct rw_iocb *iocb);
 static int sock_mmap(struct file *file, struct vm_area_struct * vma);
 
 static int sock_close(struct inode *inode, struct file *file);
@@ -528,6 +526,7 @@ static inline int __sock_sendmsg(struct 
 	si->scm = NULL;
 	si->msg = msg;
 	si->size = size;
+	si->flags = 0;
 
 	err = security_socket_sendmsg(sock, msg, size);
 	if (err)
@@ -538,13 +537,14 @@ static inline int __sock_sendmsg(struct 
 
 int sock_sendmsg(struct socket *sock, struct msghdr *msg, int size)
 {
-	struct kiocb iocb;
+	struct sync_iocb iocb;
 	int ret;
 
-	init_sync_kiocb(&iocb, NULL);
-	ret = __sock_sendmsg(&iocb, sock, msg, size);
+	init_sync_kiocb(&iocb.kiocb, NULL);
+	
+	ret = __sock_sendmsg(&iocb.kiocb, sock, msg, size);
 	if (-EIOCBQUEUED == ret)
-		ret = wait_on_sync_kiocb(&iocb);
+		ret = wait_on_sync_kiocb(&iocb.kiocb);
 	return ret;
 }
 
@@ -569,13 +569,13 @@ static inline int __sock_recvmsg(struct 
 
 int sock_recvmsg(struct socket *sock, struct msghdr *msg, int size, int flags)
 {
-	struct kiocb iocb;
+	struct sync_iocb iocb;
 	int ret;
 
-        init_sync_kiocb(&iocb, NULL);
-	ret = __sock_recvmsg(&iocb, sock, msg, size, flags);
+        init_sync_kiocb(&iocb.kiocb, NULL);
+	ret = __sock_recvmsg(&iocb.kiocb, sock, msg, size, flags);
 	if (-EIOCBQUEUED == ret)
-		ret = wait_on_sync_kiocb(&iocb);
+		ret = wait_on_sync_kiocb(&iocb.kiocb);
 	return ret;
 }
 
@@ -584,31 +584,29 @@ int sock_recvmsg(struct socket *sock, st
  *	area ubuf...ubuf+size-1 is writable before asking the protocol.
  */
 
-static ssize_t sock_aio_read(struct kiocb *iocb, char *ubuf,
-			 size_t size, loff_t pos)
+static ssize_t sock_aio_read(struct rw_iocb *iocb)
 {
-	struct sock_iocb *x = kiocb_to_siocb(iocb);
+	struct sock_iocb *x = rw_iocb_to_siocb(iocb);
 	struct socket *sock;
 	int flags;
 
-	if (pos != 0)
+	if (x->iocb.rw_pos != 0)
 		return -ESPIPE;
-	if (size==0)		/* Match SYS5 behaviour */
+	if (0 == x->iocb.rw_local_iov.iov_len)
 		return 0;
 
-	sock = SOCKET_I(iocb->ki_filp->f_dentry->d_inode); 
+	sock = SOCKET_I(x->iocb.kiocb.ki_filp->f_dentry->d_inode); 
 
 	x->async_msg.msg_name = NULL;
 	x->async_msg.msg_namelen = 0;
-	x->async_msg.msg_iov = &x->async_iov;
-	x->async_msg.msg_iovlen = 1;
+	x->async_msg.msg_iov = x->iocb.rw_iov;
+	x->async_msg.msg_iovlen = x->iocb.rw_nsegs;
 	x->async_msg.msg_control = NULL;
 	x->async_msg.msg_controllen = 0;
-	x->async_iov.iov_base = ubuf;
-	x->async_iov.iov_len = size;
-	flags = !(iocb->ki_filp->f_flags & O_NONBLOCK) ? 0 : MSG_DONTWAIT;
+	flags = !(x->iocb.kiocb.ki_filp->f_flags & O_NONBLOCK) ? 0 : MSG_DONTWAIT;
 
-	return __sock_recvmsg(iocb, sock, &x->async_msg, size, flags);
+	return __sock_recvmsg(&x->iocb.kiocb, sock, &x->async_msg,
+			      x->iocb.rw_local_iov.iov_len, flags);
 }
 
 
@@ -617,32 +615,29 @@ static ssize_t sock_aio_read(struct kioc
  *	is readable by the user process.
  */
 
-static ssize_t sock_aio_write(struct kiocb *iocb, const char *ubuf,
-			  size_t size, loff_t pos)
+static ssize_t sock_aio_write(struct rw_iocb *iocb)
 {
-	struct sock_iocb *x = kiocb_to_siocb(iocb);
+	struct sock_iocb *x = rw_iocb_to_siocb(iocb);
 	struct socket *sock;
 	
-	if (pos != 0)
+	if (x->iocb.rw_pos != 0)
 		return -ESPIPE;
-	if(size==0)		/* Match SYS5 behaviour */
+	if (0 == x->iocb.rw_local_iov.iov_len)
 		return 0;
 
-	sock = SOCKET_I(iocb->ki_filp->f_dentry->d_inode); 
+	sock = SOCKET_I(x->iocb.kiocb.ki_filp->f_dentry->d_inode); 
 
 	x->async_msg.msg_name = NULL;
 	x->async_msg.msg_namelen = 0;
-	x->async_msg.msg_iov = &x->async_iov;
-	x->async_msg.msg_iovlen = 1;
+	x->async_msg.msg_iov = x->iocb.rw_iov;
+	x->async_msg.msg_iovlen = x->iocb.rw_nsegs;
 	x->async_msg.msg_control = NULL;
 	x->async_msg.msg_controllen = 0;
-	x->async_msg.msg_flags = !(iocb->ki_filp->f_flags & O_NONBLOCK) ? 0 : MSG_DONTWAIT;
+	x->async_msg.msg_flags = !(x->iocb.kiocb.ki_filp->f_flags & O_NONBLOCK) ? 0 : MSG_DONTWAIT;
 	if (sock->type == SOCK_SEQPACKET)
 		x->async_msg.msg_flags |= MSG_EOR;
-	x->async_iov.iov_base = (void *)ubuf;
-	x->async_iov.iov_len = size;
 	
-	return __sock_sendmsg(iocb, sock, &x->async_msg, size);
+	return __sock_sendmsg(&iocb->kiocb, sock, &x->async_msg, x->iocb.rw_local_iov.iov_len);
 }
 
 ssize_t sock_sendpage(struct file *file, struct page *page,

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [Patch 2/2] Retry based aio read - filesystem read changes
  2003-03-31 19:16           ` Benjamin LaHaise
  2003-03-31 19:07             ` Janet Morgan
  2003-03-31 19:17             ` William Lee Irwin III
@ 2003-04-07  3:51             ` Suparna Bhattacharya
  2 siblings, 0 replies; 15+ messages in thread
From: Suparna Bhattacharya @ 2003-04-07  3:51 UTC (permalink / raw)
  To: Benjamin LaHaise
  Cc: William Lee Irwin III, Janet Morgan, akpm, linux-aio, linux-kernel

On Mon, Mar 31, 2003 at 02:16:29PM -0500, Benjamin LaHaise wrote:
> On Mon, Mar 31, 2003 at 11:11:23AM -0800, William Lee Irwin III wrote:
> > Can you tell whether these are due to hash collisions or contention on
> > the same page?
> 
> No, they're most likely waiting for io to complete.
> 
> To clean this up I've got a patch to move from aio_read/write with all the 
> parameters to a single parameter based rw-specific iocb.  That makes the 
> retry for read and write more ameniable to sharing common logic akin to the 
> wtd_ ops, which we need at the very least for the semaphore operations.

Do you also have a patch for handling semaphore operations ?

Regards
Suparna

-- 
Suparna Bhattacharya (suparna@in.ibm.com)
Linux Technology Center
IBM Software Labs, India


^ permalink raw reply	[flat|nested] 15+ messages in thread

end of thread, other threads:[~2003-04-07  3:38 UTC | newest]

Thread overview: 15+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2003-03-05  9:17 [RFC][Patch] Retry based aio read for filesystems Suparna Bhattacharya
2003-03-05  9:26 ` [Patch 1/2] Retry based aio read - core aio changes Suparna Bhattacharya
2003-03-14 13:23   ` Suparna Bhattacharya
2003-03-05  9:30 ` [Patch 2/2] Retry based aio read - filesystem read changes Suparna Bhattacharya
2003-03-05 10:42   ` Andrew Morton
2003-03-05 12:14     ` Suparna Bhattacharya
2003-03-31 18:32       ` Janet Morgan
2003-03-31 19:11         ` William Lee Irwin III
2003-03-31 19:16           ` Benjamin LaHaise
2003-03-31 19:07             ` Janet Morgan
2003-04-01 20:24               ` Benjamin LaHaise
2003-03-31 19:17             ` William Lee Irwin III
2003-03-31 19:25               ` Benjamin LaHaise
2003-04-07  3:51             ` Suparna Bhattacharya
2003-03-05 23:00 ` [RFC][Patch] Retry based aio read for filesystems Janet Morgan

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).