linux-kernel.vger.kernel.org archive mirror
* Filesystem AIO read-write patches
@ 2003-04-24  4:52 Suparna Bhattacharya
  2003-04-24  5:00 ` [PATCH 1/7] Filesystem AIO rdwr - aio retry core changes Suparna Bhattacharya
                   ` (7 more replies)
  0 siblings, 8 replies; 11+ messages in thread
From: Suparna Bhattacharya @ 2003-04-24  4:52 UTC (permalink / raw)
  To: bcrl, akpm; +Cc: linux-kernel, linux-aio, linux-fsdevel

Here is a revised version of the filesystem AIO patches
for 2.5.68.

It is built on a variation of the simple retry based 
scheme originally suggested by Ben LaHaise. 

Why ?
------
Because 2.5 still lacks real support for regular 
filesystem AIO (except for O_DIRECT). 

ext2, jfs and nfs define the fops aio interfaces aio_read 
and aio_write to default to generic_file_aio_read/write.
However, these routines behave fully synchronously
unless the file was opened with O_DIRECT. This means that 
an io_submit could merrily block for regular aio read/write
operations, while the application thinks it's doing async 
i/o.

How ?
------
The approach we took was to identify and focus on
the most significant blocking points (backed by 
observations from initial experimentation and profiling 
results), and convert them into retry exit points. 

Retries start at a very high level, driven directly by 
the aio infrastructure (in future, if the in-kernel fs
APIs change, retries could be moved one level below,
i.e. to the API level). They are kicked off via async 
wait queue functions. In synchronous i/o context the 
default wait queue entries are synchronous, and hence 
don't cause an exit at a retry point.
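
To make that concrete, here is the async wake function from the
core patch below (condensed): waking the iocb's wait queue entry
does not wake a sleeping task, it kicks the iocb so that ki_retry
gets re-driven from the aio workqueue.

    /* Condensed from 01aioretry.patch: the async wait queue function.
     * Instead of waking a task, it re-queues the iocb for a retry.
     */
    int aio_wake_function(wait_queue_t *wait, unsigned mode, int sync)
    {
            struct kiocb *iocb = container_of(wait, struct kiocb, ki_wait);

            list_del_init(&wait->task_list);
            kick_iocb(iocb);  /* queue on ctx->run_list, schedule aio workqueue */
            return 1;
    }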

One of the considerations was to try to take a careful
and less intrusive route with minimal changes to existing
synchronous i/o paths. The intent was to achieve a 
reasonable level of asynchrony in a way that could then
be further optimized and tuned for workloads of relevance.

The Patches:
-----------
(which I'll be mailing out as responses to this note)
01aioretry.patch 	: Base aio retry infrastructure 
02aiord.patch	 	: Filesystem aio read 
03aiowr.patch	 	: Minimal filesystem aio write 
			(for all archs and all filesystems 
			 using generic_file_aio_write)
04down_wq-86.patch 	: An asynchronous semaphore down
		     	implementation (currently x86 
			only)
05aiowrdown_wq.patch 	: Uses async down for aio write
06bread_wq.patch	: Async bread implementation
07ext2getblk_wq.patch	: Async get block support for 
			  the ext2 filesystem

Observations
--------------
As a quick check that this really works, I observed
a decent reduction in the time spent in io_submit 
(especially for large reads) when the file is not already 
cached (e.g. first-time access). For the write case, I 
found that I had to add the async get block support 
to get a perceptible benefit. For the cached case there
wasn't any observable difference, which is expected.
The patch didn't seem to hurt synchronous read/write
performance in a simple test.

Another thing I tried out was to temporarily move the 
retries into io_getevents rather than worker threads
just as a sanity check for any gross impact on cpu 
utilization. That seemed OK too. 

Of course, thorough performance testing is still needed;
it would show where there is scope for tuning and how
the patches affect overall system performance numbers.

I have been playing with it for a while now, and so 
far it's been running OK for me. 

I would welcome feedback, bug reports, test results etc.

Full diffstat:

aiordwr-rollup.patch:
......................
 arch/i386/kernel/i386_ksyms.c |    2 
 arch/i386/kernel/semaphore.c  |   30 ++-
 drivers/block/ll_rw_blk.c     |   21 +-
 fs/aio.c                      |  371 +++++++++++++++++++++++++++++++++---------
 fs/buffer.c                   |   54 +++++-
 fs/ext2/inode.c               |   44 +++-
 include/asm-i386/semaphore.h  |   27 ++-
 include/linux/aio.h           |   32 +++
 include/linux/blkdev.h        |    1 
 include/linux/buffer_head.h   |   30 +++
 include/linux/errno.h         |    1 
 include/linux/init_task.h     |    1 
 include/linux/pagemap.h       |   19 ++
 include/linux/sched.h         |    2 
 include/linux/wait.h          |    2 
 include/linux/writeback.h     |    4 
 kernel/fork.c                 |    9 -
 mm/filemap.c                  |   97 +++++++++-
 mm/page-writeback.c           |   17 +
 19 files changed, 616 insertions(+), 148 deletions(-)

[The patches are also available for download from the
 Linux Scalability Effort project site
 (http://sourceforge.net/projects/lse), categorized under
 the "aio" release in the IO Scalability section:
 http://sourceforge.net/project/showfiles.php?group_id=8875]

A rollup version containing all 7 patches 
(aiordwr-rollup.patch) will be made available as well.

Major additions/changes since previous versions posted:
------------------------------------------------------
- Introduced _wq versions of low level routines like
  lock_page_wq, wait_on_page_bit_wq etc., which take the 
  wait_queue entry as a parameter (thanks to Christoph
  Hellwig for suggesting the new and much better 
  names :)). 
- Reorganized code to avoid having to use the do_sync_op() 
  wrapper (because the forced emulation of the i/o wait 
  context seemed an overhead and not very elegant).
- New implementation of an asynchronous semaphore down 
  operation for x86 (down_wq).
- Dropped the async block allocation portions from the
  async ext2_get_block patch after a discussion with Stephen
  Tweedie (the i/o patterns we anticipate are less likely
  to extend file sizes).
- Fixed use_mm() to clear the lazy tlb setting (I traced some 
  of the strange hangs I was seeing for large reads to this).
- Removed the aio_run_iocbs() acceleration from io_getevents,
  now that the above problem is gone.

Todos/TBDs:
----------
- Add support for down_wq on other archs, or provide 
  compatibility definitions for archs where it is not 
  implemented (need feedback on this)
- Should the cond_resched() calls in read/write be 
  converted to retry points (would need ctx specific worker 
  threads) ?
- Look at async get block implementations for other 
  filesystems (e.g. jfs) ?
- Optional: Check if it makes sense to use retry model for 
  o_direct (or change sync o_direct to wait for completion 
  of async o_direct) ? 
- Upgrade to Ben's aio api changes (collapse of api parameters
  into an rw_iocb) if and when it gets merged

A few comments on low level implementation details:
--------------------------------------------------
io_wait context
-------------------
The task->io_wait field reflects the wait context in which
a task is executing its i/o operations. For synchronous i/o,
task->io_wait is NULL and the wait context is local, on
stack; for threads doing io submits or retries on behalf 
of async i/o callers, tsk->io_wait is a wait queue entry
with an async wake function, to be notified on completion 
of a condition required for the i/o to progress. 
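
In code terms, this is roughly what aio_run_iocb() in the core
patch does around each retry (condensed):

    /* Condensed from aio_run_iocb() in 01aioretry.patch */
    BUG_ON(current->io_wait != NULL);
    current->io_wait = &iocb->ki_wait;   /* async wait context for this retry */
    ret = retry(iocb);                   /* may return -EIOCBRETRY */
    current->io_wait = NULL;             /* back to sync (on-stack) waits */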

Low level _wq routines take a wait queue parameter, so
they can be invoked in either async or sync mode, even
when running in async context (e.g. servicing a page fault
during an async retry). 

Routines which are expected to be async whenever they run 
in async context, and sync when running in sync context, 
do not need an explicit wait queue parameter from their
callers; they pick up current->io_wait directly.
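
For example (both forms appear in, or are equivalent to, code in
the aio read patch): passing NULL waits synchronously, while the
aio read path passes the task's wait context and exits with
-EIOCBRETRY instead of blocking.

    /* sync: waits exactly like lock_page() */
    lock_page_wq(page, NULL);

    /* async (roughly as in do_generic_file_read in 02aiord.patch):
     * bail out; the iocb is kicked when the page is unlocked */
    if (lock_page_wq(page, current->io_wait))
            return -EIOCBRETRY;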

do_sync_op()
---------------
The do_sync_op() wrappers are not typically needed anymore
for sync versions of the operations; passing NULL to the
corresponding _wq functions suffices. 

However, there may be odd cases with several levels of 
nesting like:
A()->B()->C()->D()->F()->iowait() 
It may seem unnatural to pass a wait queue argument all 
the way through, but if we need to force sync behaviour 
in one case even when called under async context, and 
have async behaviour in another, then we may need to 
resort to using do_sync_op() (e.g. if we had kept the 
ext2 async block allocation modifications), as in the 
sketch below. 
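
An illustrative (hypothetical) use of the do_sync_op() macro from
the core patch; the bh/sb/block variables are made up here, and
sb_bread() is just the normal synchronous bread:

    /* Hypothetical example: force a synchronous metadata read even
     * when running in async retry context.  do_sync_op() saves and
     * clears current->io_wait around the operation.
     */
    do_sync_op(bh = sb_bread(sb, block));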

Regards
Suparna

-- 
Suparna Bhattacharya (suparna@in.ibm.com)
Linux Technology Center
IBM Software Labs, India



* [PATCH 1/7] Filesystem AIO rdwr - aio retry core changes
  2003-04-24  4:52 Filesystem AIO read-write patches Suparna Bhattacharya
@ 2003-04-24  5:00 ` Suparna Bhattacharya
  2003-04-24  5:04 ` [PATCH 2/7] Filesystem AIO rdwr - aio read Suparna Bhattacharya
                   ` (6 subsequent siblings)
  7 siblings, 0 replies; 11+ messages in thread
From: Suparna Bhattacharya @ 2003-04-24  5:00 UTC (permalink / raw)
  To: bcrl, akpm; +Cc: linux-kernel, linux-aio, linux-fsdevel

On Thu, Apr 24, 2003 at 10:22:22AM +0530, Suparna Bhattacharya wrote:
> Here is a revised version of the filesystem AIO patches
> for 2.5.68.
> 
> It is built on a variation of the simple retry based 
> scheme originally suggested by Ben LaHaise. 
> 
> 01aioretry.patch 	: Base aio retry infrastructure 
> 

 fs/aio.c                  |  371 ++++++++++++++++++++++++++++++++++++----------
 include/linux/aio.h       |   32 +++
 include/linux/errno.h     |    1 
 include/linux/init_task.h |    1 
 include/linux/sched.h     |    2 
 include/linux/wait.h      |    2 
 kernel/fork.c             |    9 -
 7 files changed, 337 insertions(+), 81 deletions(-)


diff -ur -X /home/kiran/dontdiff linux-2.5.68/fs/aio.c linux-aio-2568/fs/aio.c
--- linux-2.5.68/fs/aio.c	Fri Apr 11 21:10:48 2003
+++ linux-aio-2568/fs/aio.c	Wed Apr 23 17:15:30 2003
@@ -39,6 +39,9 @@
 #define dprintk(x...)	do { ; } while (0)
 #endif
 
+long aio_run = 0; /* for testing only */
+long aio_wakeups = 0; /* for testing only */
+
 /*------ sysctl variables----*/
 atomic_t aio_nr = ATOMIC_INIT(0);	/* current system wide number of aio requests */
 unsigned aio_max_nr = 0x10000;	/* system wide maximum number of aio requests */
@@ -281,6 +284,7 @@
 		struct kiocb *iocb = list_kiocb(pos);
 		list_del_init(&iocb->ki_list);
 		cancel = iocb->ki_cancel;
+		kiocbSetCancelled(iocb);
 		if (cancel) {
 			iocb->ki_users++;
 			spin_unlock_irq(&ctx->ctx_lock);
@@ -395,6 +400,7 @@
 	req->ki_cancel = NULL;
 	req->ki_retry = NULL;
 	req->ki_user_obj = NULL;
+	INIT_LIST_HEAD(&req->ki_run_list);
 
 	/* Check if the completion queue has enough free space to
 	 * accept an event from this io.
@@ -544,60 +550,147 @@
 	struct mm_struct *active_mm = current->active_mm;
 	atomic_inc(&mm->mm_count);
 	current->mm = mm;
-	if (mm != active_mm) {
-		current->active_mm = mm;
-		activate_mm(active_mm, mm);
-	}
+
+	current->active_mm = mm;
+	activate_mm(active_mm, mm);
+
 	mmdrop(active_mm);
 }
 
-static void unuse_mm(struct mm_struct *mm)
+void unuse_mm(struct mm_struct *mm)
 {
 	current->mm = NULL;
 	/* active_mm is still 'mm' */
 	enter_lazy_tlb(mm, current, smp_processor_id());
 }
 
-/* Run on kevent's context.  FIXME: needs to be per-cpu and warn if an
- * operation blocks.
- */
-static void aio_kick_handler(void *data)
+static inline int __queue_kicked_iocb(struct kiocb *iocb)
 {
-	struct kioctx *ctx = data;
+	struct kioctx	*ctx = iocb->ki_ctx;
 
-	use_mm(ctx->mm);
+	if (list_empty(&iocb->ki_run_list)) {
+		list_add_tail(&iocb->ki_run_list, 
+			&ctx->run_list);
+		iocb->ki_queued++;
+		return 1;
+	}
+	return 0;
+}
 
-	spin_lock_irq(&ctx->ctx_lock);
-	while (!list_empty(&ctx->run_list)) {
-		struct kiocb *iocb;
-		long ret;
+/* Expects to be called with iocb->ki_ctx->lock held */
+static ssize_t aio_run_iocb(struct kiocb *iocb)
+{
+	struct kioctx	*ctx = iocb->ki_ctx;
+	ssize_t (*retry)(struct kiocb *);
+	ssize_t ret;
 
-		iocb = list_entry(ctx->run_list.next, struct kiocb,
-				  ki_run_list);
-		list_del(&iocb->ki_run_list);
-		iocb->ki_users ++;
-		spin_unlock_irq(&ctx->ctx_lock);
+	if (iocb->ki_retried++ > 1024*1024) {
+		printk("Maximal retry count. Bytes done %d\n",
+			iocb->ki_nbytes - iocb->ki_left);
+		return -EAGAIN;
+	}
 
-		kiocbClearKicked(iocb);
-		ret = iocb->ki_retry(iocb);
+	if (!(iocb->ki_retried & 0xff)) {
+		printk("%ld retry: %d of %d (kick %ld, Q %ld run %ld, wake %ld)\n",
+			iocb->ki_retried, 
+			iocb->ki_nbytes - iocb->ki_left, iocb->ki_nbytes,
+			iocb->ki_kicked, iocb->ki_queued, aio_run, aio_wakeups);
+	}
+
+	if (!(retry = iocb->ki_retry)) {
+		printk("aio_run_iocb: iocb->ki_retry = NULL\n");
+		return 0;
+	}
+
+	iocb->ki_users ++;
+	kiocbClearKicked(iocb);
+	iocb->ki_run_list.next = iocb->ki_run_list.prev = NULL;
+	iocb->ki_retry = NULL;	
+	spin_unlock_irq(&ctx->ctx_lock);
+	
+	if (kiocbIsCancelled(iocb)) {
+		aio_complete(iocb, -EINTR, 0);
+		spin_lock_irq(&ctx->ctx_lock);
+		__aio_put_req(ctx, iocb);
+		return -EINTR;
+	}
+
+	BUG_ON(current->io_wait != NULL);
+	current->io_wait = &iocb->ki_wait;
+	ret = retry(iocb);
+	current->io_wait = NULL;
+
+	if (-EIOCBRETRY != ret) {
 		if (-EIOCBQUEUED != ret) {
+			BUG_ON(!list_empty(&iocb->ki_wait.task_list)); 
 			aio_complete(iocb, ret, 0);
-			iocb = NULL;
 		}
+	} else {
+		if (list_empty(&iocb->ki_wait.task_list)) 
+			kiocbSetKicked(iocb);
+	}
+	spin_lock_irq(&ctx->ctx_lock);
 
-		spin_lock_irq(&ctx->ctx_lock);
-		if (NULL != iocb)
-			__aio_put_req(ctx, iocb);
+	iocb->ki_retry = retry;
+	INIT_LIST_HEAD(&iocb->ki_run_list);
+	if (kiocbIsKicked(iocb)) {
+		BUG_ON(ret != -EIOCBRETRY);
+		__queue_kicked_iocb(iocb);
+	} 
+	__aio_put_req(ctx, iocb);
+	return ret;
+}
+
+static void aio_run_iocbs(struct kioctx *ctx)
+{
+	struct kiocb *iocb;
+	ssize_t ret;
+	int count = 0;
+
+	spin_lock_irq(&ctx->ctx_lock);
+	while (!list_empty(&ctx->run_list)) {
+		iocb = list_entry(ctx->run_list.next, struct kiocb,
+			ki_run_list);
+		list_del(&iocb->ki_run_list);
+		ret = aio_run_iocb(iocb);
+		count++;
 	}
 	spin_unlock_irq(&ctx->ctx_lock);
+	aio_run++;
+}
+
+/* Run on aiod/kevent's context.  FIXME: needs to be per-cpu and warn if an
+ * operation blocks.
+ */
+static void aio_kick_handler(void *data)
+{
+	struct kioctx *ctx = data;
 
+	use_mm(ctx->mm);
+	aio_run_iocbs(ctx);
 	unuse_mm(ctx->mm);
 }
 
-void kick_iocb(struct kiocb *iocb)
+
+void queue_kicked_iocb(struct kiocb *iocb)
 {
 	struct kioctx	*ctx = iocb->ki_ctx;
+	unsigned long flags;
+	int run = 0;
+
+	WARN_ON((!list_empty(&iocb->ki_wait.task_list)));
 
+	spin_lock_irqsave(&ctx->ctx_lock, flags);
+	run = __queue_kicked_iocb(iocb);
+	spin_unlock_irqrestore(&ctx->ctx_lock, flags);
+	if (run) {
+		queue_work(aio_wq, &ctx->wq);
+		aio_wakeups++;
+	}
+}
+
+void kick_iocb(struct kiocb *iocb)
+{
 	/* sync iocbs are easy: they can only ever be executing from a 
 	 * single context. */
 	if (is_sync_kiocb(iocb)) {
@@ -606,12 +699,9 @@
 		return;
 	}
 
+	iocb->ki_kicked++;
 	if (!kiocbTryKick(iocb)) {
-		unsigned long flags;
-		spin_lock_irqsave(&ctx->ctx_lock, flags);
-		list_add_tail(&iocb->ki_run_list, &ctx->run_list);
-		spin_unlock_irqrestore(&ctx->ctx_lock, flags);
-		schedule_work(&ctx->wq);
+		queue_kicked_iocb(iocb);
 	}
 }
 
@@ -664,6 +754,9 @@
 	 */
 	spin_lock_irqsave(&ctx->ctx_lock, flags);
 
+	if (iocb->ki_run_list.prev && !list_empty(&iocb->ki_run_list))
+		list_del_init(&iocb->ki_run_list);
+
 	ring = kmap_atomic(info->ring_pages[0], KM_IRQ1);
 
 	tail = info->tail;
@@ -693,6 +786,11 @@
 
 	pr_debug("added to ring %p at [%lu]\n", iocb, tail);
 
+	pr_debug("%ld retries: %d of %d (kicked %ld, Q %ld run %ld wake %ld)\n",
+		iocb->ki_retried, 
+		iocb->ki_nbytes - iocb->ki_left, iocb->ki_nbytes,
+		iocb->ki_kicked, iocb->ki_queued, aio_run, aio_wakeups);
+
 	/* everything turned out well, dispose of the aiocb. */
 	ret = __aio_put_req(ctx, iocb);
 
@@ -807,6 +905,7 @@
 	int			i = 0;
 	struct io_event		ent;
 	struct timeout		to;
+	int 			event_loop = 0; /* testing only */
 
 	/* needed to zero any padding within an entry (there shouldn't be 
 	 * any, but C is fun!
@@ -856,7 +955,6 @@
 		add_wait_queue_exclusive(&ctx->wait, &wait);
 		do {
 			set_task_state(tsk, TASK_INTERRUPTIBLE);
-
 			ret = aio_read_evt(ctx, &ent);
 			if (ret)
 				break;
@@ -866,6 +964,7 @@
 			if (to.timed_out)	/* Only check after read evt */
 				break;
 			schedule();
+			event_loop++;
 			if (signal_pending(tsk)) {
 				ret = -EINTR;
 				break;
@@ -893,6 +992,9 @@
 	if (timeout)
 		clear_timeout(&to);
 out:
+	pr_debug("event loop executed %d times\n", event_loop);
+	pr_debug("aio_run %ld\n", aio_run);
+	pr_debug("aio_wakeups %ld\n", aio_wakeups);
 	return i ? i : ret;
 }
 
@@ -984,6 +1086,143 @@
 	return -EINVAL;
 }
 
+ssize_t aio_pread(struct kiocb *iocb)
+{
+	struct file *file = iocb->ki_filp;
+	ssize_t ret = 0;
+
+	ret = file->f_op->aio_read(iocb, iocb->ki_buf,
+		iocb->ki_left, iocb->ki_pos);
+
+	/*
+	 * Can't just depend on iocb->ki_left to determine 
+	 * whether we are done. This may have been a short read.
+	 */
+	if (ret > 0) {
+		iocb->ki_buf += ret;
+		iocb->ki_left -= ret;
+
+		ret = -EIOCBRETRY;
+	}
+
+	/* This means we must have transferred all that we could */
+	/* No need to retry anymore */
+	if ((ret == 0) || (iocb->ki_left == 0)) 
+		ret = iocb->ki_nbytes - iocb->ki_left;
+
+	return ret;
+}
+
+ssize_t aio_pwrite(struct kiocb *iocb)
+{
+	struct file *file = iocb->ki_filp;
+	ssize_t ret = 0;
+
+	ret = file->f_op->aio_write(iocb, iocb->ki_buf,
+		iocb->ki_left, iocb->ki_pos);
+
+	/* 
+	 * TBD: Even if iocb->ki_left = 0, could we need to 
+	 * wait for data to be sync'd ? Or can we assume
+	 * that aio_fdsync/aio_fsync would be called explicitly
+	 * as required.
+	 */
+	if (ret > 0) {
+		iocb->ki_buf += ret;
+		iocb->ki_left -= ret;
+
+		ret = -EIOCBRETRY;
+	}
+
+	/* This means we must have transferred all that we could */
+	/* No need to retry anymore */
+	if (ret == 0) 
+		ret = iocb->ki_nbytes - iocb->ki_left;
+
+	return ret;
+}
+
+ssize_t aio_fdsync(struct kiocb *iocb)
+{
+	struct file *file = iocb->ki_filp;
+	ssize_t ret = -EINVAL;
+
+	if (file->f_op->aio_fsync)
+		ret = file->f_op->aio_fsync(iocb, 1);
+	return ret;
+}
+	
+ssize_t aio_fsync(struct kiocb *iocb)
+{
+	struct file *file = iocb->ki_filp;
+	ssize_t ret = -EINVAL;
+
+	if (file->f_op->aio_fsync)
+		ret = file->f_op->aio_fsync(iocb, 0);
+	return ret;
+}
+	
+/* Called during initial submission and subsequent retry operations */
+ssize_t aio_setup_iocb(struct kiocb *iocb)
+{
+	struct file *file = iocb->ki_filp;
+	ssize_t ret = 0;
+	
+	switch (iocb->ki_opcode) {
+	case IOCB_CMD_PREAD:
+		ret = -EBADF;
+		if (unlikely(!(file->f_mode & FMODE_READ)))
+			break;
+		ret = -EFAULT;
+		if (unlikely(!access_ok(VERIFY_WRITE, iocb->ki_buf, 
+			iocb->ki_left)))
+			break;
+		ret = -EINVAL;
+		if (file->f_op->aio_read)
+			iocb->ki_retry = aio_pread;
+		break;
+	case IOCB_CMD_PWRITE:
+		ret = -EBADF;
+		if (unlikely(!(file->f_mode & FMODE_WRITE)))
+			break;
+		ret = -EFAULT;
+		if (unlikely(!access_ok(VERIFY_READ, iocb->ki_buf, 
+			iocb->ki_left)))
+			break;
+		ret = -EINVAL;
+		if (file->f_op->aio_write)
+			iocb->ki_retry = aio_pwrite;
+		break;
+	case IOCB_CMD_FDSYNC:
+		ret = -EINVAL;
+		if (file->f_op->aio_fsync)
+			iocb->ki_retry = aio_fdsync;
+		break;
+	case IOCB_CMD_FSYNC:
+		ret = -EINVAL;
+		if (file->f_op->aio_fsync)
+			iocb->ki_retry = aio_fsync;
+		break;
+	default:
+		dprintk("EINVAL: io_submit: no operation provided\n");
+		ret = -EINVAL;
+	}
+
+	if (!iocb->ki_retry)
+		return ret;
+
+	return 0;
+}
+
+int aio_wake_function(wait_queue_t *wait, unsigned mode, int sync)
+{
+	struct kiocb *iocb = container_of(wait, struct kiocb, ki_wait);
+
+	list_del_init(&wait->task_list);
+	kick_iocb(iocb);
+	return 1;
+}
+
 int FASTCALL(io_submit_one(struct kioctx *ctx, struct iocb *user_iocb,
 				  struct iocb *iocb));
 int io_submit_one(struct kioctx *ctx, struct iocb *user_iocb,
@@ -992,7 +1231,6 @@
 	struct kiocb *req;
 	struct file *file;
 	ssize_t ret;
-	char *buf;
 
 	/* enforce forwards compatibility on users */
 	if (unlikely(iocb->aio_reserved1 || iocb->aio_reserved2 ||
@@ -1033,51 +1271,31 @@
 	req->ki_user_data = iocb->aio_data;
 	req->ki_pos = iocb->aio_offset;
 
-	buf = (char *)(unsigned long)iocb->aio_buf;
+	req->ki_buf = (char *)(unsigned long)iocb->aio_buf;
+	req->ki_left = req->ki_nbytes = iocb->aio_nbytes;
+	req->ki_opcode = iocb->aio_lio_opcode;
+	init_waitqueue_func_entry(&req->ki_wait, aio_wake_function);
+	INIT_LIST_HEAD(&req->ki_wait.task_list);
+	req->ki_run_list.next = req->ki_run_list.prev = NULL;
+	req->ki_retry = NULL;
+	req->ki_retried = 0;
+	req->ki_kicked = 0;
+	req->ki_queued = 0;
+	aio_run = 0;
+	aio_wakeups = 0;
 
-	switch (iocb->aio_lio_opcode) {
-	case IOCB_CMD_PREAD:
-		ret = -EBADF;
-		if (unlikely(!(file->f_mode & FMODE_READ)))
-			goto out_put_req;
-		ret = -EFAULT;
-		if (unlikely(!access_ok(VERIFY_WRITE, buf, iocb->aio_nbytes)))
-			goto out_put_req;
-		ret = -EINVAL;
-		if (file->f_op->aio_read)
-			ret = file->f_op->aio_read(req, buf,
-					iocb->aio_nbytes, req->ki_pos);
-		break;
-	case IOCB_CMD_PWRITE:
-		ret = -EBADF;
-		if (unlikely(!(file->f_mode & FMODE_WRITE)))
-			goto out_put_req;
-		ret = -EFAULT;
-		if (unlikely(!access_ok(VERIFY_READ, buf, iocb->aio_nbytes)))
-			goto out_put_req;
-		ret = -EINVAL;
-		if (file->f_op->aio_write)
-			ret = file->f_op->aio_write(req, buf,
-					iocb->aio_nbytes, req->ki_pos);
-		break;
-	case IOCB_CMD_FDSYNC:
-		ret = -EINVAL;
-		if (file->f_op->aio_fsync)
-			ret = file->f_op->aio_fsync(req, 1);
-		break;
-	case IOCB_CMD_FSYNC:
-		ret = -EINVAL;
-		if (file->f_op->aio_fsync)
-			ret = file->f_op->aio_fsync(req, 0);
-		break;
-	default:
-		dprintk("EINVAL: io_submit: no operation provided\n");
-		ret = -EINVAL;
-	}
+	ret = aio_setup_iocb(req);
+
+	if ((-EBADF == ret) || (-EFAULT == ret))
+		goto out_put_req;
+
+	spin_lock_irq(&ctx->ctx_lock);
+	ret = aio_run_iocb(req);
+	spin_unlock_irq(&ctx->ctx_lock);
+
+	if (-EIOCBRETRY == ret)
+		queue_work(aio_wq, &ctx->wq);
 
-	if (likely(-EIOCBQUEUED == ret))
-		return 0;
-	aio_complete(req, ret, 0);
 	return 0;
 
 out_put_req:
diff -ur -X /home/kiran/dontdiff linux-2.5.68/include/linux/aio.h linux-aio-2568/include/linux/aio.h
--- linux-2.5.68/include/linux/aio.h	Fri Apr 11 21:10:50 2003
+++ linux-aio-2568/include/linux/aio.h	Thu Apr 17 18:03:53 2003
@@ -54,7 +54,7 @@
 	struct file		*ki_filp;
 	struct kioctx		*ki_ctx;	/* may be NULL for sync ops */
 	int			(*ki_cancel)(struct kiocb *, struct io_event *);
-	long			(*ki_retry)(struct kiocb *);
+	ssize_t			(*ki_retry)(struct kiocb *);
 
 	struct list_head	ki_list;	/* the aio core uses this
 						 * for cancellation */
@@ -62,6 +62,16 @@
 	void			*ki_user_obj;	/* pointer to userland's iocb */
 	__u64			ki_user_data;	/* user's data for completion */
 	loff_t			ki_pos;
+	
+	/* State that we remember to be able to restart/retry  */
+	unsigned short		ki_opcode;
+	size_t			ki_nbytes; 	/* copy of iocb->aio_nbytes */
+	char 			*ki_buf;	/* remaining iocb->aio_buf */
+	size_t			ki_left; 	/* remaining bytes */
+	wait_queue_t		ki_wait;
+	long			ki_retried; 	/* just for testing */
+	long			ki_kicked; 	/* just for testing */
+	long			ki_queued; 	/* just for testing */
 
 	char			private[KIOCB_PRIVATE_SIZE];
 };
@@ -77,6 +87,8 @@
 		(x)->ki_ctx = &tsk->active_mm->default_kioctx;	\
 		(x)->ki_cancel = NULL;			\
 		(x)->ki_user_obj = tsk;			\
+		(x)->ki_user_data = 0;			\
+		init_wait((&(x)->ki_wait));		\
 	} while (0)
 
 #define AIO_RING_MAGIC			0xa10a10a1
@@ -156,6 +168,24 @@
 #define get_ioctx(kioctx)	do { if (unlikely(atomic_read(&(kioctx)->users) <= 0)) BUG(); atomic_inc(&(kioctx)->users); } while (0)
 #define put_ioctx(kioctx)	do { if (unlikely(atomic_dec_and_test(&(kioctx)->users))) __put_ioctx(kioctx); else if (unlikely(atomic_read(&(kioctx)->users) < 0)) BUG(); } while (0)
 
+#define in_aio() !is_sync_wait(current->io_wait)
+
+/* when sync behaviour is desired even if running in async context */
+#define do_sync_op(op)		if (in_aio()) { \
+	wait_queue_t *wait = current->io_wait; \
+	current->io_wait = NULL; \
+	op; \
+	current->io_wait = wait; \
+} else { \
+	op; \
+}	
+
+#define warn_if_async()	if (in_aio()) {\
+	printk(KERN_ERR "%s(%s:%d) called in async context!\n", \
+	__FUNCTION__, __FILE__, __LINE__); \
+	dump_stack(); \
+	}
+
 #include <linux/aio_abi.h>
 
 static inline struct kiocb *list_kiocb(struct list_head *h)
diff -ur -X /home/kiran/dontdiff linux-2.5.68/include/linux/errno.h linux-aio-2568/include/linux/errno.h
--- linux-2.5.68/include/linux/errno.h	Tue Mar 25 03:31:17 2003
+++ linux-aio-2568/include/linux/errno.h	Wed Apr 16 16:41:23 2003
@@ -22,6 +22,7 @@
 #define EBADTYPE	527	/* Type not supported by server */
 #define EJUKEBOX	528	/* Request initiated, but will not complete before timeout */
 #define EIOCBQUEUED	529	/* iocb queued, will get completion event */
+#define EIOCBRETRY	530	/* iocb queued, will trigger a retry */
 
 #endif
 
diff -ur -X /home/kiran/dontdiff linux-2.5.68/include/linux/init_task.h linux-aio-2568/include/linux/init_task.h
--- linux-2.5.68/include/linux/init_task.h	Mon Apr 21 23:30:39 2003
+++ linux-aio-2568/include/linux/init_task.h	Mon Apr 21 17:26:58 2003
@@ -103,6 +103,7 @@
 	.alloc_lock	= SPIN_LOCK_UNLOCKED,				\
 	.switch_lock	= SPIN_LOCK_UNLOCKED,				\
 	.journal_info	= NULL,						\
+	.io_wait	= NULL,						\
 }
 
 
diff -ur -X /home/kiran/dontdiff linux-2.5.68/include/linux/sched.h linux-aio-2568/include/linux/sched.h
--- linux-2.5.68/include/linux/sched.h	Mon Apr 21 23:30:40 2003
+++ linux-aio-2568/include/linux/sched.h	Mon Apr 21 17:26:58 2003
@@ -438,6 +438,8 @@
 
 	unsigned long ptrace_message;
 	siginfo_t *last_siginfo; /* For ptrace use.  */
+/* current io wait handle */
+	wait_queue_t *io_wait;
 };
 
 extern void __put_task_struct(struct task_struct *tsk);
diff -ur -X /home/kiran/dontdiff linux-2.5.68/include/linux/wait.h linux-aio-2568/include/linux/wait.h
--- linux-2.5.68/include/linux/wait.h	Tue Mar 25 03:30:44 2003
+++ linux-aio-2568/include/linux/wait.h	Tue Apr 22 00:07:42 2003
@@ -80,6 +80,8 @@
 	return !list_empty(&q->task_list);
 }
 
+#define is_sync_wait(wait)	(!(wait) || ((wait)->task))
+
 extern void FASTCALL(add_wait_queue(wait_queue_head_t *q, wait_queue_t * wait));
 extern void FASTCALL(add_wait_queue_exclusive(wait_queue_head_t *q, wait_queue_t * wait));
 extern void FASTCALL(remove_wait_queue(wait_queue_head_t *q, wait_queue_t * wait));
diff -ur -X /home/kiran/dontdiff linux-2.5.68/kernel/fork.c linux-aio-2568/kernel/fork.c
--- linux-2.5.68/kernel/fork.c	Mon Apr 21 23:30:40 2003
+++ linux-aio-2568/kernel/fork.c	Mon Apr 21 17:26:58 2003
@@ -139,8 +139,9 @@
 void prepare_to_wait(wait_queue_head_t *q, wait_queue_t *wait, int state)
 {
 	unsigned long flags;
-
-	__set_current_state(state);
+	
+	if (is_sync_wait(wait))
+		__set_current_state(state);
 	wait->flags &= ~WQ_FLAG_EXCLUSIVE;
 	spin_lock_irqsave(&q->lock, flags);
 	if (list_empty(&wait->task_list))
@@ -153,7 +154,8 @@
 {
 	unsigned long flags;
 
-	__set_current_state(state);
+	if (is_sync_wait(wait))
+		__set_current_state(state);
 	wait->flags |= WQ_FLAG_EXCLUSIVE;
 	spin_lock_irqsave(&q->lock, flags);
 	if (list_empty(&wait->task_list))
@@ -856,6 +858,7 @@
 	p->lock_depth = -1;		/* -1 = no lock */
 	p->start_time = get_jiffies_64();
 	p->security = NULL;
+	p->io_wait = NULL;
 
 	retval = -ENOMEM;
 	if (security_task_alloc(p))

-- 
Suparna Bhattacharya (suparna@in.ibm.com)
Linux Technology Center
IBM Software Labs, India



* [PATCH 2/7] Filesystem AIO rdwr - aio read
  2003-04-24  4:52 Filesystem AIO read-write patches Suparna Bhattacharya
  2003-04-24  5:00 ` [PATCH 1/7] Filesystem AIO rdwr - aio retry core changes Suparna Bhattacharya
@ 2003-04-24  5:04 ` Suparna Bhattacharya
  2003-04-24  5:11 ` Filesystem AIO read-write patches Suparna Bhattacharya
                   ` (5 subsequent siblings)
  7 siblings, 0 replies; 11+ messages in thread
From: Suparna Bhattacharya @ 2003-04-24  5:04 UTC (permalink / raw)
  To: bcrl, akpm; +Cc: linux-kernel, linux-aio, linux-fsdevel

On Thu, Apr 24, 2003 at 10:22:22AM +0530, Suparna Bhattacharya wrote:
> Here is a revised version of the filesystem AIO patches
> for 2.5.68.
> 
> 02aiord.patch	 	: Filesystem aio read 

Changes to generic_file_aio_read code to use the aio
retry infrastructure. The main blocking points addressed
are lock_page and wait_on_page_locked.

Regards
Suparna
-- 
Suparna Bhattacharya (suparna@in.ibm.com)
Linux Technology Center
IBM Software Labs, India

02aiord.patch
.............
 include/linux/pagemap.h |   19 ++++++++++++++
 mm/filemap.c            |   63 ++++++++++++++++++++++++++++++++++++++++--------
 2 files changed, 72 insertions(+), 10 deletions(-)

diff -ur -X /home/kiran/dontdiff linux-2.5.68/include/linux/pagemap.h linux-aio-2568/include/linux/pagemap.h
--- linux-2.5.68/include/linux/pagemap.h	Mon Apr 21 23:30:40 2003
+++ linux-aio-2568/include/linux/pagemap.h	Mon Apr 21 17:26:58 2003
@@ -135,6 +135,16 @@
 	if (TestSetPageLocked(page))
 		__lock_page(page);
 }
+
+extern int FASTCALL(__lock_page_wq(struct page *page, wait_queue_t *wait));
+static inline int lock_page_wq(struct page *page, wait_queue_t *wait)
+{
+	if (TestSetPageLocked(page))
+		return __lock_page_wq(page, wait);
+	else
+		return 0;
+}
+
 	
 /*
  * This is exported only for wait_on_page_locked/wait_on_page_writeback.
@@ -155,6 +165,15 @@
 		wait_on_page_bit(page, PG_locked);
 }
 
+extern int FASTCALL(wait_on_page_bit_wq(struct page *page, int bit_nr, 
+	wait_queue_t *wait));
+static inline int wait_on_page_locked_wq(struct page *page, wait_queue_t *wait)
+{
+	if (PageLocked(page))
+		return wait_on_page_bit_wq(page, PG_locked, wait);
+	return 0;
+}
+
 /* 
  * Wait for a page to complete writeback
  */
diff -ur -X /home/kiran/dontdiff linux-2.5.68/mm/filemap.c linux-aio-2568/mm/filemap.c
--- linux-2.5.68/mm/filemap.c	Mon Apr 21 23:30:40 2003
+++ linux-aio-2568/mm/filemap.c	Tue Apr 22 00:28:55 2003
@@ -268,19 +268,32 @@
 	return &zone->wait_table[hash_ptr(page, zone->wait_table_bits)];
 }
 
-void wait_on_page_bit(struct page *page, int bit_nr)
+int wait_on_page_bit_wq(struct page *page, int bit_nr, wait_queue_t *wait)
 {
 	wait_queue_head_t *waitqueue = page_waitqueue(page);
-	DEFINE_WAIT(wait);
+	DEFINE_WAIT(local_wait);
 
+	if (!wait)
+		wait = &local_wait;
+		
 	do {
-		prepare_to_wait(waitqueue, &wait, TASK_UNINTERRUPTIBLE);
+		prepare_to_wait(waitqueue, wait, TASK_UNINTERRUPTIBLE);
 		if (test_bit(bit_nr, &page->flags)) {
 			sync_page(page);
+			if (!is_sync_wait(wait))
+				return -EIOCBRETRY;
 			io_schedule();
 		}
 	} while (test_bit(bit_nr, &page->flags));
-	finish_wait(waitqueue, &wait);
+	finish_wait(waitqueue, wait);
+
+	return 0;
+}
+EXPORT_SYMBOL(wait_on_page_bit_wq);
+
+void wait_on_page_bit(struct page *page, int bit_nr)
+{
+	wait_on_page_bit_wq(page, bit_nr, NULL);
 }
 EXPORT_SYMBOL(wait_on_page_bit);
 
@@ -336,19 +349,31 @@
  * chances are that on the second loop, the block layer's plug list is empty,
  * so sync_page() will then return in state TASK_UNINTERRUPTIBLE.
  */
-void __lock_page(struct page *page)
+int __lock_page_wq(struct page *page, wait_queue_t *wait)
 {
 	wait_queue_head_t *wqh = page_waitqueue(page);
-	DEFINE_WAIT(wait);
+	DEFINE_WAIT(local_wait);
 
+	if (!wait)
+		wait = &local_wait;
+		
 	while (TestSetPageLocked(page)) {
-		prepare_to_wait(wqh, &wait, TASK_UNINTERRUPTIBLE);
+		prepare_to_wait(wqh, wait, TASK_UNINTERRUPTIBLE);
 		if (PageLocked(page)) {
 			sync_page(page);
+			if (!is_sync_wait(wait))
+				return -EIOCBRETRY;
 			io_schedule();
 		}
 	}
-	finish_wait(wqh, &wait);
+	finish_wait(wqh, wait);
+	return 0;
+}
+EXPORT_SYMBOL(__lock_page_wq);
+
+void __lock_page(struct page *page)
+{
+	__lock_page_wq(page, NULL);	
 }
 EXPORT_SYMBOL(__lock_page);
 
@@ -621,7 +655,13 @@
 			goto page_ok;
 
 		/* Get exclusive access to the page ... */
-		lock_page(page);
+		
+		if (lock_page_wq(page, current->io_wait)) {
+			pr_debug("queued lock page \n");
+			error = -EIOCBRETRY;
+			/* TBD: should we hold on to the cached page ? */
+			goto sync_error;
+		}
 
 		/* Did it get unhashed before we got the lock? */
 		if (!page->mapping) {
@@ -643,12 +683,19 @@
 		if (!error) {
 			if (PageUptodate(page))
 				goto page_ok;
-			wait_on_page_locked(page);
+			if (wait_on_page_locked_wq(page, current->io_wait)) {
+				pr_debug("queued wait_on_page \n");
+				error = -EIOCBRETRY;
+				/*TBD:should we hold on to the cached page ?*/
+				goto sync_error;
+			}
+			
 			if (PageUptodate(page))
 				goto page_ok;
 			error = -EIO;
 		}
 
+sync_error:
 		/* UHHUH! A synchronous read error occurred. Report it */
 		desc->error = error;
 		page_cache_release(page);
@@ -819,6 +866,10 @@
 	struct kiocb kiocb;
 	ssize_t ret;
 
+	if (current->io_wait != NULL) {
+		printk("current->io_wait != NULL\n");
+		dump_stack();
+	}
 	init_sync_kiocb(&kiocb, filp);
 	ret = __generic_file_aio_read(&kiocb, &local_iov, 1, ppos);
 	if (-EIOCBQUEUED == ret)
@@ -851,6 +902,7 @@
 {
 	read_descriptor_t desc;
 
+	BUG_ON(current->io_wait != NULL);
 	if (!count)
 		return 0;
 


* Re: Filesystem AIO read-write patches
  2003-04-24  4:52 Filesystem AIO read-write patches Suparna Bhattacharya
  2003-04-24  5:00 ` [PATCH 1/7] Filesystem AIO rdwr - aio retry core changes Suparna Bhattacharya
  2003-04-24  5:04 ` [PATCH 2/7] Filesystem AIO rdwr - aio read Suparna Bhattacharya
@ 2003-04-24  5:11 ` Suparna Bhattacharya
  2003-04-24 12:48   ` Matthew Wilcox
  2003-04-24  5:16 ` [PATCH 4/7] Filesystem AIO rdwr - async down (x86) Suparna Bhattacharya
                   ` (4 subsequent siblings)
  7 siblings, 1 reply; 11+ messages in thread
From: Suparna Bhattacharya @ 2003-04-24  5:11 UTC (permalink / raw)
  To: bcrl, akpm; +Cc: linux-kernel, linux-aio, linux-fsdevel

On Thu, Apr 24, 2003 at 10:22:22AM +0530, Suparna Bhattacharya wrote:
> Here is a revised version of the filesystem AIO patches
> for 2.5.68.
> 
> 03aiowr.patch	 	: Minimal filesystem aio write 
> 			(for all archs and all filesystems 
> 			 using generic_file_aio_write)

Changes to enable generic_file_aio_write to use the aio 
retry infrastructure. The blocking points addressed are
in find_lock_page and blk_congestion_wait.

(Async down and async get block are available as optional
follow on patches)

Regards
Suparna

-- 
Suparna Bhattacharya (suparna@in.ibm.com)
Linux Technology Center
IBM Software Labs, India

03aiowr.patch
..............
 drivers/block/ll_rw_blk.c |   21 +++++++++++++++++----
 include/linux/blkdev.h    |    1 +
 include/linux/writeback.h |    4 ++--
 mm/filemap.c              |   31 ++++++++++++++++++++++++++-----
 mm/page-writeback.c       |   17 ++++++++++++-----
 5 files changed, 58 insertions(+), 16 deletions(-)

diff -ur -X /home/kiran/dontdiff linux-2.5.68/drivers/block/ll_rw_blk.c linux-aio-2568/drivers/block/ll_rw_blk.c
--- linux-2.5.68/drivers/block/ll_rw_blk.c	Fri Apr 11 21:10:45 2003
+++ linux-aio-2568/drivers/block/ll_rw_blk.c	Wed Apr 16 23:47:25 2003
@@ -1564,17 +1564,30 @@
  * If no queues are congested then just wait for the next request to be
  * returned.
  */
-void blk_congestion_wait(int rw, long timeout)
+int blk_congestion_wait_wq(int rw, long timeout, wait_queue_t *wait)
 {
-	DEFINE_WAIT(wait);
 	wait_queue_head_t *wqh = &congestion_wqh[rw];
+	DEFINE_WAIT(local_wait);
+
+	if (!wait)
+		wait = &local_wait;
 
 	blk_run_queues();
-	prepare_to_wait(wqh, &wait, TASK_UNINTERRUPTIBLE);
+	prepare_to_wait(wqh, wait, TASK_UNINTERRUPTIBLE);
+	if (!is_sync_wait(wait)) 
+		return -EIOCBRETRY;
+
 	io_schedule_timeout(timeout);
-	finish_wait(wqh, &wait);
+	finish_wait(wqh, wait);
+	return 0;
+}
+
+void blk_congestion_wait(int rw, long timeout)
+{
+	blk_congestion_wait_wq(rw, timeout, NULL);
 }
 
+
 /*
  * Has to be called with the request spinlock acquired
  */
diff -ur -X /home/kiran/dontdiff linux-2.5.68/include/linux/blkdev.h linux-aio-2568/include/linux/blkdev.h
--- linux-2.5.68/include/linux/blkdev.h	Fri Apr 11 21:10:51 2003
+++ linux-aio-2568/include/linux/blkdev.h	Wed Apr 16 16:31:12 2003
@@ -390,6 +390,7 @@
 extern void blk_queue_free_tags(request_queue_t *);
 extern void blk_queue_invalidate_tags(request_queue_t *);
 extern void blk_congestion_wait(int rw, long timeout);
+extern int blk_congestion_wait_wq(int rw, long timeout, wait_queue_t *wait);
 
 #define MAX_PHYS_SEGMENTS 128
 #define MAX_HW_SEGMENTS 128
diff -ur -X /home/kiran/dontdiff linux-2.5.68/include/linux/writeback.h linux-aio-2568/include/linux/writeback.h
--- linux-2.5.68/include/linux/writeback.h	Tue Mar 25 03:30:01 2003
+++ linux-aio-2568/include/linux/writeback.h	Wed Apr 16 16:31:12 2003
@@ -80,8 +80,8 @@
 
 
 void page_writeback_init(void);
-void balance_dirty_pages(struct address_space *mapping);
-void balance_dirty_pages_ratelimited(struct address_space *mapping);
+int balance_dirty_pages(struct address_space *mapping);
+int balance_dirty_pages_ratelimited(struct address_space *mapping);
 int pdflush_operation(void (*fn)(unsigned long), unsigned long arg0);
 int do_writepages(struct address_space *mapping, struct writeback_control *wbc);
 
diff -ur -X /home/kiran/dontdiff linux-2.5.68/mm/filemap.c linux-aio-2568/mm/filemap.c
--- linux-2.5.68/mm/filemap.c	Mon Apr 21 23:30:40 2003
+++ linux-aio-2568/mm/filemap.c	Tue Apr 22 00:28:55 2003
@@ -398,8 +423,8 @@
  *
  * Returns zero if the page was not present. find_lock_page() may sleep.
  */
-struct page *find_lock_page(struct address_space *mapping,
-				unsigned long offset)
+struct page *find_lock_page_wq(struct address_space *mapping,
+				unsigned long offset, wait_queue_t *wait)
 {
 	struct page *page;
 
@@ -410,7 +435,10 @@
 		page_cache_get(page);
 		if (TestSetPageLocked(page)) {
 			spin_unlock(&mapping->page_lock);
-			lock_page(page);
+			if (-EIOCBRETRY == lock_page_wq(page, wait)) {
+				page_cache_release(page);
+				return ERR_PTR(-EIOCBRETRY);
+			}
 			spin_lock(&mapping->page_lock);
 
 			/* Has the page been truncated while we slept? */
@@ -425,6 +453,12 @@
 	return page;
 }
 
+struct page *find_lock_page(struct address_space *mapping,
+				unsigned long offset)
+{
+	return find_lock_page_wq(mapping, offset, NULL);
+}
+
 /**
  * find_or_create_page - locate or add a pagecache page
  *
@@ -1373,7 +1425,9 @@
 	int err;
 	struct page *page;
 repeat:
-	page = find_lock_page(mapping, index);
+	page = find_lock_page_wq(mapping, index, current->io_wait);
+	if (IS_ERR(page))
+		return page;
 	if (!page) {
 		if (!*cached_page) {
 			*cached_page = page_cache_alloc(mapping);
@@ -1690,6 +1744,10 @@
 		fault_in_pages_readable(buf, bytes);
 
 		page = __grab_cache_page(mapping,index,&cached_page,&lru_pvec);
+		if (IS_ERR(page)) {
+			status = PTR_ERR(page);
+			break;
+		}
 		if (!page) {
 			status = -ENOMEM;
 			break;
@@ -1697,6 +1755,8 @@
 
 		status = a_ops->prepare_write(file, page, offset, offset+bytes);
 		if (unlikely(status)) {
+			if (-EIOCBRETRY == status)
+				pr_debug("queued prepare_write\n");
 			/*
 			 * prepare_write() may have instantiated a few blocks
 			 * outside i_size.  Trim these off again.
@@ -1737,7 +1797,11 @@
 		page_cache_release(page);
 		if (status < 0)
 			break;
-		balance_dirty_pages_ratelimited(mapping);
+		status = balance_dirty_pages_ratelimited(mapping);
+		if (status < 0) {
+			pr_debug("async balance_dirty_pages\n");
+			break;
+		}
 		cond_resched();
 	} while (count);
 	*ppos = pos;
diff -ur -X /home/kiran/dontdiff linux-2.5.68/mm/page-writeback.c linux-aio-2568/mm/page-writeback.c
--- linux-2.5.68/mm/page-writeback.c	Mon Apr 21 23:30:40 2003
+++ linux-aio-2568/mm/page-writeback.c	Tue Apr 22 01:31:35 2003
@@ -135,7 +135,7 @@
  * If we're over `background_thresh' then pdflush is woken to perform some
  * writeout.
  */
-void balance_dirty_pages(struct address_space *mapping)
+int balance_dirty_pages(struct address_space *mapping)
 {
 	struct page_state ps;
 	long nr_reclaimable;
@@ -152,6 +152,7 @@
 			.sync_mode	= WB_SYNC_NONE,
 			.older_than_this = NULL,
 			.nr_to_write	= write_chunk,
+			.nonblocking	= !is_sync_wait(current->io_wait)
 		};
 
 		get_dirty_limits(&ps, &background_thresh, &dirty_thresh);
@@ -178,7 +179,11 @@
 			if (pages_written >= write_chunk)
 				break;		/* We've done our duty */
 		}
-		blk_congestion_wait(WRITE, HZ/10);
+		if (-EIOCBRETRY == blk_congestion_wait_wq(WRITE, HZ/10,
+			current->io_wait)) {
+			pr_debug("async blk congestion wait\n");
+			return -EIOCBRETRY;
+		}
 	}
 
 	if (nr_reclaimable + ps.nr_writeback <= dirty_thresh)
@@ -186,6 +191,8 @@
 
 	if (!writeback_in_progress(bdi) && nr_reclaimable > background_thresh)
 		pdflush_operation(background_writeout, 0);
+
+	return 0;
 }
 
 /**
@@ -201,7 +208,7 @@
  * decrease the ratelimiting by a lot, to prevent individual processes from
  * overshooting the limit by (ratelimit_pages) each.
  */
-void balance_dirty_pages_ratelimited(struct address_space *mapping)
+int balance_dirty_pages_ratelimited(struct address_space *mapping)
 {
 	static DEFINE_PER_CPU(int, ratelimits) = 0;
 	int cpu;
@@ -215,10 +222,10 @@
 	if (per_cpu(ratelimits, cpu)++ >= ratelimit) {
 		per_cpu(ratelimits, cpu) = 0;
 		put_cpu();
-		balance_dirty_pages(mapping);
-		return;
+		return balance_dirty_pages(mapping);
 	}
 	put_cpu();
+	return 0;
 }
 EXPORT_SYMBOL_GPL(balance_dirty_pages_ratelimited);
 


* [PATCH 4/7] Filesystem AIO rdwr - async down (x86)
  2003-04-24  4:52 Filesystem AIO read-write patches Suparna Bhattacharya
                   ` (2 preceding siblings ...)
  2003-04-24  5:11 ` Filesystem AIO read-write patches Suparna Bhattacharya
@ 2003-04-24  5:16 ` Suparna Bhattacharya
  2003-04-24  5:19 ` [5/7] Filesystem AIO rdwr - use down_wq for aio write Suparna Bhattacharya
                   ` (3 subsequent siblings)
  7 siblings, 0 replies; 11+ messages in thread
From: Suparna Bhattacharya @ 2003-04-24  5:16 UTC (permalink / raw)
  To: bcrl, akpm; +Cc: linux-kernel, linux-aio, linux-fsdevel

On Thu, Apr 24, 2003 at 10:22:22AM +0530, Suparna Bhattacharya wrote:
> Here is a revised version of the filesystem AIO patches
> for 2.5.68.
> 
> 04down_wq-86.patch 	: An asynchronous semaphore down
> 		     	implementation (currently x86 
> 			only)

The semaphore wait seems to be embedded deep in arch-
specific code (I am not sure it really needed to be
arch specific, though the asm wrappers would still have
been needed for every arch). 

So I've done this for x86 only at present. It needs to be
extended to other archs (if we see good performance 
gains), or compat definitions put in, to avoid breaking 
any arch.
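
For reference, this is how the write path ends up using it (taken
from 05aiowrdown_wq.patch later in this series):

    /* From 05aiowrdown_wq.patch: async-aware acquisition of i_sem */
    if ((err = down_wq(&inode->i_sem, current->io_wait)))
            return err;   /* -EIOCBRETRY: retried once the semaphore is upped */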

04aiodown_wq-x86.patch
......................
 arch/i386/kernel/i386_ksyms.c |    2 +-
 arch/i386/kernel/semaphore.c  |   30 ++++++++++++++++++++----------
 include/asm-i386/semaphore.h  |   27 ++++++++++++++++++---------
 3 files changed, 39 insertions(+), 20 deletions(-)


diff -ur -X /home/kiran/dontdiff linux-2.5.68/arch/i386/kernel/i386_ksyms.c linux-aio-2568/arch/i386/kernel/i386_ksyms.c
--- linux-2.5.68/arch/i386/kernel/i386_ksyms.c	Mon Apr 21 23:30:26 2003
+++ linux-aio-2568/arch/i386/kernel/i386_ksyms.c	Mon Apr 21 17:26:55 2003
@@ -98,7 +98,7 @@
 EXPORT_SYMBOL(__io_virt_debug);
 #endif
 
-EXPORT_SYMBOL_NOVERS(__down_failed);
+EXPORT_SYMBOL_NOVERS(__down_failed_wq);
 EXPORT_SYMBOL_NOVERS(__down_failed_interruptible);
 EXPORT_SYMBOL_NOVERS(__down_failed_trylock);
 EXPORT_SYMBOL_NOVERS(__up_wakeup);
diff -ur -X /home/kiran/dontdiff linux-2.5.68/arch/i386/kernel/semaphore.c linux-aio-2568/arch/i386/kernel/semaphore.c
--- linux-2.5.68/arch/i386/kernel/semaphore.c	Tue Mar 25 03:30:20 2003
+++ linux-aio-2568/arch/i386/kernel/semaphore.c	Tue Apr 22 01:29:29 2003
@@ -15,6 +15,7 @@
 #include <linux/config.h>
 #include <linux/sched.h>
 #include <linux/err.h>
+#include <linux/errno.h>
 #include <asm/semaphore.h>
 
 /*
@@ -53,15 +54,20 @@
 	wake_up(&sem->wait);
 }
 
-void __down(struct semaphore * sem)
+int __down_wq(struct semaphore * sem, wait_queue_t *wait)
 {
 	struct task_struct *tsk = current;
-	DECLARE_WAITQUEUE(wait, tsk);
+	DECLARE_WAITQUEUE(local_wait, tsk);
 	unsigned long flags;
 
-	tsk->state = TASK_UNINTERRUPTIBLE;
+	if (is_sync_wait(wait))
+		tsk->state = TASK_UNINTERRUPTIBLE;
+	if (!wait) {
+		wait = &local_wait;
+	}
+
 	spin_lock_irqsave(&sem->wait.lock, flags);
-	add_wait_queue_exclusive_locked(&sem->wait, &wait);
+	add_wait_queue_exclusive_locked(&sem->wait, wait);
 
 	sem->sleepers++;
 	for (;;) {
@@ -79,17 +85,23 @@
 		sem->sleepers = 1;	/* us - see -1 above */
 		spin_unlock_irqrestore(&sem->wait.lock, flags);
 
+		if (!is_sync_wait(wait))
+			return -EIOCBRETRY;
+
 		schedule();
 
 		spin_lock_irqsave(&sem->wait.lock, flags);
 		tsk->state = TASK_UNINTERRUPTIBLE;
 	}
-	remove_wait_queue_locked(&sem->wait, &wait);
+	if (is_sync_wait(wait) || !list_empty(&wait->task_list))
+		remove_wait_queue_locked(&sem->wait, wait);
 	wake_up_locked(&sem->wait);
 	spin_unlock_irqrestore(&sem->wait.lock, flags);
 	tsk->state = TASK_RUNNING;
+	return 0;
 }
 
+
 int __down_interruptible(struct semaphore * sem)
 {
 	int retval = 0;
@@ -189,19 +201,17 @@
 asm(
 ".text\n"
 ".align 4\n"
-".globl __down_failed\n"
-"__down_failed:\n\t"
+".globl __down_failed_wq\n"
+"__down_failed_wq:\n\t"
 #if defined(CONFIG_FRAME_POINTER)
 	"pushl %ebp\n\t"
 	"movl  %esp,%ebp\n\t"
 #endif
-	"pushl %eax\n\t"
 	"pushl %edx\n\t"
 	"pushl %ecx\n\t"
-	"call __down\n\t"
+	"call __down_wq\n\t"
 	"popl %ecx\n\t"
 	"popl %edx\n\t"
-	"popl %eax\n\t"
 #if defined(CONFIG_FRAME_POINTER)
 	"movl %ebp,%esp\n\t"
 	"popl %ebp\n\t"
diff -ur -X /home/kiran/dontdiff linux-2.5.68/include/asm-i386/semaphore.h linux-aio-2568/include/asm-i386/semaphore.h
--- linux-2.5.68/include/asm-i386/semaphore.h	Tue Mar 25 03:30:11 2003
+++ linux-aio-2568/include/asm-i386/semaphore.h	Wed Apr 16 17:57:05 2003
@@ -96,39 +96,48 @@
 	sema_init(sem, 0);
 }
 
-asmlinkage void __down_failed(void /* special register calling convention */);
+asmlinkage int __down_failed_wq(void /* special register calling convention */);
 asmlinkage int  __down_failed_interruptible(void  /* params in registers */);
 asmlinkage int  __down_failed_trylock(void  /* params in registers */);
 asmlinkage void __up_wakeup(void /* special register calling convention */);
 
-asmlinkage void __down(struct semaphore * sem);
+asmlinkage int __down_wq(struct semaphore * sem, wait_queue_t *wait);
 asmlinkage int  __down_interruptible(struct semaphore * sem);
 asmlinkage int  __down_trylock(struct semaphore * sem);
 asmlinkage void __up(struct semaphore * sem);
 
 /*
  * This is ugly, but we want the default case to fall through.
- * "__down_failed" is a special asm handler that calls the C
+ * "__down_failed_wq" is a special asm handler that calls the C
  * routine that actually waits. See arch/i386/kernel/semaphore.c
  */
-static inline void down(struct semaphore * sem)
+static inline int down_wq(struct semaphore * sem, wait_queue_t *wait)
 {
+	int result;
+
 #ifdef WAITQUEUE_DEBUG
 	CHECK_MAGIC(sem->__magic);
 #endif
 	might_sleep();
 	__asm__ __volatile__(
 		"# atomic down operation\n\t"
-		LOCK "decl %0\n\t"     /* --sem->count */
-		"js 2f\n"
+		LOCK "decl %1\n\t"     /* --sem->count */
+		"js 2f\n\t"
+		"xorl %0,%0\n"
 		"1:\n"
 		LOCK_SECTION_START("")
-		"2:\tcall __down_failed\n\t"
+		"2:\tcall __down_failed_wq\n\t"
 		"jmp 1b\n"
 		LOCK_SECTION_END
-		:"=m" (sem->count)
-		:"c" (sem)
+		:"=a" (result), "=m" (sem->count)
+		:"c" (sem), "d" (wait)
 		:"memory");
+	return result;
+}
+
+static inline void down(struct semaphore * sem)
+{
+	down_wq(sem, NULL);	
 }
 
 /*

-- 
Suparna Bhattacharya (suparna@in.ibm.com)
Linux Technology Center
IBM Software Labs, India



* [5/7] Filesystem AIO rdwr - use down_wq for aio write
  2003-04-24  4:52 Filesystem AIO read-write patches Suparna Bhattacharya
                   ` (3 preceding siblings ...)
  2003-04-24  5:16 ` [PATCH 4/7] Filesystem AIO rdwr - async down (x86) Suparna Bhattacharya
@ 2003-04-24  5:19 ` Suparna Bhattacharya
  2003-04-24  5:21 ` [PATCH 6/7] Filesystem AIO rdwr - async bread Suparna Bhattacharya
                   ` (2 subsequent siblings)
  7 siblings, 0 replies; 11+ messages in thread
From: Suparna Bhattacharya @ 2003-04-24  5:19 UTC (permalink / raw)
  To: bcrl, akpm; +Cc: linux-kernel, linux-aio, linux-fsdevel

On Thu, Apr 24, 2003 at 10:22:22AM +0530, Suparna Bhattacharya wrote:
> Here is a revised version of the filesystem AIO patches
> for 2.5.68.
> 
> 05aiowrdown_wq.patch 	: Uses async down for aio write

Will currently compile only for x86, since down_wq
support isn't implemented on other archs.

Regards
Suparna

05aiowrdown_wq.patch
....................
 filemap.c |    3 ++-
 1 files changed, 2 insertions(+), 1 deletion(-)


diff -ur -X /home/kiran/dontdiff linux-2.5.68/mm/filemap.c linux-aio-2568/mm/filemap.c
--- linux-2.5.68/mm/filemap.c	Mon Apr 21 23:30:40 2003
+++ linux-aio-2568/mm/filemap.c	Tue Apr 22 00:28:55 2003
@@ -1786,7 +1851,8 @@
 
 	BUG_ON(iocb->ki_pos != pos);
 
-	down(&inode->i_sem);
+	if ((err = down_wq(&inode->i_sem, current->io_wait)))
+		return err;
 	err = generic_file_aio_write_nolock(iocb, &local_iov, 1, 
 						&iocb->ki_pos);
 	up(&inode->i_sem);

-- 
Suparna Bhattacharya (suparna@in.ibm.com)
Linux Technology Center
IBM Software Labs, India



* [PATCH 6/7] Filesystem AIO rdwr - async bread
  2003-04-24  4:52 Filesystem AIO read-write patches Suparna Bhattacharya
                   ` (4 preceding siblings ...)
  2003-04-24  5:19 ` [5/7] Filesystem AIO rdwr - use down_wq for aio write Suparna Bhattacharya
@ 2003-04-24  5:21 ` Suparna Bhattacharya
  2003-04-24  5:28 ` [PATCH 7/7] Filesystem AIO rdwr: async get block for ext2 Suparna Bhattacharya
  2003-04-24 13:13 ` Filesystem AIO read-write patches Benjamin LaHaise
  7 siblings, 0 replies; 11+ messages in thread
From: Suparna Bhattacharya @ 2003-04-24  5:21 UTC (permalink / raw)
  To: bcrl, akpm; +Cc: linux-kernel, linux-aio, linux-fsdevel

On Thu, Apr 24, 2003 at 10:22:22AM +0530, Suparna Bhattacharya wrote:
> Here is a revised version of the filesystem AIO patches
> for 2.5.68.
> 
> 06bread_wq.patch	: Async bread implementation
> 

Pre-req for the filesystem specific async get block 
patches.
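
Typical usage by the follow-on patch (a condensed sketch of what
07ext2getblk_wq.patch does in ext2_get_branch_wq()):

    /* Condensed from 07ext2getblk_wq.patch: metadata read that can
     * return an ERR_PTR (typically -EIOCBRETRY) instead of blocking */
    bh = sb_bread_wq(sb, le32_to_cpu(p->key), wait);
    if (!bh || IS_ERR(bh))
            goto failure;   /* failure maps IS_ERR(bh) to -EIOCBRETRY / -EIO */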

Regards
Suparna

06bread_wq.patch
................

 fs/buffer.c                 |   54 +++++++++++++++++++++++++++++++++++++-------
 include/linux/buffer_head.h |   30 ++++++++++++++++++++++--
 2 files changed, 73 insertions(+), 11 deletions(-)

diff -ur -X /home/kiran/dontdiff linux-2.5.68/fs/buffer.c linux-aio-2568/fs/buffer.c
--- linux-2.5.68/fs/buffer.c	Mon Apr 21 23:30:33 2003
+++ linux-aio-2568/fs/buffer.c	Mon Apr 21 17:26:58 2003
@@ -118,23 +118,35 @@
  * from becoming locked again - you have to lock it yourself
  * if you want to preserve its state.
  */
-void __wait_on_buffer(struct buffer_head * bh)
+int __wait_on_buffer_wq(struct buffer_head * bh, wait_queue_t *wait)
 {
 	wait_queue_head_t *wqh = bh_waitq_head(bh);
-	DEFINE_WAIT(wait);
+	DEFINE_WAIT(local_wait);
+
+	if (!wait)
+		wait = &local_wait;
 
 	if (atomic_read(&bh->b_count) == 0 &&
 			(!bh->b_page || !PageLocked(bh->b_page)))
 		buffer_error();
 
 	do {
-		prepare_to_wait(wqh, &wait, TASK_UNINTERRUPTIBLE);
+		prepare_to_wait(wqh, wait, TASK_UNINTERRUPTIBLE);
 		if (buffer_locked(bh)) {
 			blk_run_queues();
+			if (!is_sync_wait(wait)) {
+				return -EIOCBRETRY;
+			}
 			io_schedule();
 		}
 	} while (buffer_locked(bh));
-	finish_wait(wqh, &wait);
+	finish_wait(wqh, wait);
+	return 0;
+}
+
+void __wait_on_buffer(struct buffer_head * bh)
+{
+	__wait_on_buffer_wq(bh, NULL);
 }
 
 static void
@@ -1190,9 +1202,12 @@
 	__brelse(bh);
 }
 
-static struct buffer_head *__bread_slow(struct buffer_head *bh)
+static struct buffer_head *__bread_slow_wq(struct buffer_head *bh, 
+		wait_queue_t *wait)
 {
-	lock_buffer(bh);
+	if (-EIOCBRETRY == lock_buffer_wq(bh, wait))
+		return ERR_PTR(-EIOCBRETRY);
+
 	if (buffer_uptodate(bh)) {
 		unlock_buffer(bh);
 		return bh;
@@ -1202,7 +1217,8 @@
 		get_bh(bh);
 		bh->b_end_io = end_buffer_io_sync;
 		submit_bh(READ, bh);
-		wait_on_buffer(bh);
+		if (-EIOCBRETRY == wait_on_buffer_wq(bh, wait))
+			return ERR_PTR(-EIOCBRETRY);
 		if (buffer_uptodate(bh))
 			return bh;
 	}
@@ -1210,6 +1226,11 @@
 	return NULL;
 }
 
+static inline struct buffer_head *__bread_slow(struct buffer_head *bh)
+{
+	return __bread_slow_wq(bh, NULL);
+}
+	
 /*
  * Per-cpu buffer LRU implementation.  To reduce the cost of __find_get_block().
  * The bhs[] array is sorted - newest buffer is at bhs[0].  Buffers have their
@@ -1384,6 +1405,18 @@
 		bh = __bread_slow(bh);
 	return bh;
 }
+
+
+struct buffer_head *
+__bread_wq(struct block_device *bdev, sector_t block, int size,
+	wait_queue_t *wait)
+{
+	struct buffer_head *bh = __getblk(bdev, block, size);
+
+	if (!buffer_uptodate(bh))
+		bh = __bread_slow_wq(bh, wait);
+	return bh;
+}
 EXPORT_SYMBOL(__bread);
 
 /*
@@ -1822,8 +1855,11 @@
 			clear_buffer_new(bh);
 		if (!buffer_mapped(bh)) {
 			err = get_block(inode, block, bh, 1);
-			if (err)
+			if (err) {
+				if (-EIOCBRETRY == err)
+					pr_debug("get_block queued\n");
 				goto out;
+			}
 			if (buffer_new(bh)) {
 				clear_buffer_new(bh);
 				unmap_underlying_metadata(bh->b_bdev,
@@ -1865,6 +1901,8 @@
 	 * If we issued read requests - let them complete.
 	 */
 	while(wait_bh > wait) {
+		if (!is_sync_wait(current->io_wait))
+			printk("block_prepare_write: wait on buffer\n");
 		wait_on_buffer(*--wait_bh);
 		if (!buffer_uptodate(*wait_bh))
 			return -EIO;
diff -ur -X /home/kiran/dontdiff linux-2.5.68/include/linux/buffer_head.h linux-aio-2568/include/linux/buffer_head.h
--- linux-2.5.68/include/linux/buffer_head.h	Fri Apr 11 21:10:51 2003
+++ linux-aio-2568/include/linux/buffer_head.h	Wed Apr 16 16:31:12 2003
@@ -162,6 +162,7 @@
 void __invalidate_buffers(kdev_t dev, int);
 int sync_blockdev(struct block_device *bdev);
 void __wait_on_buffer(struct buffer_head *);
+int __wait_on_buffer_wq(struct buffer_head *, wait_queue_t *wait);
 wait_queue_head_t *bh_waitq_head(struct buffer_head *bh);
 void wake_up_buffer(struct buffer_head *bh);
 int fsync_bdev(struct block_device *);
@@ -172,6 +173,8 @@
 void __brelse(struct buffer_head *);
 void __bforget(struct buffer_head *);
 struct buffer_head *__bread(struct block_device *, sector_t block, int size);
+struct buffer_head *__bread_wq(struct block_device *, sector_t block, 
+	int size, wait_queue_t *wait);
 struct buffer_head *alloc_buffer_head(void);
 void free_buffer_head(struct buffer_head * bh);
 void FASTCALL(unlock_buffer(struct buffer_head *bh));
@@ -229,13 +232,13 @@
 
 static inline void brelse(struct buffer_head *bh)
 {
-	if (bh)
+	if (bh && !IS_ERR(bh))
 		__brelse(bh);
 }
 
 static inline void bforget(struct buffer_head *bh)
 {
-	if (bh)
+	if (bh && !IS_ERR(bh))
 		__bforget(bh);
 }
 
@@ -246,7 +249,12 @@
 }
 
 static inline struct buffer_head *
-sb_getblk(struct super_block *sb, sector_t block)
+sb_bread_wq(struct super_block *sb, sector_t block, wait_queue_t *wait)
+{
+	return __bread_wq(sb->s_bdev, block, sb->s_blocksize, wait);
+}
+
+static inline struct buffer_head *sb_getblk(struct super_block *sb, sector_t block)
 {
 	return __getblk(sb->s_bdev, block, sb->s_blocksize);
 }
@@ -276,10 +284,26 @@
 		__wait_on_buffer(bh);
 }
 
+static inline int wait_on_buffer_wq(struct buffer_head *bh, wait_queue_t *wait)
+{
+	if (buffer_locked(bh))
+		return __wait_on_buffer_wq(bh, wait);
+
+	return 0;
+}
+
 static inline void lock_buffer(struct buffer_head *bh)
 {
 	while (test_set_buffer_locked(bh))
 		__wait_on_buffer(bh);
 }
 
+static inline int lock_buffer_wq(struct buffer_head *bh, wait_queue_t *wait)
+{
+	if (test_set_buffer_locked(bh))
+		return __wait_on_buffer_wq(bh, wait);
+
+	return 0;
+}
+
 #endif /* _LINUX_BUFFER_HEAD_H */

-- 
Suparna Bhattacharya (suparna@in.ibm.com)
Linux Technology Center
IBM Software Labs, India



* [PATCH 7/7] Filesystem AIO rdwr: async get block for ext2
  2003-04-24  4:52 Filesystem AIO read-write patches Suparna Bhattacharya
                   ` (5 preceding siblings ...)
  2003-04-24  5:21 ` [PATCH 6/7] Filesystem AIO rdwr - async bread Suparna Bhattacharya
@ 2003-04-24  5:28 ` Suparna Bhattacharya
  2003-04-24 13:13 ` Filesystem AIO read-write patches Benjamin LaHaise
  7 siblings, 0 replies; 11+ messages in thread
From: Suparna Bhattacharya @ 2003-04-24  5:28 UTC (permalink / raw)
  To: bcrl, akpm; +Cc: linux-kernel, linux-aio, linux-fsdevel

On Thu, Apr 24, 2003 at 10:22:22AM +0530, Suparna Bhattacharya wrote:
> Here is a revised version of the filesystem AIO patches
> for 2.5.68.
> 
> 07ext2getblk_wq.patch	: Async get block support for 
> 			  the ext2 filesystem

ext2_get_block_wq() is used only by ext2_prepare_write().
All other cases still use the synchronous get block
version.

Regards
Suparna

-- 
Suparna Bhattacharya (suparna@in.ibm.com)
Linux Technology Center
IBM Software Labs, India

07ext2getblk_wq.patch
.....................
 fs/ext2/inode.c |   44 +++++++++++++++++++++++++++++++++++---------
 1 files changed, 35 insertions(+), 9 deletions(-)


diff -ur -X /home/kiran/dontdiff linux-2.5.68/fs/ext2/inode.c linux-aio-2568/fs/ext2/inode.c
--- linux-2.5.68/fs/ext2/inode.c	Fri Apr 11 21:10:49 2003
+++ linux-aio-2568/fs/ext2/inode.c	Wed Apr 16 22:48:49 2003
@@ -257,11 +258,12 @@
  *	or when it reads all @depth-1 indirect blocks successfully and finds
  *	the whole chain, all way to the data (returns %NULL, *err == 0).
  */
-static Indirect *ext2_get_branch(struct inode *inode,
+static Indirect *ext2_get_branch_wq(struct inode *inode,
 				 int depth,
 				 int *offsets,
 				 Indirect chain[4],
-				 int *err)
+				 int *err,
+				 wait_queue_t *wait)
 {
 	struct super_block *sb = inode->i_sb;
 	Indirect *p = chain;
@@ -273,8 +275,8 @@
 	if (!p->key)
 		goto no_block;
 	while (--depth) {
-		bh = sb_bread(sb, le32_to_cpu(p->key));
-		if (!bh)
+		bh = sb_bread_wq(sb, le32_to_cpu(p->key), wait);
+		if (!bh || IS_ERR(bh))
 			goto failure;
 		read_lock(&EXT2_I(inode)->i_meta_lock);
 		if (!verify_chain(chain, p))
@@ -292,11 +294,21 @@
 	*err = -EAGAIN;
 	goto no_block;
 failure:
-	*err = -EIO;
+	*err = IS_ERR(bh) ? PTR_ERR(bh) : -EIO;
 no_block:
 	return p;
 }
 
+static Indirect *ext2_get_branch(struct inode *inode,
+				 int depth,
+				 int *offsets,
+				 Indirect chain[4],
+				 int *err)
+{
+	return ext2_get_branch_wq(inode, depth, offsets, chain, 
+		err, NULL);
+}
+
 /**
  *	ext2_find_near - find a place for allocation with sufficient locality
  *	@inode: owner
@@ -536,7 +548,8 @@
  * reachable from inode.
  */
 
-static int ext2_get_block(struct inode *inode, sector_t iblock, struct buffer_head *bh_result, int create)
+static int ext2_get_block_wq(struct inode *inode, sector_t iblock, 
+	struct buffer_head *bh_result, int create, wait_queue_t *wait)
 {
 	int err = -EIO;
 	int offsets[4];
@@ -551,7 +564,8 @@
 		goto out;
 
 reread:
-	partial = ext2_get_branch(inode, depth, offsets, chain, &err);
+	partial = ext2_get_branch_wq(inode, depth, offsets, chain, &err, 
+		wait);
 
 	/* Simplest case - block found, no allocation needed */
 	if (!partial) {
@@ -565,7 +579,7 @@
 	}
 
 	/* Next simple case - plain lookup or failed read of indirect block */
-	if (!create || err == -EIO) {
+	if (!create || err == -EIO || err == -EIOCBRETRY) {
 cleanup:
 		while (partial > chain) {
 			brelse(partial->bh);
@@ -606,6 +620,19 @@
 	goto reread;
 }
 
+static int ext2_get_block_async(struct inode *inode, sector_t iblock, 
+	struct buffer_head *bh_result, int create)
+{
+	return ext2_get_block_wq(inode, iblock, bh_result, create, 
+		current->io_wait);
+}
+
+static int ext2_get_block(struct inode *inode, sector_t iblock, 
+	struct buffer_head *bh_result, int create)
+{
+	return ext2_get_block_wq(inode, iblock, bh_result, create, NULL);
+}
+
 static int ext2_writepage(struct page *page, struct writeback_control *wbc)
 {
 	return block_write_full_page(page, ext2_get_block, wbc);
@@ -627,7 +654,7 @@
 ext2_prepare_write(struct file *file, struct page *page,
 			unsigned from, unsigned to)
 {
-	return block_prepare_write(page,from,to,ext2_get_block);
+	return block_prepare_write(page,from,to,ext2_get_block_async);
 }
 
 static int

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: Filesystem AIO read-write patches
  2003-04-24  5:11 ` Filesystem AIO read-write patches Suparna Bhattacharya
@ 2003-04-24 12:48   ` Matthew Wilcox
  0 siblings, 0 replies; 11+ messages in thread
From: Matthew Wilcox @ 2003-04-24 12:48 UTC (permalink / raw)
  To: Suparna Bhattacharya; +Cc: bcrl, akpm, linux-kernel, linux-aio, linux-fsdevel

On Thu, Apr 24, 2003 at 10:41:03AM +0530, Suparna Bhattacharya wrote:
> -void blk_congestion_wait(int rw, long timeout)
> +int blk_congestion_wait_wq(int rw, long timeout, wait_queue_t *wait)
>  {
> -	DEFINE_WAIT(wait);
>  	wait_queue_head_t *wqh = &congestion_wqh[rw];
> +	DEFINE_WAIT(local_wait);
> +
> +	if (!wait)
> +		wait = &local_wait;
>  
>  	blk_run_queues();
> -	prepare_to_wait(wqh, &wait, TASK_UNINTERRUPTIBLE);
> +	prepare_to_wait(wqh, wait, TASK_UNINTERRUPTIBLE);
> +	if (!is_sync_wait(wait)) 
> +		return -EIOCBRETRY;
> +
>  	io_schedule_timeout(timeout);
> -	finish_wait(wqh, &wait);
> +	finish_wait(wqh, wait);
> +	return 0;
> +}
> +
> +void blk_congestion_wait(int rw, long timeout)
> +{
> +	blk_congestion_wait_wq(rw, timeout, NULL);
>  }

I don't like this one.  DEFINE_WAIT allocates stuff on the stack.  
And aio seems to allocate a lot more on the stack than normal io does.
How about the following instead:

int blk_congestion_wait_wq(int rw, long timeout, wait_queue_t *wait)
{
 	wait_queue_head_t *wqh = &congestion_wqh[rw];

 	blk_run_queues();
	prepare_to_wait(wqh, wait, TASK_UNINTERRUPTIBLE);
	if (!is_sync_wait(wait)) 
		return -EIOCBRETRY;

	io_schedule_timeout(timeout);
	finish_wait(wqh, wait);
	return 0;
}

void blk_congestion_wait(int rw, long timeout)
{
	DEFINE_WAIT(wait);
	blk_congestion_wait_wq(rw, timeout, &wait);
}
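
(For illustration only -- not part of the suggestion above: an aio
submit/retry path would then pass its persistent wait queue entry
explicitly and bail out instead of blocking, roughly

	err = blk_congestion_wait_wq(rw, timeout, current->io_wait);
	if (err == -EIOCBRETRY)
		return err;	/* retried once the congestion wait completes */

while synchronous callers keep using blk_congestion_wait() unchanged.)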

-- 
"It's not Hollywood.  War is real, war is primarily not about defeat or
victory, it is about death.  I've seen thousands and thousands of dead bodies.
Do you think I want to have an academic debate on this subject?" -- Robert Fisk

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: Filesystem AIO read-write patches
  2003-04-24  4:52 Filesystem AIO read-write patches Suparna Bhattacharya
                   ` (6 preceding siblings ...)
  2003-04-24  5:28 ` [PATCH 7/7] Filesystem AIO rdwr: async get block for ext2 Suparna Bhattacharya
@ 2003-04-24 13:13 ` Benjamin LaHaise
  2003-04-24 14:27   ` Suparna Bhattacharya
  7 siblings, 1 reply; 11+ messages in thread
From: Benjamin LaHaise @ 2003-04-24 13:13 UTC (permalink / raw)
  To: Suparna Bhattacharya; +Cc: akpm, linux-kernel, linux-aio, linux-fsdevel

On Thu, Apr 24, 2003 at 10:22:22AM +0530, Suparna Bhattacharya wrote:
> The task->io_wait field reflects the wait context in which
> a task is executing its i/o operations. For synchronous i/o
> task->io_wait is NULL, and the wait context is local on
> stack; for threads doing io submits or retries on behalf 
> of async i/o callers, tsk->io_wait is the wait queue 
> function entry to be notified on completion of a condition 
> required for i/o to progress. 

This is seriously wrong.  Changing the behaviour of functions to depend on 
a hidden parameter is bound to lead to coding errors.  It is far better to 
make the parameter explicit rather than implicit.

		-ben

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: Filesystem AIO read-write patches
  2003-04-24 13:13 ` Filesystem AIO read-write patches Benjamin LaHaise
@ 2003-04-24 14:27   ` Suparna Bhattacharya
  0 siblings, 0 replies; 11+ messages in thread
From: Suparna Bhattacharya @ 2003-04-24 14:27 UTC (permalink / raw)
  To: Benjamin LaHaise; +Cc: akpm, linux-kernel, linux-aio, linux-fsdevel

On Thu, Apr 24, 2003 at 09:13:12AM -0400, Benjamin LaHaise wrote:
> On Thu, Apr 24, 2003 at 10:22:22AM +0530, Suparna Bhattacharya wrote:
> > The task->io_wait field reflects the wait context in which
> > a task is executing its i/o operations. For synchronous i/o
> > task->io_wait is NULL, and the wait context is local on
> > stack; for threads doing io submits or retries on behalf 
> > of async i/o callers, tsk->io_wait is the wait queue 
> > function entry to be notified on completion of a condition 
> > required for i/o to progress. 
> 
> This is seriously wrong.  Changing the behaviour of functions to depend on 
> a hidden parameter is bound to lead to coding errors.  It is far better to 
> make the parameter explicit rather than implicit.

I really did think this over for several days before 
putting out the new iteration. And I'm still in two minds 
about this.

To me it seems like part of the context of execution rather
than a hidden parameter as you see it, much like an address
space (the way you do a use_mm() for retries) or credentials
for that matter (other operating systems, for example, pass a
cred structure to vfs routines, but Linux doesn't). It also
helps that a wait is typically associated with a task context
anyway (in the synchronous world).

What determines whether something should be picked up as
part of a context or passed in as a parameter?

It should be a parameter if different values may be passed 
in within the same context (e.g. routines which can modify
a different address space would take an mm parameter).

That's why all the _wq routines take a wait queue parameter,
and only routines which really are meant to be context
sensitive reference tsk->io_wait. That's why I got rid of
do_sync_op(), but still kept tsk->io_wait ...
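
To make the convention concrete (just repeating the ext2 wrappers from
patch 7, since they illustrate the point): the _wq routine takes the
wait queue entry explicitly, and only the thin entry points decide
where that entry comes from:

	/* aio submit/retry threads: pick up the task's wait context */
	static int ext2_get_block_async(struct inode *inode, sector_t iblock,
		struct buffer_head *bh_result, int create)
	{
		return ext2_get_block_wq(inode, iblock, bh_result, create,
			current->io_wait);
	}

	/* synchronous callers: behaviour unchanged */
	static int ext2_get_block(struct inode *inode, sector_t iblock,
		struct buffer_head *bh_result, int create)
	{
		return ext2_get_block_wq(inode, iblock, bh_result, create, NULL);
	}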

The only thing that nags me is that the context of
an io operation is not a very long-lived one. 

That said, the main problem in implementing what you suggest
is that we'll need to change at least the following APIs
right away for all filesystems:

- aops->prepare_write() 
- fs specific get block routines

to take a wait queue parameter (I don't say iocb here,
because these are per-page or per-block routines, which
don't use the iocb information).
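
(A purely hypothetical sketch, just to make the discussion concrete --
this is not in the current patches -- the explicit-parameter variants
being talked about would look roughly like:

	int (*prepare_write)(struct file *file, struct page *page,
			unsigned from, unsigned to, wait_queue_t *wait);

	typedef int (get_block_wq_t)(struct inode *inode, sector_t iblock,
			struct buffer_head *bh_result, int create,
			wait_queue_t *wait);

and every filesystem implementing these interfaces would need to be
updated.)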

If there is general agreement that we should do this for
2.5, then sure, we can go ahead. But at least I need to
hear a "yes, it's worth the pain, and yes, it's the right
thing to do for 2.5" from the affected people.

The more asynchrony we decide to add, the more such API
changes will be needed.

So if the decision is to go ahead and break all filesystems 
anyway, can we decide how far we should go with these and 
have the API changes made sooner rather than later?

Should we be adding a wait queue parameter to more
address space operations?

Regards
Suparna

-- 
Suparna Bhattacharya (suparna@in.ibm.com)
Linux Technology Center
IBM Software Labs, India


^ permalink raw reply	[flat|nested] 11+ messages in thread

end of thread, other threads:[~2003-04-24 14:10 UTC | newest]

Thread overview: 11+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2003-04-24  4:52 Filesystem AIO read-write patches Suparna Bhattacharya
2003-04-24  5:00 ` [PATCH 1/7] Filesystem AIO rdwr - aio retry core changes Suparna Bhattacharya
2003-04-24  5:04 ` [PATCH 2/7] Filesystem AIO rdwr - aio read Suparna Bhattacharya
2003-04-24  5:11 ` Filesystem AIO read-write patches Suparna Bhattacharya
2003-04-24 12:48   ` Matthew Wilcox
2003-04-24  5:16 ` [PATCH 4/7] Filesystem AIO rdwr - async down (x86) Suparna Bhattacharya
2003-04-24  5:19 ` [5/7] Filesystem AIO rdwr - use down_wq for aio write Suparna Bhattacharya
2003-04-24  5:21 ` [PATCH 6/7] Filesystem AIO rdwr - async bread Suparna Bhattacharya
2003-04-24  5:28 ` [PATCH 7/7] Filesystem AIO rdwr: async get block for ext2 Suparna Bhattacharya
2003-04-24 13:13 ` Filesystem AIO read-write patches Benjamin LaHaise
2003-04-24 14:27   ` Suparna Bhattacharya
