* Filesystem AIO read-write patches
@ 2003-04-24 4:52 Suparna Bhattacharya
From: Suparna Bhattacharya @ 2003-04-24 4:52 UTC (permalink / raw)
To: bcrl, akpm; +Cc: linux-kernel, linux-aio, linux-fsdevel
Here is a revised version of the filesystem AIO patches
for 2.5.68.
It is built on a variation of the simple retry based
scheme originally suggested by Ben LaHaise.
Why?
------
Because 2.5 is still missing real support for regular
filesystem AIO (except for O_DIRECT).
ext2, jfs and nfs define the fops aio interfaces aio_read
and aio_write to default to generic_file_aio_read/write.
However, these routines behave fully synchronously
unless the file was opened with O_DIRECT. This means that
an io_submit could merrily block for regular aio read/write
operations, while an application thinks it's doing async
i/o.
How?
------
The approach we took was to identify and focus on
the most significant blocking points (seconded by
observations from initial experimentation and profiling
results), and convert them to retry exit points.
Retries start at a very high level, driven directly by
the aio infrastructure (in future, if the in-kernel fs
apis change, then retries could be modified to happen
one level below, i.e. at the api level). They are kicked
off via async wait queue functions. In synchronous i/o
context the default wait queue entries are synchronous,
and hence don't cause an exit at a retry point.
One of the considerations was to take a careful and
less intrusive route, with minimal changes to existing
synchronous i/o paths. The intent was to achieve a
reasonable level of asynchrony in a way that could then
be further optimized and tuned for workloads of relevance.
The Patches:
-----------
(which I'll be mailing out as responses to this note)
01aioretry.patch : Base aio retry infrastructure
02aiord.patch : Filesystem aio read
03aiowr.patch : Minimal filesystem aio write
(for all archs and all filesystems
using generic_file_aio_write)
04down_wq-86.patch : An asynchronous semaphore down
implementation (currently x86
only)
05aiowrdown_wq.patch : Uses async down for aio write
06bread_wq.patch : Async bread implementation
07ext2getblk_wq.patch : Async get block support for
the ext2 filesystem
Observations
--------------
As a quick check that this really works, I could
observe a decent reduction in the time spent in io_submit
(especially for large reads) when the file is not already
cached (e.g. first-time access). For the write case, I
found that I had to add the async get block support
to get a perceptible benefit. For the cached case, there
wasn't any observable difference, which is expected.
The patch didn't seem to hurt synchronous read/
write performance in a simple test.
Another thing I tried out was to temporarily move the
retries into io_getevents rather than worker threads
just as a sanity check for any gross impact on cpu
utilization. That seemed OK too.
Of course thorough performance testing is needed and
would show up places where there is scope for
tuning, and how it affects overall system performance
numbers.
I have been playing with it for a while now, and so
far it's been running OK for me.
I would welcome feedback, bug reports, test results etc.
Full diffstat:
aiordwr-rollup.patch:
......................
arch/i386/kernel/i386_ksyms.c | 2
arch/i386/kernel/semaphore.c | 30 ++-
drivers/block/ll_rw_blk.c | 21 +-
fs/aio.c | 371 +++++++++++++++++++++++++++++++++---------
fs/buffer.c | 54 +++++-
fs/ext2/inode.c | 44 +++-
include/asm-i386/semaphore.h | 27 ++-
include/linux/aio.h | 32 +++
include/linux/blkdev.h | 1
include/linux/buffer_head.h | 30 +++
include/linux/errno.h | 1
include/linux/init_task.h | 1
include/linux/pagemap.h | 19 ++
include/linux/sched.h | 2
include/linux/wait.h | 2
include/linux/writeback.h | 4
kernel/fork.c | 9 -
mm/filemap.c | 97 +++++++++-
mm/page-writeback.c | 17 +
19 files changed, 616 insertions(+), 148 deletions(-)
[The patches are also available for download on the
Linux Scalability Effort project site
(http://sourceforge.net/projects/lse)
Categorized under the "aio" release in IO Scalability
section
http://sourceforge.net/project/showfiles.php?group_id=8875]
A rollup version containing all 7 patches
(aiordwr-rollup.patch) will be made available as well.
Major additions/changes since previous versions posted:
------------------------------------------------------
- Introduced _wq versions of low level routines like
lock_page_wq, wait_on_page_bit_wq etc, which take the
wait_queue entry as a parameter (Thanks to Christoph
Hellwig for suggesting the new and much better
names :)).
- Reorganized code to avoid having to use the do_sync_op()
wrapper (because the forced emulation of the i/o wait
context seemed an overhead and not very elegant).
- (New) Implementation of an asynchronous semaphore down
operation for x86 (down_wq).
- Have dropped the async block allocation portions from the
async ext2_get_block patch after a discussion with Stephen
Tweedie (the i/o patterns we anticipate are less likely
to extend file sizes)
- Fixed use_mm() to clear the lazy tlb setting (I traced
some of the strange hangs I was seeing for large reads
to this)
- Removed the aio_run_iocbs() acceleration from io_getevents,
now that the above problem is gone.
Todos/TBDs:
----------
- Support for down_wq on other archs or provide compatibility
definitions for archs where it is not implemented
(Need feedback on this)
- Should the cond_resched() calls in read/write be
converted to retry points (would need ctx-specific
worker threads)?
- Look at async get block implementations for other
filesystems (e.g. jfs)?
- Optional: Check if it makes sense to use the retry model
for O_DIRECT (or change sync O_DIRECT to wait for
completion of async O_DIRECT)?
- Upgrade to Ben's aio api changes (collapse of api parameters
into an rw_iocb) if and when it gets merged
A few comments on low level implementation details:
--------------------------------------------------
io_wait context
-------------------
The task->io_wait field reflects the wait context in which
a task is executing its i/o operations. For synchronous i/o,
task->io_wait is NULL, and the wait context is local on
the stack; for threads doing io submits or retries on behalf
of async i/o callers, task->io_wait is the wait queue
function entry to be notified on completion of a condition
required for i/o to progress.
Low level _wq routines take a wait queue parameter, so
they can be invoked in either async or sync mode, even
when running in async context (e.g. when servicing a page
fault during an async retry).
Routines which are expected to be async whenever they are
running in async context and sync when running in sync
context do not need to provide a wait queue parameter.
do_sync_op()
---------------
The do_sync_op() wrappers are not typically needed anymore
for sync versions of the operations; passing NULL to the
corresponding _wq functions suffices.
However, there may be weird cases where we may have several
levels of nesting like:
A()->B()->C()->D()->F()->iowait()
It may seem unnatural to pass a wait queue argument all
the way through, but if we need to force sync behaviour
in a certain case even when called under async context,
and have async behaviour in another, then we may need to
resort to using do_sync_op() (e.g. if we had kept the
ext2 async block allocation modifications).
Regards
Suparna
--
Suparna Bhattacharya (suparna@in.ibm.com)
Linux Technology Center
IBM Software Labs, India
* [PATCH 1/7] Filesystem AIO rdwr - aio retry core changes
From: Suparna Bhattacharya @ 2003-04-24 5:00 UTC (permalink / raw)
To: bcrl, akpm; +Cc: linux-kernel, linux-aio, linux-fsdevel
On Thu, Apr 24, 2003 at 10:22:22AM +0530, Suparna Bhattacharya wrote:
> Here is a revised version of the filesystem AIO patches
> for 2.5.68.
>
> It is built on a variation of the simple retry based
> scheme originally suggested by Ben LaHaise.
>
> 01aioretry.patch : Base aio retry infrastructure
>
fs/aio.c | 371 ++++++++++++++++++++++++++++++++++++----------
include/linux/aio.h | 32 +++
include/linux/errno.h | 1
include/linux/init_task.h | 1
include/linux/sched.h | 2
include/linux/wait.h | 2
kernel/fork.c | 9 -
7 files changed, 337 insertions(+), 81 deletions(-)
diff -ur -X /home/kiran/dontdiff linux-2.5.68/fs/aio.c linux-aio-2568/fs/aio.c
--- linux-2.5.68/fs/aio.c Fri Apr 11 21:10:48 2003
+++ linux-aio-2568/fs/aio.c Wed Apr 23 17:15:30 2003
@@ -39,6 +39,9 @@
#define dprintk(x...) do { ; } while (0)
#endif
+long aio_run = 0; /* for testing only */
+long aio_wakeups = 0; /* for testing only */
+
/*------ sysctl variables----*/
atomic_t aio_nr = ATOMIC_INIT(0); /* current system wide number of aio requests */
unsigned aio_max_nr = 0x10000; /* system wide maximum number of aio requests */
@@ -281,6 +284,7 @@
struct kiocb *iocb = list_kiocb(pos);
list_del_init(&iocb->ki_list);
cancel = iocb->ki_cancel;
+ kiocbSetCancelled(iocb);
if (cancel) {
iocb->ki_users++;
spin_unlock_irq(&ctx->ctx_lock);
@@ -395,6 +400,7 @@
req->ki_cancel = NULL;
req->ki_retry = NULL;
req->ki_user_obj = NULL;
+ INIT_LIST_HEAD(&req->ki_run_list);
/* Check if the completion queue has enough free space to
* accept an event from this io.
@@ -544,60 +550,147 @@
struct mm_struct *active_mm = current->active_mm;
atomic_inc(&mm->mm_count);
current->mm = mm;
- if (mm != active_mm) {
- current->active_mm = mm;
- activate_mm(active_mm, mm);
- }
+
+ current->active_mm = mm;
+ activate_mm(active_mm, mm);
+
mmdrop(active_mm);
}
-static void unuse_mm(struct mm_struct *mm)
+void unuse_mm(struct mm_struct *mm)
{
current->mm = NULL;
/* active_mm is still 'mm' */
enter_lazy_tlb(mm, current, smp_processor_id());
}
-/* Run on kevent's context. FIXME: needs to be per-cpu and warn if an
- * operation blocks.
- */
-static void aio_kick_handler(void *data)
+static inline int __queue_kicked_iocb(struct kiocb *iocb)
{
- struct kioctx *ctx = data;
+ struct kioctx *ctx = iocb->ki_ctx;
- use_mm(ctx->mm);
+ if (list_empty(&iocb->ki_run_list)) {
+ list_add_tail(&iocb->ki_run_list,
+ &ctx->run_list);
+ iocb->ki_queued++;
+ return 1;
+ }
+ return 0;
+}
- spin_lock_irq(&ctx->ctx_lock);
- while (!list_empty(&ctx->run_list)) {
- struct kiocb *iocb;
- long ret;
+/* Expects to be called with iocb->ki_ctx->lock held */
+static ssize_t aio_run_iocb(struct kiocb *iocb)
+{
+ struct kioctx *ctx = iocb->ki_ctx;
+ ssize_t (*retry)(struct kiocb *);
+ ssize_t ret;
- iocb = list_entry(ctx->run_list.next, struct kiocb,
- ki_run_list);
- list_del(&iocb->ki_run_list);
- iocb->ki_users ++;
- spin_unlock_irq(&ctx->ctx_lock);
+ if (iocb->ki_retried++ > 1024*1024) {
+ printk("Maximal retry count. Bytes done %d\n",
+ iocb->ki_nbytes - iocb->ki_left);
+ return -EAGAIN;
+ }
- kiocbClearKicked(iocb);
- ret = iocb->ki_retry(iocb);
+ if (!(iocb->ki_retried & 0xff)) {
+ printk("%ld retry: %d of %d (kick %ld, Q %ld run %ld, wake %ld)\n",
+ iocb->ki_retried,
+ iocb->ki_nbytes - iocb->ki_left, iocb->ki_nbytes,
+ iocb->ki_kicked, iocb->ki_queued, aio_run, aio_wakeups);
+ }
+
+ if (!(retry = iocb->ki_retry)) {
+ printk("aio_run_iocb: iocb->ki_retry = NULL\n");
+ return 0;
+ }
+
+ iocb->ki_users ++;
+ kiocbClearKicked(iocb);
+ iocb->ki_run_list.next = iocb->ki_run_list.prev = NULL;
+ iocb->ki_retry = NULL;
+ spin_unlock_irq(&ctx->ctx_lock);
+
+ if (kiocbIsCancelled(iocb)) {
+ aio_complete(iocb, -EINTR, 0);
+ spin_lock_irq(&ctx->ctx_lock);
+ __aio_put_req(ctx, iocb);
+ return -EINTR;
+ }
+
+ BUG_ON(current->io_wait != NULL);
+ current->io_wait = &iocb->ki_wait;
+ ret = retry(iocb);
+ current->io_wait = NULL;
+
+ if (-EIOCBRETRY != ret) {
if (-EIOCBQUEUED != ret) {
+ BUG_ON(!list_empty(&iocb->ki_wait.task_list));
aio_complete(iocb, ret, 0);
- iocb = NULL;
}
+ } else {
+ if (list_empty(&iocb->ki_wait.task_list))
+ kiocbSetKicked(iocb);
+ }
+ spin_lock_irq(&ctx->ctx_lock);
- spin_lock_irq(&ctx->ctx_lock);
- if (NULL != iocb)
- __aio_put_req(ctx, iocb);
+ iocb->ki_retry = retry;
+ INIT_LIST_HEAD(&iocb->ki_run_list);
+ if (kiocbIsKicked(iocb)) {
+ BUG_ON(ret != -EIOCBRETRY);
+ __queue_kicked_iocb(iocb);
+ }
+ __aio_put_req(ctx, iocb);
+ return ret;
+}
+
+static void aio_run_iocbs(struct kioctx *ctx)
+{
+ struct kiocb *iocb;
+ ssize_t ret;
+ int count = 0;
+
+ spin_lock_irq(&ctx->ctx_lock);
+ while (!list_empty(&ctx->run_list)) {
+ iocb = list_entry(ctx->run_list.next, struct kiocb,
+ ki_run_list);
+ list_del(&iocb->ki_run_list);
+ ret = aio_run_iocb(iocb);
+ count++;
}
spin_unlock_irq(&ctx->ctx_lock);
+ aio_run++;
+}
+
+/* Run on aiod/kevent's context. FIXME: needs to be per-cpu and warn if an
+ * operation blocks.
+ */
+static void aio_kick_handler(void *data)
+{
+ struct kioctx *ctx = data;
+ use_mm(ctx->mm);
+ aio_run_iocbs(ctx);
unuse_mm(ctx->mm);
}
-void kick_iocb(struct kiocb *iocb)
+
+void queue_kicked_iocb(struct kiocb *iocb)
{
struct kioctx *ctx = iocb->ki_ctx;
+ unsigned long flags;
+ int run = 0;
+
+ WARN_ON((!list_empty(&iocb->ki_wait.task_list)));
+ spin_lock_irqsave(&ctx->ctx_lock, flags);
+ run = __queue_kicked_iocb(iocb);
+ spin_unlock_irqrestore(&ctx->ctx_lock, flags);
+ if (run) {
+ queue_work(aio_wq, &ctx->wq);
+ aio_wakeups++;
+ }
+}
+
+void kick_iocb(struct kiocb *iocb)
+{
/* sync iocbs are easy: they can only ever be executing from a
* single context. */
if (is_sync_kiocb(iocb)) {
@@ -606,12 +699,9 @@
return;
}
+ iocb->ki_kicked++;
if (!kiocbTryKick(iocb)) {
- unsigned long flags;
- spin_lock_irqsave(&ctx->ctx_lock, flags);
- list_add_tail(&iocb->ki_run_list, &ctx->run_list);
- spin_unlock_irqrestore(&ctx->ctx_lock, flags);
- schedule_work(&ctx->wq);
+ queue_kicked_iocb(iocb);
}
}
@@ -664,6 +754,9 @@
*/
spin_lock_irqsave(&ctx->ctx_lock, flags);
+ if (iocb->ki_run_list.prev && !list_empty(&iocb->ki_run_list))
+ list_del_init(&iocb->ki_run_list);
+
ring = kmap_atomic(info->ring_pages[0], KM_IRQ1);
tail = info->tail;
@@ -693,6 +786,11 @@
pr_debug("added to ring %p at [%lu]\n", iocb, tail);
+ pr_debug("%ld retries: %d of %d (kicked %ld, Q %ld run %ld wake %ld)\n",
+ iocb->ki_retried,
+ iocb->ki_nbytes - iocb->ki_left, iocb->ki_nbytes,
+ iocb->ki_kicked, iocb->ki_queued, aio_run, aio_wakeups);
+
/* everything turned out well, dispose of the aiocb. */
ret = __aio_put_req(ctx, iocb);
@@ -807,6 +905,7 @@
int i = 0;
struct io_event ent;
struct timeout to;
+ int event_loop = 0; /* testing only */
/* needed to zero any padding within an entry (there shouldn't be
* any, but C is fun!
@@ -856,7 +955,6 @@
add_wait_queue_exclusive(&ctx->wait, &wait);
do {
set_task_state(tsk, TASK_INTERRUPTIBLE);
-
ret = aio_read_evt(ctx, &ent);
if (ret)
break;
@@ -866,6 +964,7 @@
if (to.timed_out) /* Only check after read evt */
break;
schedule();
+ event_loop++;
if (signal_pending(tsk)) {
ret = -EINTR;
break;
@@ -893,6 +992,9 @@
if (timeout)
clear_timeout(&to);
out:
+ pr_debug("event loop executed %d times\n", event_loop);
+ pr_debug("aio_run %ld\n", aio_run);
+ pr_debug("aio_wakeups %ld\n", aio_wakeups);
return i ? i : ret;
}
@@ -984,6 +1086,143 @@
return -EINVAL;
}
+ssize_t aio_pread(struct kiocb *iocb)
+{
+ struct file *file = iocb->ki_filp;
+ ssize_t ret = 0;
+
+ ret = file->f_op->aio_read(iocb, iocb->ki_buf,
+ iocb->ki_left, iocb->ki_pos);
+
+ /*
+ * Can't just depend on iocb->ki_left to determine
+ * whether we are done. This may have been a short read.
+ */
+ if (ret > 0) {
+ iocb->ki_buf += ret;
+ iocb->ki_left -= ret;
+
+ ret = -EIOCBRETRY;
+ }
+
+ /* This means we must have transferred all that we could */
+ /* No need to retry anymore */
+ if ((ret == 0) || (iocb->ki_left == 0))
+ ret = iocb->ki_nbytes - iocb->ki_left;
+
+ return ret;
+}
+
+ssize_t aio_pwrite(struct kiocb *iocb)
+{
+ struct file *file = iocb->ki_filp;
+ ssize_t ret = 0;
+
+ ret = file->f_op->aio_write(iocb, iocb->ki_buf,
+ iocb->ki_left, iocb->ki_pos);
+
+ /*
+ * TBD: Even if iocb->ki_left = 0, could we need to
+ * wait for data to be sync'd ? Or can we assume
+ * that aio_fdsync/aio_fsync would be called explicitly
+ * as required.
+ */
+ if (ret > 0) {
+ iocb->ki_buf += ret;
+ iocb->ki_left -= ret;
+
+ ret = -EIOCBRETRY;
+ }
+
+ /* This means we must have transferred all that we could */
+ /* No need to retry anymore */
+ if (ret == 0)
+ ret = iocb->ki_nbytes - iocb->ki_left;
+
+ return ret;
+}
+
+ssize_t aio_fdsync(struct kiocb *iocb)
+{
+ struct file *file = iocb->ki_filp;
+ ssize_t ret = -EINVAL;
+
+ if (file->f_op->aio_fsync)
+ ret = file->f_op->aio_fsync(iocb, 1);
+ return ret;
+}
+
+ssize_t aio_fsync(struct kiocb *iocb)
+{
+ struct file *file = iocb->ki_filp;
+ ssize_t ret = -EINVAL;
+
+ if (file->f_op->aio_fsync)
+ ret = file->f_op->aio_fsync(iocb, 0);
+ return ret;
+}
+
+/* Called during initial submission and subsequent retry operations */
+ssize_t aio_setup_iocb(struct kiocb *iocb)
+{
+ struct file *file = iocb->ki_filp;
+ ssize_t ret = 0;
+
+ switch (iocb->ki_opcode) {
+ case IOCB_CMD_PREAD:
+ ret = -EBADF;
+ if (unlikely(!(file->f_mode & FMODE_READ)))
+ break;
+ ret = -EFAULT;
+ if (unlikely(!access_ok(VERIFY_WRITE, iocb->ki_buf,
+ iocb->ki_left)))
+ break;
+ ret = -EINVAL;
+ if (file->f_op->aio_read)
+ iocb->ki_retry = aio_pread;
+ break;
+ case IOCB_CMD_PWRITE:
+ ret = -EBADF;
+ if (unlikely(!(file->f_mode & FMODE_WRITE)))
+ break;
+ ret = -EFAULT;
+ if (unlikely(!access_ok(VERIFY_READ, iocb->ki_buf,
+ iocb->ki_left)))
+ break;
+ ret = -EINVAL;
+ if (file->f_op->aio_write)
+ iocb->ki_retry = aio_pwrite;
+ break;
+ case IOCB_CMD_FDSYNC:
+ ret = -EINVAL;
+ if (file->f_op->aio_fsync)
+ iocb->ki_retry = aio_fdsync;
+ break;
+ case IOCB_CMD_FSYNC:
+ ret = -EINVAL;
+ if (file->f_op->aio_fsync)
+ iocb->ki_retry = aio_fsync;
+ break;
+ default:
+ dprintk("EINVAL: io_submit: no operation provided\n");
+ ret = -EINVAL;
+ }
+
+ if (!iocb->ki_retry)
+ return ret;
+
+ return 0;
+}
+
+int aio_wake_function(wait_queue_t *wait, unsigned mode, int sync)
+{
+ struct kiocb *iocb = container_of(wait, struct kiocb, ki_wait);
+
+ list_del_init(&wait->task_list);
+ kick_iocb(iocb);
+ return 1;
+}
+
int FASTCALL(io_submit_one(struct kioctx *ctx, struct iocb *user_iocb,
struct iocb *iocb));
int io_submit_one(struct kioctx *ctx, struct iocb *user_iocb,
@@ -992,7 +1231,6 @@
struct kiocb *req;
struct file *file;
ssize_t ret;
- char *buf;
/* enforce forwards compatibility on users */
if (unlikely(iocb->aio_reserved1 || iocb->aio_reserved2 ||
@@ -1033,51 +1271,31 @@
req->ki_user_data = iocb->aio_data;
req->ki_pos = iocb->aio_offset;
- buf = (char *)(unsigned long)iocb->aio_buf;
+ req->ki_buf = (char *)(unsigned long)iocb->aio_buf;
+ req->ki_left = req->ki_nbytes = iocb->aio_nbytes;
+ req->ki_opcode = iocb->aio_lio_opcode;
+ init_waitqueue_func_entry(&req->ki_wait, aio_wake_function);
+ INIT_LIST_HEAD(&req->ki_wait.task_list);
+ req->ki_run_list.next = req->ki_run_list.prev = NULL;
+ req->ki_retry = NULL;
+ req->ki_retried = 0;
+ req->ki_kicked = 0;
+ req->ki_queued = 0;
+ aio_run = 0;
+ aio_wakeups = 0;
- switch (iocb->aio_lio_opcode) {
- case IOCB_CMD_PREAD:
- ret = -EBADF;
- if (unlikely(!(file->f_mode & FMODE_READ)))
- goto out_put_req;
- ret = -EFAULT;
- if (unlikely(!access_ok(VERIFY_WRITE, buf, iocb->aio_nbytes)))
- goto out_put_req;
- ret = -EINVAL;
- if (file->f_op->aio_read)
- ret = file->f_op->aio_read(req, buf,
- iocb->aio_nbytes, req->ki_pos);
- break;
- case IOCB_CMD_PWRITE:
- ret = -EBADF;
- if (unlikely(!(file->f_mode & FMODE_WRITE)))
- goto out_put_req;
- ret = -EFAULT;
- if (unlikely(!access_ok(VERIFY_READ, buf, iocb->aio_nbytes)))
- goto out_put_req;
- ret = -EINVAL;
- if (file->f_op->aio_write)
- ret = file->f_op->aio_write(req, buf,
- iocb->aio_nbytes, req->ki_pos);
- break;
- case IOCB_CMD_FDSYNC:
- ret = -EINVAL;
- if (file->f_op->aio_fsync)
- ret = file->f_op->aio_fsync(req, 1);
- break;
- case IOCB_CMD_FSYNC:
- ret = -EINVAL;
- if (file->f_op->aio_fsync)
- ret = file->f_op->aio_fsync(req, 0);
- break;
- default:
- dprintk("EINVAL: io_submit: no operation provided\n");
- ret = -EINVAL;
- }
+ ret = aio_setup_iocb(req);
+
+ if ((-EBADF == ret) || (-EFAULT == ret))
+ goto out_put_req;
+
+ spin_lock_irq(&ctx->ctx_lock);
+ ret = aio_run_iocb(req);
+ spin_unlock_irq(&ctx->ctx_lock);
+
+ if (-EIOCBRETRY == ret)
+ queue_work(aio_wq, &ctx->wq);
- if (likely(-EIOCBQUEUED == ret))
- return 0;
- aio_complete(req, ret, 0);
return 0;
out_put_req:
diff -ur -X /home/kiran/dontdiff linux-2.5.68/include/linux/aio.h linux-aio-2568/include/linux/aio.h
--- linux-2.5.68/include/linux/aio.h Fri Apr 11 21:10:50 2003
+++ linux-aio-2568/include/linux/aio.h Thu Apr 17 18:03:53 2003
@@ -54,7 +54,7 @@
struct file *ki_filp;
struct kioctx *ki_ctx; /* may be NULL for sync ops */
int (*ki_cancel)(struct kiocb *, struct io_event *);
- long (*ki_retry)(struct kiocb *);
+ ssize_t (*ki_retry)(struct kiocb *);
struct list_head ki_list; /* the aio core uses this
* for cancellation */
@@ -62,6 +62,16 @@
void *ki_user_obj; /* pointer to userland's iocb */
__u64 ki_user_data; /* user's data for completion */
loff_t ki_pos;
+
+ /* State that we remember to be able to restart/retry */
+ unsigned short ki_opcode;
+ size_t ki_nbytes; /* copy of iocb->aio_nbytes */
+ char *ki_buf; /* remaining iocb->aio_buf */
+ size_t ki_left; /* remaining bytes */
+ wait_queue_t ki_wait;
+ long ki_retried; /* just for testing */
+ long ki_kicked; /* just for testing */
+ long ki_queued; /* just for testing */
char private[KIOCB_PRIVATE_SIZE];
};
@@ -77,6 +87,8 @@
(x)->ki_ctx = &tsk->active_mm->default_kioctx; \
(x)->ki_cancel = NULL; \
(x)->ki_user_obj = tsk; \
+ (x)->ki_user_data = 0; \
+ init_wait((&(x)->ki_wait)); \
} while (0)
#define AIO_RING_MAGIC 0xa10a10a1
@@ -156,6 +168,24 @@
#define get_ioctx(kioctx) do { if (unlikely(atomic_read(&(kioctx)->users) <= 0)) BUG(); atomic_inc(&(kioctx)->users); } while (0)
#define put_ioctx(kioctx) do { if (unlikely(atomic_dec_and_test(&(kioctx)->users))) __put_ioctx(kioctx); else if (unlikely(atomic_read(&(kioctx)->users) < 0)) BUG(); } while (0)
+#define in_aio() !is_sync_wait(current->io_wait)
+
+/* when sync behaviour is desired even if running in async context */
+#define do_sync_op(op) if (in_aio()) { \
+ wait_queue_t *wait = current->io_wait; \
+ current->io_wait = NULL; \
+ op; \
+ current->io_wait = wait; \
+} else { \
+ op; \
+}
+
+#define warn_if_async() if (in_aio()) {\
+ printk(KERN_ERR "%s(%s:%d) called in async context!\n", \
+ __FUNCTION__, __FILE__, __LINE__); \
+ dump_stack(); \
+ }
+
#include <linux/aio_abi.h>
static inline struct kiocb *list_kiocb(struct list_head *h)
diff -ur -X /home/kiran/dontdiff linux-2.5.68/include/linux/errno.h linux-aio-2568/include/linux/errno.h
--- linux-2.5.68/include/linux/errno.h Tue Mar 25 03:31:17 2003
+++ linux-aio-2568/include/linux/errno.h Wed Apr 16 16:41:23 2003
@@ -22,6 +22,7 @@
#define EBADTYPE 527 /* Type not supported by server */
#define EJUKEBOX 528 /* Request initiated, but will not complete before timeout */
#define EIOCBQUEUED 529 /* iocb queued, will get completion event */
+#define EIOCBRETRY 530 /* iocb queued, will trigger a retry */
#endif
diff -ur -X /home/kiran/dontdiff linux-2.5.68/include/linux/init_task.h linux-aio-2568/include/linux/init_task.h
--- linux-2.5.68/include/linux/init_task.h Mon Apr 21 23:30:39 2003
+++ linux-aio-2568/include/linux/init_task.h Mon Apr 21 17:26:58 2003
@@ -103,6 +103,7 @@
.alloc_lock = SPIN_LOCK_UNLOCKED, \
.switch_lock = SPIN_LOCK_UNLOCKED, \
.journal_info = NULL, \
+ .io_wait = NULL, \
}
diff -ur -X /home/kiran/dontdiff linux-2.5.68/include/linux/sched.h linux-aio-2568/include/linux/sched.h
--- linux-2.5.68/include/linux/sched.h Mon Apr 21 23:30:40 2003
+++ linux-aio-2568/include/linux/sched.h Mon Apr 21 17:26:58 2003
@@ -438,6 +438,8 @@
unsigned long ptrace_message;
siginfo_t *last_siginfo; /* For ptrace use. */
+/* current io wait handle */
+ wait_queue_t *io_wait;
};
extern void __put_task_struct(struct task_struct *tsk);
diff -ur -X /home/kiran/dontdiff linux-2.5.68/include/linux/wait.h linux-aio-2568/include/linux/wait.h
--- linux-2.5.68/include/linux/wait.h Tue Mar 25 03:30:44 2003
+++ linux-aio-2568/include/linux/wait.h Tue Apr 22 00:07:42 2003
@@ -80,6 +80,8 @@
return !list_empty(&q->task_list);
}
+#define is_sync_wait(wait) (!(wait) || ((wait)->task))
+
extern void FASTCALL(add_wait_queue(wait_queue_head_t *q, wait_queue_t * wait));
extern void FASTCALL(add_wait_queue_exclusive(wait_queue_head_t *q, wait_queue_t * wait));
extern void FASTCALL(remove_wait_queue(wait_queue_head_t *q, wait_queue_t * wait));
diff -ur -X /home/kiran/dontdiff linux-2.5.68/kernel/fork.c linux-aio-2568/kernel/fork.c
--- linux-2.5.68/kernel/fork.c Mon Apr 21 23:30:40 2003
+++ linux-aio-2568/kernel/fork.c Mon Apr 21 17:26:58 2003
@@ -139,8 +139,9 @@
void prepare_to_wait(wait_queue_head_t *q, wait_queue_t *wait, int state)
{
unsigned long flags;
-
- __set_current_state(state);
+
+ if (is_sync_wait(wait))
+ __set_current_state(state);
wait->flags &= ~WQ_FLAG_EXCLUSIVE;
spin_lock_irqsave(&q->lock, flags);
if (list_empty(&wait->task_list))
@@ -153,7 +154,8 @@
{
unsigned long flags;
- __set_current_state(state);
+ if (is_sync_wait(wait))
+ __set_current_state(state);
wait->flags |= WQ_FLAG_EXCLUSIVE;
spin_lock_irqsave(&q->lock, flags);
if (list_empty(&wait->task_list))
@@ -856,6 +858,7 @@
p->lock_depth = -1; /* -1 = no lock */
p->start_time = get_jiffies_64();
p->security = NULL;
+ p->io_wait = NULL;
retval = -ENOMEM;
if (security_task_alloc(p))
--
Suparna Bhattacharya (suparna@in.ibm.com)
Linux Technology Center
IBM Software Labs, India
* [PATCH 2/7] Filesystem AIO rdwr - aio read
From: Suparna Bhattacharya @ 2003-04-24 5:04 UTC (permalink / raw)
To: bcrl, akpm; +Cc: linux-kernel, linux-aio, linux-fsdevel
On Thu, Apr 24, 2003 at 10:22:22AM +0530, Suparna Bhattacharya wrote:
> Here is a revised version of the filesystem AIO patches
> for 2.5.68.
>
> 02aiord.patch : Filesystem aio read
Changes to generic_file_aio_read code to use the aio
retry infrastructure. The main blocking points addressed
are lock_page and wait_on_page_locked.
Regards
Suparna
--
Suparna Bhattacharya (suparna@in.ibm.com)
Linux Technology Center
IBM Software Labs, India
02aiord.patch
.............
include/linux/pagemap.h | 19 ++++++++++++++
mm/filemap.c | 63 ++++++++++++++++++++++++++++++++++++++++--------
2 files changed, 72 insertions(+), 10 deletions(-)
diff -ur -X /home/kiran/dontdiff linux-2.5.68/include/linux/pagemap.h linux-aio-2568/include/linux/pagemap.h
--- linux-2.5.68/include/linux/pagemap.h Mon Apr 21 23:30:40 2003
+++ linux-aio-2568/include/linux/pagemap.h Mon Apr 21 17:26:58 2003
@@ -135,6 +135,16 @@
if (TestSetPageLocked(page))
__lock_page(page);
}
+
+extern int FASTCALL(__lock_page_wq(struct page *page, wait_queue_t *wait));
+static inline int lock_page_wq(struct page *page, wait_queue_t *wait)
+{
+ if (TestSetPageLocked(page))
+ return __lock_page_wq(page, wait);
+ else
+ return 0;
+}
+
/*
* This is exported only for wait_on_page_locked/wait_on_page_writeback.
@@ -155,6 +165,15 @@
wait_on_page_bit(page, PG_locked);
}
+extern int FASTCALL(wait_on_page_bit_wq(struct page *page, int bit_nr,
+ wait_queue_t *wait));
+static inline int wait_on_page_locked_wq(struct page *page, wait_queue_t *wait)
+{
+ if (PageLocked(page))
+ return wait_on_page_bit_wq(page, PG_locked, wait);
+ return 0;
+}
+
/*
* Wait for a page to complete writeback
*/
diff -ur -X /home/kiran/dontdiff linux-2.5.68/mm/filemap.c linux-aio-2568/mm/filemap.c
--- linux-2.5.68/mm/filemap.c Mon Apr 21 23:30:40 2003
+++ linux-aio-2568/mm/filemap.c Tue Apr 22 00:28:55 2003
@@ -268,19 +268,32 @@
return &zone->wait_table[hash_ptr(page, zone->wait_table_bits)];
}
-void wait_on_page_bit(struct page *page, int bit_nr)
+int wait_on_page_bit_wq(struct page *page, int bit_nr, wait_queue_t *wait)
{
wait_queue_head_t *waitqueue = page_waitqueue(page);
- DEFINE_WAIT(wait);
+ DEFINE_WAIT(local_wait);
+ if (!wait)
+ wait = &local_wait;
+
do {
- prepare_to_wait(waitqueue, &wait, TASK_UNINTERRUPTIBLE);
+ prepare_to_wait(waitqueue, wait, TASK_UNINTERRUPTIBLE);
if (test_bit(bit_nr, &page->flags)) {
sync_page(page);
+ if (!is_sync_wait(wait))
+ return -EIOCBRETRY;
io_schedule();
}
} while (test_bit(bit_nr, &page->flags));
- finish_wait(waitqueue, &wait);
+ finish_wait(waitqueue, wait);
+
+ return 0;
+}
+EXPORT_SYMBOL(wait_on_page_bit_wq);
+
+void wait_on_page_bit(struct page *page, int bit_nr)
+{
+ wait_on_page_bit_wq(page, bit_nr, NULL);
}
EXPORT_SYMBOL(wait_on_page_bit);
@@ -336,19 +349,31 @@
* chances are that on the second loop, the block layer's plug list is empty,
* so sync_page() will then return in state TASK_UNINTERRUPTIBLE.
*/
-void __lock_page(struct page *page)
+int __lock_page_wq(struct page *page, wait_queue_t *wait)
{
wait_queue_head_t *wqh = page_waitqueue(page);
- DEFINE_WAIT(wait);
+ DEFINE_WAIT(local_wait);
+ if (!wait)
+ wait = &local_wait;
+
while (TestSetPageLocked(page)) {
- prepare_to_wait(wqh, &wait, TASK_UNINTERRUPTIBLE);
+ prepare_to_wait(wqh, wait, TASK_UNINTERRUPTIBLE);
if (PageLocked(page)) {
sync_page(page);
+ if (!is_sync_wait(wait))
+ return -EIOCBRETRY;
io_schedule();
}
}
- finish_wait(wqh, &wait);
+ finish_wait(wqh, wait);
+ return 0;
+}
+EXPORT_SYMBOL(__lock_page_wq);
+
+void __lock_page(struct page *page)
+{
+ __lock_page_wq(page, NULL);
}
EXPORT_SYMBOL(__lock_page);
@@ -621,7 +655,13 @@
goto page_ok;
/* Get exclusive access to the page ... */
- lock_page(page);
+
+ if (lock_page_wq(page, current->io_wait)) {
+ pr_debug("queued lock page \n");
+ error = -EIOCBRETRY;
+ /* TBD: should we hold on to the cached page ? */
+ goto sync_error;
+ }
/* Did it get unhashed before we got the lock? */
if (!page->mapping) {
@@ -643,12 +683,19 @@
if (!error) {
if (PageUptodate(page))
goto page_ok;
- wait_on_page_locked(page);
+ if (wait_on_page_locked_wq(page, current->io_wait)) {
+ pr_debug("queued wait_on_page \n");
+ error = -EIOCBRETRY;
+ /*TBD:should we hold on to the cached page ?*/
+ goto sync_error;
+ }
+
if (PageUptodate(page))
goto page_ok;
error = -EIO;
}
+sync_error:
/* UHHUH! A synchronous read error occurred. Report it */
desc->error = error;
page_cache_release(page);
@@ -819,6 +866,10 @@
struct kiocb kiocb;
ssize_t ret;
+ if (current->io_wait != NULL) {
+ printk("current->io_wait != NULL\n");
+ dump_stack();
+ }
init_sync_kiocb(&kiocb, filp);
ret = __generic_file_aio_read(&kiocb, &local_iov, 1, ppos);
if (-EIOCBQUEUED == ret)
@@ -851,6 +902,7 @@
{
read_descriptor_t desc;
+ BUG_ON(current->io_wait != NULL);
if (!count)
return 0;
* Re: Filesystem AIO read-write patches
From: Suparna Bhattacharya @ 2003-04-24 5:11 UTC (permalink / raw)
To: bcrl, akpm; +Cc: linux-kernel, linux-aio, linux-fsdevel
On Thu, Apr 24, 2003 at 10:22:22AM +0530, Suparna Bhattacharya wrote:
> Here is a revised version of the filesystem AIO patches
> for 2.5.68.
>
> 03aiowr.patch : Minimal filesystem aio write
> (for all archs and all filesystems
> using generic_file_aio_write)
Changes to enable generic_file_aio_write to use the aio
retry infrastructure. The blocking points addressed are
in find_lock_page and blk_congestion_wait.
(Async down and async get block are available as optional
follow on patches)
Regards
Suparna
--
Suparna Bhattacharya (suparna@in.ibm.com)
Linux Technology Center
IBM Software Labs, India
03aiowr.patch
..............
drivers/block/ll_rw_blk.c | 21 +++++++++++++++++----
include/linux/blkdev.h | 1 +
include/linux/writeback.h | 4 ++--
mm/filemap.c | 31 ++++++++++++++++++++++++++-----
mm/page-writeback.c | 17 ++++++++++++-----
5 files changed, 58 insertions(+), 16 deletions(-)
diff -ur -X /home/kiran/dontdiff linux-2.5.68/drivers/block/ll_rw_blk.c linux-aio-2568/drivers/block/ll_rw_blk.c
--- linux-2.5.68/drivers/block/ll_rw_blk.c Fri Apr 11 21:10:45 2003
+++ linux-aio-2568/drivers/block/ll_rw_blk.c Wed Apr 16 23:47:25 2003
@@ -1564,17 +1564,30 @@
* If no queues are congested then just wait for the next request to be
* returned.
*/
-void blk_congestion_wait(int rw, long timeout)
+int blk_congestion_wait_wq(int rw, long timeout, wait_queue_t *wait)
{
- DEFINE_WAIT(wait);
wait_queue_head_t *wqh = &congestion_wqh[rw];
+ DEFINE_WAIT(local_wait);
+
+ if (!wait)
+ wait = &local_wait;
blk_run_queues();
- prepare_to_wait(wqh, &wait, TASK_UNINTERRUPTIBLE);
+ prepare_to_wait(wqh, wait, TASK_UNINTERRUPTIBLE);
+ if (!is_sync_wait(wait))
+ return -EIOCBRETRY;
+
io_schedule_timeout(timeout);
- finish_wait(wqh, &wait);
+ finish_wait(wqh, wait);
+ return 0;
+}
+
+void blk_congestion_wait(int rw, long timeout)
+{
+ blk_congestion_wait_wq(rw, timeout, NULL);
}
+
/*
* Has to be called with the request spinlock acquired
*/
diff -ur -X /home/kiran/dontdiff linux-2.5.68/include/linux/blkdev.h linux-aio-2568/include/linux/blkdev.h
--- linux-2.5.68/include/linux/blkdev.h Fri Apr 11 21:10:51 2003
+++ linux-aio-2568/include/linux/blkdev.h Wed Apr 16 16:31:12 2003
@@ -390,6 +390,7 @@
extern void blk_queue_free_tags(request_queue_t *);
extern void blk_queue_invalidate_tags(request_queue_t *);
extern void blk_congestion_wait(int rw, long timeout);
+extern int blk_congestion_wait_wq(int rw, long timeout, wait_queue_t *wait);
#define MAX_PHYS_SEGMENTS 128
#define MAX_HW_SEGMENTS 128
diff -ur -X /home/kiran/dontdiff linux-2.5.68/include/linux/writeback.h linux-aio-2568/include/linux/writeback.h
--- linux-2.5.68/include/linux/writeback.h Tue Mar 25 03:30:01 2003
+++ linux-aio-2568/include/linux/writeback.h Wed Apr 16 16:31:12 2003
@@ -80,8 +80,8 @@
void page_writeback_init(void);
-void balance_dirty_pages(struct address_space *mapping);
-void balance_dirty_pages_ratelimited(struct address_space *mapping);
+int balance_dirty_pages(struct address_space *mapping);
+int balance_dirty_pages_ratelimited(struct address_space *mapping);
int pdflush_operation(void (*fn)(unsigned long), unsigned long arg0);
int do_writepages(struct address_space *mapping, struct writeback_control *wbc);
diff -ur -X /home/kiran/dontdiff linux-2.5.68/mm/filemap.c linux-aio-2568/mm/filemap.c
--- linux-2.5.68/mm/filemap.c Mon Apr 21 23:30:40 2003
+++ linux-aio-2568/mm/filemap.c Tue Apr 22 00:28:55 2003
@@ -398,8 +423,8 @@
*
* Returns zero if the page was not present. find_lock_page() may sleep.
*/
-struct page *find_lock_page(struct address_space *mapping,
- unsigned long offset)
+struct page *find_lock_page_wq(struct address_space *mapping,
+ unsigned long offset, wait_queue_t *wait)
{
struct page *page;
@@ -410,7 +435,10 @@
page_cache_get(page);
if (TestSetPageLocked(page)) {
spin_unlock(&mapping->page_lock);
- lock_page(page);
+ if (-EIOCBRETRY == lock_page_wq(page, wait)) {
+ page_cache_release(page);
+ return ERR_PTR(-EIOCBRETRY);
+ }
spin_lock(&mapping->page_lock);
/* Has the page been truncated while we slept? */
@@ -425,6 +453,12 @@
return page;
}
+struct page *find_lock_page(struct address_space *mapping,
+ unsigned long offset)
+{
+ return find_lock_page_wq(mapping, offset, NULL);
+}
+
/**
* find_or_create_page - locate or add a pagecache page
*
@@ -1373,7 +1425,9 @@
int err;
struct page *page;
repeat:
- page = find_lock_page(mapping, index);
+ page = find_lock_page_wq(mapping, index, current->io_wait);
+ if (IS_ERR(page))
+ return page;
if (!page) {
if (!*cached_page) {
*cached_page = page_cache_alloc(mapping);
@@ -1690,6 +1744,10 @@
fault_in_pages_readable(buf, bytes);
page = __grab_cache_page(mapping,index,&cached_page,&lru_pvec);
+ if (IS_ERR(page)) {
+ status = PTR_ERR(page);
+ break;
+ }
if (!page) {
status = -ENOMEM;
break;
@@ -1697,6 +1755,8 @@
status = a_ops->prepare_write(file, page, offset, offset+bytes);
if (unlikely(status)) {
+ if (-EIOCBRETRY == status)
+ pr_debug("queued prepare_write\n");
/*
* prepare_write() may have instantiated a few blocks
* outside i_size. Trim these off again.
@@ -1737,7 +1797,11 @@
page_cache_release(page);
if (status < 0)
break;
- balance_dirty_pages_ratelimited(mapping);
+ status = balance_dirty_pages_ratelimited(mapping);
+ if (status < 0) {
+ pr_debug("async balance_dirty_pages\n");
+ break;
+ }
cond_resched();
} while (count);
*ppos = pos;
diff -ur -X /home/kiran/dontdiff linux-2.5.68/mm/page-writeback.c linux-aio-2568/mm/page-writeback.c
--- linux-2.5.68/mm/page-writeback.c Mon Apr 21 23:30:40 2003
+++ linux-aio-2568/mm/page-writeback.c Tue Apr 22 01:31:35 2003
@@ -135,7 +135,7 @@
* If we're over `background_thresh' then pdflush is woken to perform some
* writeout.
*/
-void balance_dirty_pages(struct address_space *mapping)
+int balance_dirty_pages(struct address_space *mapping)
{
struct page_state ps;
long nr_reclaimable;
@@ -152,6 +152,7 @@
.sync_mode = WB_SYNC_NONE,
.older_than_this = NULL,
.nr_to_write = write_chunk,
+ .nonblocking = !is_sync_wait(current->io_wait)
};
get_dirty_limits(&ps, &background_thresh, &dirty_thresh);
@@ -178,7 +179,11 @@
if (pages_written >= write_chunk)
break; /* We've done our duty */
}
- blk_congestion_wait(WRITE, HZ/10);
+ if (-EIOCBRETRY == blk_congestion_wait_wq(WRITE, HZ/10,
+ current->io_wait)) {
+ pr_debug("async blk congestion wait\n");
+ return -EIOCBRETRY;
+ }
}
if (nr_reclaimable + ps.nr_writeback <= dirty_thresh)
@@ -186,6 +191,8 @@
if (!writeback_in_progress(bdi) && nr_reclaimable > background_thresh)
pdflush_operation(background_writeout, 0);
+
+ return 0;
}
/**
@@ -201,7 +208,7 @@
* decrease the ratelimiting by a lot, to prevent individual processes from
* overshooting the limit by (ratelimit_pages) each.
*/
-void balance_dirty_pages_ratelimited(struct address_space *mapping)
+int balance_dirty_pages_ratelimited(struct address_space *mapping)
{
static DEFINE_PER_CPU(int, ratelimits) = 0;
int cpu;
@@ -215,10 +222,10 @@
if (per_cpu(ratelimits, cpu)++ >= ratelimit) {
per_cpu(ratelimits, cpu) = 0;
put_cpu();
- balance_dirty_pages(mapping);
- return;
+ return balance_dirty_pages(mapping);
}
put_cpu();
+ return 0;
}
EXPORT_SYMBOL_GPL(balance_dirty_pages_ratelimited);
* [PATCH 4/7] Filesystem AIO rdwr - async down (x86)
2003-04-24 4:52 Filesystem AIO read-write patches Suparna Bhattacharya
` (2 preceding siblings ...)
2003-04-24 5:11 ` Filesystem AIO read-write patches Suparna Bhattacharya
@ 2003-04-24 5:16 ` Suparna Bhattacharya
2003-04-24 5:19 ` [5/7] Filesystem AIO rdwr - use down_wq for aio write Suparna Bhattacharya
` (3 subsequent siblings)
7 siblings, 0 replies; 11+ messages in thread
From: Suparna Bhattacharya @ 2003-04-24 5:16 UTC (permalink / raw)
To: bcrl, akpm; +Cc: linux-kernel, linux-aio, linux-fsdevel
On Thu, Apr 24, 2003 at 10:22:22AM +0530, Suparna Bhattacharya wrote:
> Here is a revised version of the filesystem AIO patches
> for 2.5.68.
>
> 04down_wq-86.patch : An asynchronous semaphore down
> implementation (currently x86
> only)
The semaphore wait logic is embedded deep in arch-specific
code (I'm not sure it really needed to be arch-specific,
though the asm wrappers would still have been needed for
every arch).
So I've done this for x86 only at present. We need to
either extend this to other archs (if we see good
performance gains) or put in compat definitions to avoid
breaking any arch.
04aiodown_wq-x86.patch
......................
arch/i386/kernel/i386_ksyms.c | 2 +-
arch/i386/kernel/semaphore.c | 30 ++++++++++++++++++++----------
include/asm-i386/semaphore.h | 27 ++++++++++++++++++---------
3 files changed, 39 insertions(+), 20 deletions(-)
diff -ur -X /home/kiran/dontdiff linux-2.5.68/arch/i386/kernel/i386_ksyms.c linux-aio-2568/arch/i386/kernel/i386_ksyms.c
--- linux-2.5.68/arch/i386/kernel/i386_ksyms.c Mon Apr 21 23:30:26 2003
+++ linux-aio-2568/arch/i386/kernel/i386_ksyms.c Mon Apr 21 17:26:55 2003
@@ -98,7 +98,7 @@
EXPORT_SYMBOL(__io_virt_debug);
#endif
-EXPORT_SYMBOL_NOVERS(__down_failed);
+EXPORT_SYMBOL_NOVERS(__down_failed_wq);
EXPORT_SYMBOL_NOVERS(__down_failed_interruptible);
EXPORT_SYMBOL_NOVERS(__down_failed_trylock);
EXPORT_SYMBOL_NOVERS(__up_wakeup);
diff -ur -X /home/kiran/dontdiff linux-2.5.68/arch/i386/kernel/semaphore.c linux-aio-2568/arch/i386/kernel/semaphore.c
--- linux-2.5.68/arch/i386/kernel/semaphore.c Tue Mar 25 03:30:20 2003
+++ linux-aio-2568/arch/i386/kernel/semaphore.c Tue Apr 22 01:29:29 2003
@@ -15,6 +15,7 @@
#include <linux/config.h>
#include <linux/sched.h>
#include <linux/err.h>
+#include <linux/errno.h>
#include <asm/semaphore.h>
/*
@@ -53,15 +54,20 @@
wake_up(&sem->wait);
}
-void __down(struct semaphore * sem)
+int __down_wq(struct semaphore * sem, wait_queue_t *wait)
{
struct task_struct *tsk = current;
- DECLARE_WAITQUEUE(wait, tsk);
+ DECLARE_WAITQUEUE(local_wait, tsk);
unsigned long flags;
- tsk->state = TASK_UNINTERRUPTIBLE;
+ if (is_sync_wait(wait))
+ tsk->state = TASK_UNINTERRUPTIBLE;
+ if (!wait) {
+ wait = &local_wait;
+ }
+
spin_lock_irqsave(&sem->wait.lock, flags);
- add_wait_queue_exclusive_locked(&sem->wait, &wait);
+ add_wait_queue_exclusive_locked(&sem->wait, wait);
sem->sleepers++;
for (;;) {
@@ -79,17 +85,23 @@
sem->sleepers = 1; /* us - see -1 above */
spin_unlock_irqrestore(&sem->wait.lock, flags);
+ if (!is_sync_wait(wait))
+ return -EIOCBRETRY;
+
schedule();
spin_lock_irqsave(&sem->wait.lock, flags);
tsk->state = TASK_UNINTERRUPTIBLE;
}
- remove_wait_queue_locked(&sem->wait, &wait);
+ if (is_sync_wait(wait) || !list_empty(&wait->task_list))
+ remove_wait_queue_locked(&sem->wait, wait);
wake_up_locked(&sem->wait);
spin_unlock_irqrestore(&sem->wait.lock, flags);
tsk->state = TASK_RUNNING;
+ return 0;
}
+
int __down_interruptible(struct semaphore * sem)
{
int retval = 0;
@@ -189,19 +201,17 @@
asm(
".text\n"
".align 4\n"
-".globl __down_failed\n"
-"__down_failed:\n\t"
+".globl __down_failed_wq\n"
+"__down_failed_wq:\n\t"
#if defined(CONFIG_FRAME_POINTER)
"pushl %ebp\n\t"
"movl %esp,%ebp\n\t"
#endif
- "pushl %eax\n\t"
"pushl %edx\n\t"
"pushl %ecx\n\t"
- "call __down\n\t"
+ "call __down_wq\n\t"
"popl %ecx\n\t"
"popl %edx\n\t"
- "popl %eax\n\t"
#if defined(CONFIG_FRAME_POINTER)
"movl %ebp,%esp\n\t"
"popl %ebp\n\t"
diff -ur -X /home/kiran/dontdiff linux-2.5.68/include/asm-i386/semaphore.h linux-aio-2568/include/asm-i386/semaphore.h
--- linux-2.5.68/include/asm-i386/semaphore.h Tue Mar 25 03:30:11 2003
+++ linux-aio-2568/include/asm-i386/semaphore.h Wed Apr 16 17:57:05 2003
@@ -96,39 +96,48 @@
sema_init(sem, 0);
}
-asmlinkage void __down_failed(void /* special register calling convention */);
+asmlinkage int __down_failed_wq(void /* special register calling convention */);
asmlinkage int __down_failed_interruptible(void /* params in registers */);
asmlinkage int __down_failed_trylock(void /* params in registers */);
asmlinkage void __up_wakeup(void /* special register calling convention */);
-asmlinkage void __down(struct semaphore * sem);
+asmlinkage int __down_wq(struct semaphore * sem, wait_queue_t *wait);
asmlinkage int __down_interruptible(struct semaphore * sem);
asmlinkage int __down_trylock(struct semaphore * sem);
asmlinkage void __up(struct semaphore * sem);
/*
* This is ugly, but we want the default case to fall through.
- * "__down_failed" is a special asm handler that calls the C
+ * "__down_failed_wq" is a special asm handler that calls the C
* routine that actually waits. See arch/i386/kernel/semaphore.c
*/
-static inline void down(struct semaphore * sem)
+static inline int down_wq(struct semaphore * sem, wait_queue_t *wait)
{
+ int result;
+
#ifdef WAITQUEUE_DEBUG
CHECK_MAGIC(sem->__magic);
#endif
might_sleep();
__asm__ __volatile__(
"# atomic down operation\n\t"
- LOCK "decl %0\n\t" /* --sem->count */
- "js 2f\n"
+ LOCK "decl %1\n\t" /* --sem->count */
+ "js 2f\n\t"
+ "xorl %0,%0\n"
"1:\n"
LOCK_SECTION_START("")
- "2:\tcall __down_failed\n\t"
+ "2:\tcall __down_failed_wq\n\t"
"jmp 1b\n"
LOCK_SECTION_END
- :"=m" (sem->count)
- :"c" (sem)
+ :"=a" (result), "=m" (sem->count)
+ :"c" (sem), "d" (wait)
:"memory");
+ return result;
+}
+
+static inline void down(struct semaphore * sem)
+{
+ down_wq(sem, NULL);
}
/*
--
Suparna Bhattacharya (suparna@in.ibm.com)
Linux Technology Center
IBM Software Labs, India
* [5/7] Filesystem AIO rdwr - use down_wq for aio write
2003-04-24 4:52 Filesystem AIO read-write patches Suparna Bhattacharya
` (3 preceding siblings ...)
2003-04-24 5:16 ` [PATCH 4/7] Filesystem AIO rdwr - async down (x86) Suparna Bhattacharya
@ 2003-04-24 5:19 ` Suparna Bhattacharya
2003-04-24 5:21 ` [PATCH 6/7] Filesystem AIO rdwr - async bread Suparna Bhattacharya
` (2 subsequent siblings)
7 siblings, 0 replies; 11+ messages in thread
From: Suparna Bhattacharya @ 2003-04-24 5:19 UTC (permalink / raw)
To: bcrl, akpm; +Cc: linux-kernel, linux-aio, linux-fsdevel
On Thu, Apr 24, 2003 at 10:22:22AM +0530, Suparna Bhattacharya wrote:
> Here is a revised version of the filesystem AIO patches
> for 2.5.68.
>
> 05aiowrdown_wq.patch : Uses async down for aio write
This will currently compile only for x86, since down_wq
support isn't implemented on other archs.
Regards
Suparna
05aiowrdown_wq.patch
....................
filemap.c | 3 ++-
1 files changed, 2 insertions(+), 1 deletion(-)
diff -ur -X /home/kiran/dontdiff linux-2.5.68/mm/filemap.c linux-aio-2568/mm/filemap.c
--- linux-2.5.68/mm/filemap.c Mon Apr 21 23:30:40 2003
+++ linux-aio-2568/mm/filemap.c Tue Apr 22 00:28:55 2003
@@ -1786,7 +1851,8 @@
BUG_ON(iocb->ki_pos != pos);
- down(&inode->i_sem);
+ if ((err = down_wq(&inode->i_sem, current->io_wait)))
+ return err;
err = generic_file_aio_write_nolock(iocb, &local_iov, 1,
&iocb->ki_pos);
up(&inode->i_sem);
--
Suparna Bhattacharya (suparna@in.ibm.com)
Linux Technology Center
IBM Software Labs, India
* [PATCH 6/7] Filesystem AIO rdwr - async bread
2003-04-24 4:52 Filesystem AIO read-write patches Suparna Bhattacharya
` (4 preceding siblings ...)
2003-04-24 5:19 ` [5/7] Filesystem AIO rdwr - use down_wq for aio write Suparna Bhattacharya
@ 2003-04-24 5:21 ` Suparna Bhattacharya
2003-04-24 5:28 ` [PATCH 7/7] Filesystem AIO rdwr: async get block for ext2 Suparna Bhattacharya
2003-04-24 13:13 ` Filesystem AIO read-write patches Benjamin LaHaise
7 siblings, 0 replies; 11+ messages in thread
From: Suparna Bhattacharya @ 2003-04-24 5:21 UTC (permalink / raw)
To: bcrl, akpm; +Cc: linux-kernel, linux-aio, linux-fsdevel
On Thu, Apr 24, 2003 at 10:22:22AM +0530, Suparna Bhattacharya wrote:
> Here is a revised version of the filesystem AIO patches
> for 2.5.68.
>
> 06bread_wq.patch : Async bread implementation
>
Pre-req for the filesystem specific async get block
patches.
Regards
Suparna
06bread_wq.patch
................
fs/buffer.c | 54 +++++++++++++++++++++++++++++++++++++-------
include/linux/buffer_head.h | 30 ++++++++++++++++++++++--
2 files changed, 73 insertions(+), 11 deletions(-)
diff -ur -X /home/kiran/dontdiff linux-2.5.68/fs/buffer.c linux-aio-2568/fs/buffer.c
--- linux-2.5.68/fs/buffer.c Mon Apr 21 23:30:33 2003
+++ linux-aio-2568/fs/buffer.c Mon Apr 21 17:26:58 2003
@@ -118,23 +118,35 @@
* from becoming locked again - you have to lock it yourself
* if you want to preserve its state.
*/
-void __wait_on_buffer(struct buffer_head * bh)
+int __wait_on_buffer_wq(struct buffer_head * bh, wait_queue_t *wait)
{
wait_queue_head_t *wqh = bh_waitq_head(bh);
- DEFINE_WAIT(wait);
+ DEFINE_WAIT(local_wait);
+
+ if (!wait)
+ wait = &local_wait;
if (atomic_read(&bh->b_count) == 0 &&
(!bh->b_page || !PageLocked(bh->b_page)))
buffer_error();
do {
- prepare_to_wait(wqh, &wait, TASK_UNINTERRUPTIBLE);
+ prepare_to_wait(wqh, wait, TASK_UNINTERRUPTIBLE);
if (buffer_locked(bh)) {
blk_run_queues();
+ if (!is_sync_wait(wait)) {
+ return -EIOCBRETRY;
+ }
io_schedule();
}
} while (buffer_locked(bh));
- finish_wait(wqh, &wait);
+ finish_wait(wqh, wait);
+ return 0;
+}
+
+void __wait_on_buffer(struct buffer_head * bh)
+{
+ __wait_on_buffer_wq(bh, NULL);
}
static void
@@ -1190,9 +1202,12 @@
__brelse(bh);
}
-static struct buffer_head *__bread_slow(struct buffer_head *bh)
+static struct buffer_head *__bread_slow_wq(struct buffer_head *bh,
+ wait_queue_t *wait)
{
- lock_buffer(bh);
+ if (-EIOCBRETRY == lock_buffer_wq(bh, wait))
+ return ERR_PTR(-EIOCBRETRY);
+
if (buffer_uptodate(bh)) {
unlock_buffer(bh);
return bh;
@@ -1202,7 +1217,8 @@
get_bh(bh);
bh->b_end_io = end_buffer_io_sync;
submit_bh(READ, bh);
- wait_on_buffer(bh);
+ if (-EIOCBRETRY == wait_on_buffer_wq(bh, wait))
+ return ERR_PTR(-EIOCBRETRY);
if (buffer_uptodate(bh))
return bh;
}
@@ -1210,6 +1226,11 @@
return NULL;
}
+static inline struct buffer_head *__bread_slow(struct buffer_head *bh)
+{
+ return __bread_slow_wq(bh, NULL);
+}
+
/*
* Per-cpu buffer LRU implementation. To reduce the cost of __find_get_block().
* The bhs[] array is sorted - newest buffer is at bhs[0]. Buffers have their
@@ -1384,6 +1405,18 @@
bh = __bread_slow(bh);
return bh;
}
+
+
+struct buffer_head *
+__bread_wq(struct block_device *bdev, sector_t block, int size,
+ wait_queue_t *wait)
+{
+ struct buffer_head *bh = __getblk(bdev, block, size);
+
+ if (!buffer_uptodate(bh))
+ bh = __bread_slow_wq(bh, wait);
+ return bh;
+}
EXPORT_SYMBOL(__bread);
/*
@@ -1822,8 +1855,11 @@
clear_buffer_new(bh);
if (!buffer_mapped(bh)) {
err = get_block(inode, block, bh, 1);
- if (err)
+ if (err) {
+ if (-EIOCBRETRY == err)
+ pr_debug("get_block queued\n");
goto out;
+ }
if (buffer_new(bh)) {
clear_buffer_new(bh);
unmap_underlying_metadata(bh->b_bdev,
@@ -1865,6 +1901,8 @@
* If we issued read requests - let them complete.
*/
while(wait_bh > wait) {
+ if (!is_sync_wait(current->io_wait))
+ printk("block_prepare_write: wait on buffer\n");
wait_on_buffer(*--wait_bh);
if (!buffer_uptodate(*wait_bh))
return -EIO;
diff -ur -X /home/kiran/dontdiff linux-2.5.68/include/linux/buffer_head.h linux-aio-2568/include/linux/buffer_head.h
--- linux-2.5.68/include/linux/buffer_head.h Fri Apr 11 21:10:51 2003
+++ linux-aio-2568/include/linux/buffer_head.h Wed Apr 16 16:31:12 2003
@@ -162,6 +162,7 @@
void __invalidate_buffers(kdev_t dev, int);
int sync_blockdev(struct block_device *bdev);
void __wait_on_buffer(struct buffer_head *);
+int __wait_on_buffer_wq(struct buffer_head *, wait_queue_t *wait);
wait_queue_head_t *bh_waitq_head(struct buffer_head *bh);
void wake_up_buffer(struct buffer_head *bh);
int fsync_bdev(struct block_device *);
@@ -172,6 +173,8 @@
void __brelse(struct buffer_head *);
void __bforget(struct buffer_head *);
struct buffer_head *__bread(struct block_device *, sector_t block, int size);
+struct buffer_head *__bread_wq(struct block_device *, sector_t block,
+ int size, wait_queue_t *wait);
struct buffer_head *alloc_buffer_head(void);
void free_buffer_head(struct buffer_head * bh);
void FASTCALL(unlock_buffer(struct buffer_head *bh));
@@ -229,13 +232,13 @@
static inline void brelse(struct buffer_head *bh)
{
- if (bh)
+ if (bh && !IS_ERR(bh))
__brelse(bh);
}
static inline void bforget(struct buffer_head *bh)
{
- if (bh)
+ if (bh && !IS_ERR(bh))
__bforget(bh);
}
@@ -246,7 +249,12 @@
}
static inline struct buffer_head *
-sb_getblk(struct super_block *sb, sector_t block)
+sb_bread_wq(struct super_block *sb, sector_t block, wait_queue_t *wait)
+{
+ return __bread_wq(sb->s_bdev, block, sb->s_blocksize, wait);
+}
+
+static inline struct buffer_head *sb_getblk(struct super_block *sb, sector_t block)
{
return __getblk(sb->s_bdev, block, sb->s_blocksize);
}
@@ -276,10 +284,26 @@
__wait_on_buffer(bh);
}
+static inline int wait_on_buffer_wq(struct buffer_head *bh, wait_queue_t *wait)
+{
+ if (buffer_locked(bh))
+ return __wait_on_buffer_wq(bh, wait);
+
+ return 0;
+}
+
static inline void lock_buffer(struct buffer_head *bh)
{
while (test_set_buffer_locked(bh))
__wait_on_buffer(bh);
}
+static inline int lock_buffer_wq(struct buffer_head *bh, wait_queue_t *wait)
+{
+ if (test_set_buffer_locked(bh))
+ return __wait_on_buffer_wq(bh, wait);
+
+ return 0;
+}
+
#endif /* _LINUX_BUFFER_HEAD_H */
--
Suparna Bhattacharya (suparna@in.ibm.com)
Linux Technology Center
IBM Software Labs, India
* [PATCH 7/7] Filesystem AIO rdwr: async get block for ext2
2003-04-24 4:52 Filesystem AIO read-write patches Suparna Bhattacharya
` (5 preceding siblings ...)
2003-04-24 5:21 ` [PATCH 6/7] Filesystem AIO rdwr - async bread Suparna Bhattacharya
@ 2003-04-24 5:28 ` Suparna Bhattacharya
2003-04-24 13:13 ` Filesystem AIO read-write patches Benjamin LaHaise
7 siblings, 0 replies; 11+ messages in thread
From: Suparna Bhattacharya @ 2003-04-24 5:28 UTC (permalink / raw)
To: bcrl, akpm; +Cc: linux-kernel, linux-aio, linux-fsdevel
On Thu, Apr 24, 2003 at 10:22:22AM +0530, Suparna Bhattacharya wrote:
> Here is a revised version of the filesystem AIO patches
> for 2.5.68.
>
> 07ext2getblk_wq.patch : Async get block support for
> the ext2 filesystem
ext2_get_block_wq() is used only by ext2_prepare_write().
All other cases still use the synchronous get block
version.
Regards
Suparna
--
Suparna Bhattacharya (suparna@in.ibm.com)
Linux Technology Center
IBM Software Labs, India
07ext2getblk_wq.patch
.....................
fs/ext2/inode.c | 44 +++++++++++++++++++++++++++++++++++---------
1 files changed, 35 insertions(+), 9 deletions(-)
diff -ur -X /home/kiran/dontdiff linux-2.5.68/fs/ext2/inode.c linux-aio-2568/fs/ext2/inode.c
--- linux-2.5.68/fs/ext2/inode.c Fri Apr 11 21:10:49 2003
+++ linux-aio-2568/fs/ext2/inode.c Wed Apr 16 22:48:49 2003
@@ -257,11 +258,12 @@
* or when it reads all @depth-1 indirect blocks successfully and finds
* the whole chain, all way to the data (returns %NULL, *err == 0).
*/
-static Indirect *ext2_get_branch(struct inode *inode,
+static Indirect *ext2_get_branch_wq(struct inode *inode,
int depth,
int *offsets,
Indirect chain[4],
- int *err)
+ int *err,
+ wait_queue_t *wait)
{
struct super_block *sb = inode->i_sb;
Indirect *p = chain;
@@ -273,8 +275,8 @@
if (!p->key)
goto no_block;
while (--depth) {
- bh = sb_bread(sb, le32_to_cpu(p->key));
- if (!bh)
+ bh = sb_bread_wq(sb, le32_to_cpu(p->key), wait);
+ if (!bh || IS_ERR(bh))
goto failure;
read_lock(&EXT2_I(inode)->i_meta_lock);
if (!verify_chain(chain, p))
@@ -292,11 +294,21 @@
*err = -EAGAIN;
goto no_block;
failure:
- *err = -EIO;
+ *err = IS_ERR(bh) ? PTR_ERR(bh) : -EIO;
no_block:
return p;
}
+static Indirect *ext2_get_branch(struct inode *inode,
+ int depth,
+ int *offsets,
+ Indirect chain[4],
+ int *err)
+{
+ return ext2_get_branch_wq(inode, depth, offsets, chain,
+ err, NULL);
+}
+
/**
* ext2_find_near - find a place for allocation with sufficient locality
* @inode: owner
@@ -536,7 +548,8 @@
* reachable from inode.
*/
-static int ext2_get_block(struct inode *inode, sector_t iblock, struct buffer_head *bh_result, int create)
+static int ext2_get_block_wq(struct inode *inode, sector_t iblock,
+ struct buffer_head *bh_result, int create, wait_queue_t *wait)
{
int err = -EIO;
int offsets[4];
@@ -551,7 +564,8 @@
goto out;
reread:
- partial = ext2_get_branch(inode, depth, offsets, chain, &err);
+ partial = ext2_get_branch_wq(inode, depth, offsets, chain, &err,
+ wait);
/* Simplest case - block found, no allocation needed */
if (!partial) {
@@ -565,7 +579,7 @@
}
/* Next simple case - plain lookup or failed read of indirect block */
- if (!create || err == -EIO) {
+ if (!create || err == -EIO || err == -EIOCBRETRY) {
cleanup:
while (partial > chain) {
brelse(partial->bh);
@@ -606,6 +620,19 @@
goto reread;
}
+static int ext2_get_block_async(struct inode *inode, sector_t iblock,
+ struct buffer_head *bh_result, int create)
+{
+ return ext2_get_block_wq(inode, iblock, bh_result, create,
+ current->io_wait);
+}
+
+static int ext2_get_block(struct inode *inode, sector_t iblock,
+ struct buffer_head *bh_result, int create)
+{
+ return ext2_get_block_wq(inode, iblock, bh_result, create, NULL);
+}
+
static int ext2_writepage(struct page *page, struct writeback_control *wbc)
{
return block_write_full_page(page, ext2_get_block, wbc);
@@ -627,7 +654,7 @@
ext2_prepare_write(struct file *file, struct page *page,
unsigned from, unsigned to)
{
- return block_prepare_write(page,from,to,ext2_get_block);
+ return block_prepare_write(page,from,to,ext2_get_block_async);
}
static int
* Re: Filesystem AIO read-write patches
2003-04-24 5:11 ` Filesystem AIO read-write patches Suparna Bhattacharya
@ 2003-04-24 12:48 ` Matthew Wilcox
0 siblings, 0 replies; 11+ messages in thread
From: Matthew Wilcox @ 2003-04-24 12:48 UTC (permalink / raw)
To: Suparna Bhattacharya; +Cc: bcrl, akpm, linux-kernel, linux-aio, linux-fsdevel
On Thu, Apr 24, 2003 at 10:41:03AM +0530, Suparna Bhattacharya wrote:
> -void blk_congestion_wait(int rw, long timeout)
> +int blk_congestion_wait_wq(int rw, long timeout, wait_queue_t *wait)
> {
> - DEFINE_WAIT(wait);
> wait_queue_head_t *wqh = &congestion_wqh[rw];
> + DEFINE_WAIT(local_wait);
> +
> + if (!wait)
> + wait = &local_wait;
>
> blk_run_queues();
> - prepare_to_wait(wqh, &wait, TASK_UNINTERRUPTIBLE);
> + prepare_to_wait(wqh, wait, TASK_UNINTERRUPTIBLE);
> + if (!is_sync_wait(wait))
> + return -EIOCBRETRY;
> +
> io_schedule_timeout(timeout);
> - finish_wait(wqh, &wait);
> + finish_wait(wqh, wait);
> + return 0;
> +}
> +
> +void blk_congestion_wait(int rw, long timeout)
> +{
> + blk_congestion_wait_wq(rw, timeout, NULL);
> }
I don't like this one. DEFINE_WAIT allocates stuff on the stack.
And aio seems to allocate a lot more on the stack than normal io does.
How about the following instead:
int blk_congestion_wait_wq(int rw, long timeout, wait_queue_t *wait)
{
wait_queue_head_t *wqh = &congestion_wqh[rw];
blk_run_queues();
prepare_to_wait(wqh, wait, TASK_UNINTERRUPTIBLE);
if (!is_sync_wait(wait))
return -EIOCBRETRY;
io_schedule_timeout(timeout);
finish_wait(wqh, wait);
return 0;
}
void blk_congestion_wait(int rw, long timeout)
{
DEFINE_WAIT(wait);
blk_congestion_wait_wq(rw, timeout, &wait);
}
--
"It's not Hollywood. War is real, war is primarily not about defeat or
victory, it is about death. I've seen thousands and thousands of dead bodies.
Do you think I want to have an academic debate on this subject?" -- Robert Fisk
* Re: Filesystem AIO read-write patches
2003-04-24 4:52 Filesystem AIO read-write patches Suparna Bhattacharya
` (6 preceding siblings ...)
2003-04-24 5:28 ` [PATCH 7/7] Filesystem AIO rdwr: async get block for ext2 Suparna Bhattacharya
@ 2003-04-24 13:13 ` Benjamin LaHaise
2003-04-24 14:27 ` Suparna Bhattacharya
7 siblings, 1 reply; 11+ messages in thread
From: Benjamin LaHaise @ 2003-04-24 13:13 UTC (permalink / raw)
To: Suparna Bhattacharya; +Cc: akpm, linux-kernel, linux-aio, linux-fsdevel
On Thu, Apr 24, 2003 at 10:22:22AM +0530, Suparna Bhattacharya wrote:
> The task->io_wait field reflects the wait context in which
> a task is executing its i/o operations. For synchronous i/o
> task->io_wait is NULL, and the wait context is local on
> stack; for threads doing io submits or retries on behalf
> of async i/o callers, tsk->io_wait is the wait queue
> function entry to be notified on completion of a condition
> required for i/o to progress.
This is seriously wrong. Changing the behaviour of functions to depend on
a hidden parameter is bound to lead to coding errors. It is far better to
make the parameter explicit rather than implicit.
-ben
* Re: Filesystem AIO read-write patches
2003-04-24 13:13 ` Filesystem AIO read-write patches Benjamin LaHaise
@ 2003-04-24 14:27 ` Suparna Bhattacharya
0 siblings, 0 replies; 11+ messages in thread
From: Suparna Bhattacharya @ 2003-04-24 14:27 UTC (permalink / raw)
To: Benjamin LaHaise; +Cc: akpm, linux-kernel, linux-aio, linux-fsdevel
On Thu, Apr 24, 2003 at 09:13:12AM -0400, Benjamin LaHaise wrote:
> On Thu, Apr 24, 2003 at 10:22:22AM +0530, Suparna Bhattacharya wrote:
> > The task->io_wait field reflects the wait context in which
> > a task is executing its i/o operations. For synchronous i/o
> > task->io_wait is NULL, and the wait context is local on
> > stack; for threads doing io submits or retries on behalf
> > of async i/o callers, tsk->io_wait is the wait queue
> > function entry to be notified on completion of a condition
> > required for i/o to progress.
>
> This is seriously wrong. Changing the behaviour of functions to depend on
> a hidden parameter is bound to lead to coding errors. It is far better to
> make the parameter explicit rather than implicit.
I really did think this over for several days before
putting out the new iteration. And I'm still in two minds
about this.
To me it seems like part of the context of execution rather
than a hidden parameter as you see it, much like an address
space (the way you do a use_mm() for retries) or credentials
for that matter (other operating systems for example pass a
cred structure to vfs routines, but Linux doesn't). And
because wait is typically associated with a task context
anyway (in the synchronous world).
What determines whether something should be picked up as
part of a context or passed as a parameter?
It should be a parameter if different values may be passed
in within the same context (e.g. routines which can modify
a different address space would take an mm parameter).
That's why all the _wq routines take a wait queue parameter,
and only routines which really are meant to be context
sensitive reference tsk->io_wait. That's why I got rid of
do_sync_op(), but still kept tsk->io_wait ...
The only thing that nags me is that the context of
an io operation is not a very long-lived one.
That said, the main problem in implementing what you suggest
is that we'll need to change at least the following APIs
right away for all filesystems:
- aops->prepare_write()
- fs specific get block routines
to take a wait queue parameter (I don't say iocb
here, because these are per page or per block routines,
which don't use the iocb information)
If there is a general agreement that we should do this for
2.5, then sure, we can go ahead. But at least I need to
hear a "yes, it's worth the pain, and yes, it's the right
thing to do for 2.5" from affected people.
The more asynchrony we decide to add, the more such api
changes would be needed.
So if the decision is to go ahead and break all filesystems
anyway, can we decide how far we should go with these and
have the API changes made sooner rather than later?
Should we be adding a wait queue parameter to more
address space operations?
Regards
Suparna
--
Suparna Bhattacharya (suparna@in.ibm.com)
Linux Technology Center
IBM Software Labs, India