* [RFC][Patch] Retry based aio read for filesystems
@ 2003-03-05 9:17 Suparna Bhattacharya
2003-03-05 9:26 ` [Patch 1/2] Retry based aio read - core aio changes Suparna Bhattacharya
` (2 more replies)
0 siblings, 3 replies; 15+ messages in thread
From: Suparna Bhattacharya @ 2003-03-05 9:17 UTC (permalink / raw)
To: bcrl, akpm; +Cc: linux-aio, linux-kernel
For the last few days I've been playing with prototyping
a particular flavour of a retry based implementation for
filesystem aio read.
It is rather experimental at this point, with only
limited bits of testing so far. [There are some potential
races which seem to show up in the case of large reads
requiring faulting-in of the user-space buffer. I've
chosen to leave some printks on for now.]
Am posting these initial patches as responses to this
mail for early comments/feedback on the approach and
to improve chances of unearthing any potential
gotchas sooner rather than later :)
A few alternative variations are possible around the
basic retry-driven theme that Ben LaHaise has designed -
each with their pros and cons, so I am hoping that sharing
experiences on what we're trying out may be a step
towards some insights in determining what works best.
It's in two parts:
aioretry.patch : Core aio infrastructure modifications
aioread.patch : Modifications for aio read in
particular (generic_file_aio_read)
I was working with 2.5.62, and haven't yet tried
upgrading to 2.5.64.
The idea was to :
- Keep things as simple as possible with minimal changes
to the current filesystem read/write paths (which I
guess we don't want to disturb too much at this stage).
The way to do this is to make the base aio infrastructure
handle most of the async (retry) complexities.
- Take incremental steps towards making an operation async
- start with blocking->async conversions for the major
blocking points.
- Keep in mind Dave Miller's concern about effect of aio
additions (extra parameters) on sync i/o behaviour. So
retaining sync i/o performance as far as possible gets
precedence over aio performance and latency right now.
- Take incremental steps towards tuning aio performance,
and optimizing specific paths for aio.
A retry-based model implies that each time we make as
much progress as possible in a non-blocking manner,
and then defer a restart of the operation from where we
left off, at the next opportunity ... and so on, until
we finish. To make sure that a next opportunity does
indeed arise, each time we do _more_ than we would do
for simple non-blocking i/o -- we initiate some steps
towards progress without actually waiting for them, and
then mark ourselves to be notified when enough is
ready to try the next restart-from-where-we-left-off
attempt.
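As a rough illustration of the bookkeeping involved, here is a
user-space simulation (not the kernel code; the names and the
error value are made up for the sketch). Each retry copies
whatever is "ready", advances the saved position, and asks to be
retried until the request completes:

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define SIM_EIOCBQUEUED (-529)  /* stand-in for the kernel's -EIOCBQUEUED */

/* Hypothetical iocb-like state remembered across retries */
struct sim_iocb {
	char   *buf;     /* where the next retry continues copying */
	size_t  left;    /* bytes still to transfer */
	size_t  nbytes;  /* original request size */
};

/*
 * One retry step: make as much progress as possible without
 * blocking (here we pretend at most 4 bytes are "ready" per pass),
 * then either report completion or ask to be retried later.
 */
static long sim_retry(struct sim_iocb *io, const char *src)
{
	size_t chunk = io->left < 4 ? io->left : 4;

	memcpy(io->buf, src + (io->nbytes - io->left), chunk);
	io->buf  += chunk;
	io->left -= chunk;
	return io->left ? SIM_EIOCBQUEUED : (long)io->nbytes;
}
```

In the actual patch the equivalent bookkeeping (ki_buf, ki_left,
ki_nbytes) is done by the high-level aio code, and the next retry
is triggered by a wakeup rather than a loop.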
3 decisions characterise this particular variation:
1. At what level to retry ?
This patch does it at the highest level (fops level),
i.e. high level aio code retries the generic read
operation passing in the remaining parts of the buffer
to be read.
No retries happen in the sync case at the moment,
though the option exists.
2. What kind of state needs to be saved across retries
and who maintains that ?
The high level aio code keeps adjusting the parameters
to the read as retries progress (this is done by the
routine calling the fops)
There is, however, room for optimizations when
low-level paths can be modified to reuse state
across aio-retries.
3. When to trigger a retry ?
This is the tricky part. This variation uses async
wait queue functions instead of the blocking wait, to
trigger a retry (kick_iocb) when data becomes available.
So the synchronous lock_page, wait_on_page_bit have
their async variations which do an async wait and
return -EIOCBQUEUED (which is propagated up).
(BTW, the races that I'm running into are mostly
related to this area - avoiding simultaneous retries,
dealing with completion while retry is going on
etc. We need to audit the code and think about this
more closely. I wanted to keep the async wakeup
as simple as possible ... and deal with the races
in some other way)
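The wakeup path can be pictured with a minimal user-space
simulation (hypothetical names; the real code uses wait_queue_t,
container_of() and kick_iocb()). Instead of putting a task to
sleep, the wait entry's function runs at wakeup time and marks the
iocb kicked so the retry machinery can requeue it:

```c
#include <assert.h>

/* A wait entry whose "wakeup" is a callback, not a task to schedule */
struct sim_wait {
	int (*func)(struct sim_wait *w);
};

struct sim_kiocb {
	struct sim_wait wait;  /* first member, so the cast below works */
	int             kicked;
};

/* Analogue of an aio wake function: called when the page becomes ready */
static int sim_wake_function(struct sim_wait *w)
{
	/* poor man's container_of(): wait is the first member */
	struct sim_kiocb *iocb = (struct sim_kiocb *)w;

	iocb->kicked = 1;  /* the real code would call kick_iocb() here */
	return 1;
}

/* Analogue of wake_up() on the page's wait queue */
static void sim_wake_up(struct sim_wait *w)
{
	w->func(w);
}
```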
Ben,
Is this anywhere close to what you had in mind, or
have played with, or were you aiming for retries
at a different (lower) level ? How's your experience
been with other variations that you've tried ?
Would be interesting to take a look and compare notes,
if possible.
Any chance you'll be putting something out ?
I guess the retry state you maintain may be a little more
specific to the target/type of aio, or is implemented in
a way that lets state already updated by lower-level
functions be reused/carried over directly, rather than
recomputed by the aio code ... - does that work out to be
more optimal ? Or were you working on a different technique
altogether ?
There are a couple of fixes that I made in the aio code
as part of the patch.
- The kiocbClearXXX were doing a set_bit instead of
clear_bit
- Sync iocbs were not woken up when iocb->ki_users = 1
(dio takes a different path for sync and async iocbs,
so maybe that's why we weren't seeing the problem yet)
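For reference, the first bug pattern can be illustrated with plain
bit operations (these macros are simplified stand-ins, not the
kernel's atomic set_bit/clear_bit): a "Clear" macro defined with a
set operation can never clear its flag.

```c
#include <assert.h>

#define KIF_KICKED 1

/* Buggy: the "Clear" macro actually sets the bit */
#define buggyClearKicked(flags)  (*(flags) |=  (1UL << KIF_KICKED))
/* Fixed: really clears it */
#define kiocbClearKicked(flags)  (*(flags) &= ~(1UL << KIF_KICKED))
#define kiocbIsKicked(flags)     (*(flags) &   (1UL << KIF_KICKED))
```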
Hope that's okay.
Regards
Suparna
--
Suparna Bhattacharya (suparna@in.ibm.com)
Linux Technology Center
IBM Software Labs, India
^ permalink raw reply [flat|nested] 15+ messages in thread
* Re: [Patch 1/2] Retry based aio read - core aio changes
2003-03-05 9:17 [RFC][Patch] Retry based aio read for filesystems Suparna Bhattacharya
@ 2003-03-05 9:26 ` Suparna Bhattacharya
2003-03-14 13:23 ` Suparna Bhattacharya
2003-03-05 9:30 ` [Patch 2/2] Retry based aio read - filesystem read changes Suparna Bhattacharya
2003-03-05 23:00 ` [RFC][Patch] Retry based aio read for filesystems Janet Morgan
2 siblings, 1 reply; 15+ messages in thread
From: Suparna Bhattacharya @ 2003-03-05 9:26 UTC (permalink / raw)
To: bcrl, akpm; +Cc: linux-aio, linux-kernel
On Wed, Mar 05, 2003 at 02:47:54PM +0530, Suparna Bhattacharya wrote:
> For the last few days I've been playing with prototyping
> a particular flavour of a retry based implementation for
> filesystem aio read.
# aioretry.patch : Core aio infrastructure modifications
# for high-level retry based aio
Includes a couple of fixes in the aio code
-The kiocbClearXXX were doing a set_bit instead of
clear_bit
-Sync iocbs were not woken up when iocb->ki_users = 1
(dio takes a different path for sync and async iocbs,
so maybe that's why we weren't seeing the problem yet)
Regards
Suparna
----------------------------------------------------
diff -ur linux-2.5.62/fs/aio.c linux-2.5.62-aio/fs/aio.c
--- linux-2.5.62/fs/aio.c Tue Feb 18 04:26:14 2003
+++ linux-2.5.62-aio/fs/aio.c Tue Mar 4 19:54:24 2003
@@ -314,14 +314,15 @@
*/
ssize_t wait_on_sync_kiocb(struct kiocb *iocb)
{
- while (iocb->ki_users) {
+ printk("wait_on_sync_iocb\n");
+ while ((iocb->ki_users) && !kiocbIsKicked(iocb)) {
set_current_state(TASK_UNINTERRUPTIBLE);
- if (!iocb->ki_users)
+ if (!iocb->ki_users || kiocbIsKicked(iocb))
break;
schedule();
}
__set_current_state(TASK_RUNNING);
- return iocb->ki_user_data;
+ return iocb->ki_users ? -EIOCBQUEUED : iocb->ki_user_data;
}
/* exit_aio: called when the last user of mm goes away. At this point,
@@ -395,6 +396,7 @@
req->ki_cancel = NULL;
req->ki_retry = NULL;
req->ki_user_obj = NULL;
+ INIT_LIST_HEAD(&req->ki_run_list);
/* Check if the completion queue has enough free space to
* accept an event from this io.
@@ -558,15 +560,20 @@
enter_lazy_tlb(mm, current, smp_processor_id());
}
-/* Run on kevent's context. FIXME: needs to be per-cpu and warn if an
- * operation blocks.
- */
-static void aio_kick_handler(void *data)
+static inline int __queue_kicked_iocb(struct kiocb *iocb)
{
- struct kioctx *ctx = data;
+ struct kioctx *ctx = iocb->ki_ctx;
- use_mm(ctx->mm);
+ if (list_empty(&iocb->ki_run_list)) {
+ list_add_tail(&iocb->ki_run_list,
+ &ctx->run_list);
+ return 1;
+ }
+ return 0;
+}
+static void aio_run_iocbs(struct kioctx *ctx)
+{
spin_lock_irq(&ctx->ctx_lock);
while (!list_empty(&ctx->run_list)) {
struct kiocb *iocb;
@@ -574,30 +581,67 @@
iocb = list_entry(ctx->run_list.next, struct kiocb,
ki_run_list);
- list_del(&iocb->ki_run_list);
+ list_del_init(&iocb->ki_run_list);
iocb->ki_users ++;
- spin_unlock_irq(&ctx->ctx_lock);
- kiocbClearKicked(iocb);
- ret = iocb->ki_retry(iocb);
- if (-EIOCBQUEUED != ret) {
- aio_complete(iocb, ret, 0);
- iocb = NULL;
+ if (!kiocbTryStart(iocb)) {
+ kiocbClearKicked(iocb);
+ spin_unlock_irq(&ctx->ctx_lock);
+ ret = iocb->ki_retry(iocb);
+ if (-EIOCBQUEUED != ret) {
+ if (list_empty(&iocb->ki_wait.task_list))
+ aio_complete(iocb, ret, 0);
+ else
+ printk("can't delete iocb in use\n");
+ }
+ spin_lock_irq(&ctx->ctx_lock);
+ kiocbClearStarted(iocb);
+ if (kiocbIsKicked(iocb))
+ __queue_kicked_iocb(iocb);
+ } else {
+ printk("iocb already started\n");
}
-
- spin_lock_irq(&ctx->ctx_lock);
if (NULL != iocb)
__aio_put_req(ctx, iocb);
}
spin_unlock_irq(&ctx->ctx_lock);
+}
+
+/* Run on aiod/kevent's context. FIXME: needs to be per-cpu and warn if an
+ * operation blocks.
+ */
+static void aio_kick_handler(void *data)
+{
+ struct kioctx *ctx = data;
+
+ use_mm(ctx->mm);
+ aio_run_iocbs(ctx);
unuse_mm(ctx->mm);
}
-void kick_iocb(struct kiocb *iocb)
+
+void queue_kicked_iocb(struct kiocb *iocb)
{
struct kioctx *ctx = iocb->ki_ctx;
+ unsigned long flags;
+ int run = 0;
+ WARN_ON((!list_empty(&iocb->ki_wait.task_list)));
+
+ spin_lock_irqsave(&ctx->ctx_lock, flags);
+ run = __queue_kicked_iocb(iocb);
+ spin_unlock_irqrestore(&ctx->ctx_lock, flags);
+ if (run) {
+ if (waitqueue_active(&ctx->wait))
+ wake_up(&ctx->wait);
+ else
+ queue_work(aio_wq, &ctx->wq);
+ }
+}
+
+void kick_iocb(struct kiocb *iocb)
+{
/* sync iocbs are easy: they can only ever be executing from a
* single context. */
if (is_sync_kiocb(iocb)) {
@@ -606,12 +650,11 @@
return;
}
- if (!kiocbTryKick(iocb)) {
- unsigned long flags;
- spin_lock_irqsave(&ctx->ctx_lock, flags);
- list_add_tail(&iocb->ki_run_list, &ctx->run_list);
- spin_unlock_irqrestore(&ctx->ctx_lock, flags);
- schedule_work(&ctx->wq);
+
+ if (!kiocbTryKick(iocb) && !kiocbIsStarted(iocb)) {
+ queue_kicked_iocb(iocb);
+ } else {
+ pr_debug("iocb already kicked or in progress\n");
}
}
@@ -642,13 +685,13 @@
iocb->ki_user_data = res;
if (iocb->ki_users == 1) {
iocb->ki_users = 0;
- return 1;
+ ret = 1;
+ } else {
+ spin_lock_irq(&ctx->ctx_lock);
+ iocb->ki_users--;
+ ret = (0 == iocb->ki_users);
+ spin_unlock_irq(&ctx->ctx_lock);
}
- spin_lock_irq(&ctx->ctx_lock);
- iocb->ki_users--;
- ret = (0 == iocb->ki_users);
- spin_unlock_irq(&ctx->ctx_lock);
-
/* sync iocbs put the task here for us */
wake_up_process(iocb->ki_user_obj);
return ret;
@@ -664,6 +707,9 @@
*/
spin_lock_irqsave(&ctx->ctx_lock, flags);
+ if (!list_empty(&iocb->ki_run_list))
+ list_del_init(&iocb->ki_run_list);
+
ring = kmap_atomic(info->ring_pages[0], KM_IRQ1);
tail = info->tail;
@@ -865,6 +911,8 @@
ret = 0;
if (to.timed_out) /* Only check after read evt */
break;
+ /* accelerate kicked iocbs for this ctx */
+ aio_run_iocbs(ctx);
schedule();
if (signal_pending(tsk)) {
ret = -EINTR;
@@ -984,6 +1032,114 @@
return -EINVAL;
}
+/* Called during initial submission and subsequent retry operations */
+long aio_process_iocb(struct kiocb *iocb)
+{
+ struct file *file = iocb->ki_filp;
+ ssize_t ret = 0;
+
+ if (iocb->ki_retried++ > 1024*1024) {
+ printk("Maximal retry count. Bytes done %d\n",
+ iocb->ki_nbytes - iocb->ki_left);
+ return -EAGAIN;
+ }
+
+ if (!(iocb->ki_retried & 0xff)) {
+ printk("%ld aio retries completed %d bytes of %d\n",
+ iocb->ki_retried,
+ iocb->ki_nbytes - iocb->ki_left, iocb->ki_nbytes);
+ }
+
+ BUG_ON(current->iocb != NULL);
+
+ current->iocb = iocb;
+
+ switch (iocb->ki_opcode) {
+ case IOCB_CMD_PREAD:
+ ret = -EBADF;
+ if (unlikely(!(file->f_mode & FMODE_READ)))
+ goto out;
+ ret = -EFAULT;
+ if (unlikely(!access_ok(VERIFY_WRITE, iocb->ki_buf,
+ iocb->ki_left)))
+ goto out;
+ ret = -EINVAL;
+ if (file->f_op->aio_read)
+ ret = file->f_op->aio_read(iocb, iocb->ki_buf,
+ iocb->ki_left, iocb->ki_pos);
+ break;
+ case IOCB_CMD_PWRITE:
+ ret = -EBADF;
+ if (unlikely(!(file->f_mode & FMODE_WRITE)))
+ goto out;
+ ret = -EFAULT;
+ if (unlikely(!access_ok(VERIFY_READ, iocb->ki_buf,
+ iocb->ki_left)))
+ goto out;
+ ret = -EINVAL;
+ if (file->f_op->aio_write)
+ ret = file->f_op->aio_write(iocb, iocb->ki_buf,
+ iocb->ki_left, iocb->ki_pos);
+ break;
+ case IOCB_CMD_FDSYNC:
+ ret = -EINVAL;
+ if (file->f_op->aio_fsync)
+ ret = file->f_op->aio_fsync(iocb, 1);
+ break;
+ case IOCB_CMD_FSYNC:
+ ret = -EINVAL;
+ if (file->f_op->aio_fsync)
+ ret = file->f_op->aio_fsync(iocb, 0);
+ break;
+ default:
+ dprintk("EINVAL: io_submit: no operation provided\n");
+ ret = -EINVAL;
+ }
+
+ pr_debug("aio_process_iocb: fop ret %d\n", ret);
+ if (likely(-EIOCBQUEUED == ret)) {
+ blk_run_queues();
+ goto out;
+ }
+ if (ret > 0) {
+ iocb->ki_buf += ret;
+ iocb->ki_left -= ret;
+
+ /* Not done yet or a short read, or.. */
+ if (iocb->ki_left
+ /* may have copied out data but not completed writing */
+ || ((iocb->ki_left == 0) &&
+ (iocb->ki_opcode == IOCB_CMD_PWRITE))) {
+ /* FIXME:Can we find a better way to handle this ? */
+ /* Force an extra retry to determine if we're done */
+ ret = -EIOCBQUEUED;
+ goto out;
+ }
+
+ }
+
+ if (ret >= 0)
+ ret = iocb->ki_nbytes - iocb->ki_left;
+
+out:
+ pr_debug("ki_pos = %llu\n", iocb->ki_pos);
+ current->iocb = NULL;
+ if ((-EIOCBQUEUED == ret) && list_empty(&iocb->ki_wait.task_list)) {
+ kiocbSetKicked(iocb);
+ }
+
+ return ret;
+}
+
+int aio_wake_function(wait_queue_t *wait, unsigned mode, int sync)
+{
+ struct kiocb *iocb = container_of(wait, struct kiocb, ki_wait);
+
+ list_del_init(&wait->task_list);
+ kick_iocb(iocb);
+ return 1;
+}
+
static int FASTCALL(io_submit_one(struct kioctx *ctx, struct iocb *user_iocb,
struct iocb *iocb));
static int io_submit_one(struct kioctx *ctx, struct iocb *user_iocb,
@@ -991,8 +1147,7 @@
{
struct kiocb *req;
struct file *file;
- ssize_t ret;
- char *buf;
+ long ret;
/* enforce forwards compatibility on users */
if (unlikely(iocb->aio_reserved1 || iocb->aio_reserved2 ||
@@ -1033,50 +1188,34 @@
req->ki_user_data = iocb->aio_data;
req->ki_pos = iocb->aio_offset;
- buf = (char *)(unsigned long)iocb->aio_buf;
+ req->ki_buf = (char *)(unsigned long)iocb->aio_buf;
+ req->ki_left = req->ki_nbytes = iocb->aio_nbytes;
+ req->ki_opcode = iocb->aio_lio_opcode;
+ req->ki_retry = aio_process_iocb;
+ init_waitqueue_func_entry(&req->ki_wait, aio_wake_function);
+ INIT_LIST_HEAD(&req->ki_wait.task_list);
+ req->ki_retried = 0;
+ kiocbSetStarted(req);
- switch (iocb->aio_lio_opcode) {
- case IOCB_CMD_PREAD:
- ret = -EBADF;
- if (unlikely(!(file->f_mode & FMODE_READ)))
- goto out_put_req;
- ret = -EFAULT;
- if (unlikely(!access_ok(VERIFY_WRITE, buf, iocb->aio_nbytes)))
- goto out_put_req;
- ret = -EINVAL;
- if (file->f_op->aio_read)
- ret = file->f_op->aio_read(req, buf,
- iocb->aio_nbytes, req->ki_pos);
- break;
- case IOCB_CMD_PWRITE:
- ret = -EBADF;
- if (unlikely(!(file->f_mode & FMODE_WRITE)))
- goto out_put_req;
- ret = -EFAULT;
- if (unlikely(!access_ok(VERIFY_READ, buf, iocb->aio_nbytes)))
- goto out_put_req;
- ret = -EINVAL;
- if (file->f_op->aio_write)
- ret = file->f_op->aio_write(req, buf,
- iocb->aio_nbytes, req->ki_pos);
- break;
- case IOCB_CMD_FDSYNC:
- ret = -EINVAL;
- if (file->f_op->aio_fsync)
- ret = file->f_op->aio_fsync(req, 1);
- break;
- case IOCB_CMD_FSYNC:
- ret = -EINVAL;
- if (file->f_op->aio_fsync)
- ret = file->f_op->aio_fsync(req, 0);
- break;
- default:
- dprintk("EINVAL: io_submit: no operation provided\n");
- ret = -EINVAL;
- }
+ ret = aio_process_iocb(req);
+
+ if (likely(-EIOCBQUEUED == ret)) {
+ int run = 0;
+
+ spin_lock_irq(&ctx->ctx_lock);
+ kiocbClearStarted(req);
+ if (kiocbIsKicked(req))
+ run = __queue_kicked_iocb(req);
+ spin_unlock_irq(&ctx->ctx_lock);
+ if (run)
+ queue_work(aio_wq, &ctx->wq);
- if (likely(-EIOCBQUEUED == ret))
return 0;
+ }
+
+ if ((-EBADF == ret) || (-EFAULT == ret))
+ goto out_put_req;
+
aio_complete(req, ret, 0);
return 0;
diff -ur linux-2.5.62/include/linux/aio.h linux-2.5.62-aio/include/linux/aio.h
--- linux-2.5.62/include/linux/aio.h Tue Feb 18 04:25:50 2003
+++ linux-2.5.62-aio/include/linux/aio.h Mon Mar 3 12:17:12 2003
@@ -29,21 +29,26 @@
#define KIF_LOCKED 0
#define KIF_KICKED 1
#define KIF_CANCELLED 2
+#define KIF_STARTED 3
#define kiocbTryLock(iocb) test_and_set_bit(KIF_LOCKED, &(iocb)->ki_flags)
#define kiocbTryKick(iocb) test_and_set_bit(KIF_KICKED, &(iocb)->ki_flags)
+#define kiocbTryStart(iocb) test_and_set_bit(KIF_STARTED, &(iocb)->ki_flags)
#define kiocbSetLocked(iocb) set_bit(KIF_LOCKED, &(iocb)->ki_flags)
#define kiocbSetKicked(iocb) set_bit(KIF_KICKED, &(iocb)->ki_flags)
#define kiocbSetCancelled(iocb) set_bit(KIF_CANCELLED, &(iocb)->ki_flags)
+#define kiocbSetStarted(iocb) set_bit(KIF_STARTED, &(iocb)->ki_flags)
-#define kiocbClearLocked(iocb) set_bit(KIF_LOCKED, &(iocb)->ki_flags)
-#define kiocbClearKicked(iocb) set_bit(KIF_KICKED, &(iocb)->ki_flags)
-#define kiocbClearCancelled(iocb) set_bit(KIF_CANCELLED, &(iocb)->ki_flags)
+#define kiocbClearLocked(iocb) clear_bit(KIF_LOCKED, &(iocb)->ki_flags)
+#define kiocbClearKicked(iocb) clear_bit(KIF_KICKED, &(iocb)->ki_flags)
+#define kiocbClearCancelled(iocb) clear_bit(KIF_CANCELLED, &(iocb)->ki_flags)
+#define kiocbClearStarted(iocb) clear_bit(KIF_STARTED, &(iocb)->ki_flags)
#define kiocbIsLocked(iocb) test_bit(0, &(iocb)->ki_flags)
#define kiocbIsKicked(iocb) test_bit(1, &(iocb)->ki_flags)
#define kiocbIsCancelled(iocb) test_bit(2, &(iocb)->ki_flags)
+#define kiocbIsStarted(iocb) test_bit(3, &(iocb)->ki_flags)
struct kiocb {
struct list_head ki_run_list;
@@ -62,6 +67,14 @@
void *ki_user_obj; /* pointer to userland's iocb */
__u64 ki_user_data; /* user's data for completion */
loff_t ki_pos;
+
+ /* State that we remember to be able to restart/retry */
+ unsigned short ki_opcode;
+ size_t ki_nbytes; /* copy of iocb->aio_nbytes */
+ char *ki_buf; /* remaining iocb->aio_buf */
+ size_t ki_left; /* remaining bytes */
+ wait_queue_t ki_wait;
+ long ki_retried; /* just for testing */
char private[KIOCB_PRIVATE_SIZE];
};
@@ -77,6 +90,8 @@
(x)->ki_ctx = &tsk->active_mm->default_kioctx; \
(x)->ki_cancel = NULL; \
(x)->ki_user_obj = tsk; \
+ (x)->ki_user_data = 0; \
+ init_wait((&(x)->ki_wait)); \
} while (0)
#define AIO_RING_MAGIC 0xa10a10a1
diff -ur linux-2.5.62/include/linux/init_task.h linux-2.5.62-aio/include/linux/init_task.h
--- linux-2.5.62/include/linux/init_task.h Tue Feb 18 04:25:53 2003
+++ linux-2.5.62-aio/include/linux/init_task.h Thu Feb 27 19:01:39 2003
@@ -101,6 +101,7 @@
.alloc_lock = SPIN_LOCK_UNLOCKED, \
.switch_lock = SPIN_LOCK_UNLOCKED, \
.journal_info = NULL, \
+ .iocb = NULL, \
}
diff -ur linux-2.5.62/include/linux/sched.h linux-2.5.62-aio/include/linux/sched.h
--- linux-2.5.62/include/linux/sched.h Tue Feb 18 04:25:53 2003
+++ linux-2.5.62-aio/include/linux/sched.h Thu Feb 27 19:01:39 2003
@@ -418,6 +418,8 @@
unsigned long ptrace_message;
siginfo_t *last_siginfo; /* For ptrace use. */
+/* current aio handle */
+ struct kiocb *iocb;
};
extern void __put_task_struct(struct task_struct *tsk);
diff -ur linux-2.5.62/kernel/fork.c linux-2.5.62-aio/kernel/fork.c
--- linux-2.5.62/kernel/fork.c Tue Feb 18 04:25:53 2003
+++ linux-2.5.62-aio/kernel/fork.c Tue Mar 4 14:58:44 2003
@@ -128,6 +128,10 @@
spin_lock_irqsave(&q->lock, flags);
if (list_empty(&wait->task_list))
__add_wait_queue(q, wait);
+ else {
+ if (current->iocb && (wait == &current->iocb->ki_wait))
+ printk("prepare_to_wait: iocb->ki_wait in use\n");
+ }
spin_unlock_irqrestore(&q->lock, flags);
}
@@ -834,6 +838,7 @@
p->lock_depth = -1; /* -1 = no lock */
p->start_time = get_jiffies_64();
p->security = NULL;
+ p->iocb = NULL;
retval = -ENOMEM;
if (security_task_alloc(p))
* Re: [Patch 2/2] Retry based aio read - filesystem read changes
2003-03-05 9:17 [RFC][Patch] Retry based aio read for filesystems Suparna Bhattacharya
2003-03-05 9:26 ` [Patch 1/2] Retry based aio read - core aio changes Suparna Bhattacharya
@ 2003-03-05 9:30 ` Suparna Bhattacharya
2003-03-05 10:42 ` Andrew Morton
2003-03-05 23:00 ` [RFC][Patch] Retry based aio read for filesystems Janet Morgan
2 siblings, 1 reply; 15+ messages in thread
From: Suparna Bhattacharya @ 2003-03-05 9:30 UTC (permalink / raw)
To: bcrl, akpm; +Cc: linux-aio, linux-kernel
On Wed, Mar 05, 2003 at 02:47:54PM +0530, Suparna Bhattacharya wrote:
> For the last few days I've been playing with prototyping
> a particular flavour of a retry based implementation for
> filesystem aio read.
# aioread.patch : Modifications for aio read in
# particular (generic_file_aio_read)
diff -ur linux-2.5.62/include/linux/pagemap.h linux-2.5.62-aio/include/linux/pagemap.h
--- linux-2.5.62/include/linux/pagemap.h Tue Feb 18 04:25:49 2003
+++ linux-2.5.62-aio/include/linux/pagemap.h Thu Feb 27 19:01:39 2003
@@ -93,6 +93,16 @@
if (TestSetPageLocked(page))
__lock_page(page);
}
+
+extern int FASTCALL(__aio_lock_page(struct page *page));
+static inline int aio_lock_page(struct page *page)
+{
+ if (TestSetPageLocked(page))
+ return __aio_lock_page(page);
+ else
+ return 0;
+}
+
/*
* This is exported only for wait_on_page_locked/wait_on_page_writeback.
@@ -113,6 +123,15 @@
wait_on_page_bit(page, PG_locked);
}
+extern int FASTCALL(aio_wait_on_page_bit(struct page *page, int bit_nr));
+static inline int aio_wait_on_page_locked(struct page *page)
+{
+ if (PageLocked(page))
+ return aio_wait_on_page_bit(page, PG_locked);
+ else
+ return 0;
+}
+
/*
* Wait for a page to complete writeback
*/
diff -ur linux-2.5.62/mm/filemap.c linux-2.5.62-aio/mm/filemap.c
--- linux-2.5.62/mm/filemap.c Tue Feb 18 04:26:11 2003
+++ linux-2.5.62-aio/mm/filemap.c Mon Mar 3 19:32:40 2003
@@ -268,6 +268,43 @@
}
EXPORT_SYMBOL(wait_on_page_bit);
+int aio_schedule(void)
+{
+ if (!current->iocb) {
+ io_schedule();
+ return 0;
+ } else {
+ pr_debug("aio schedule");
+ return -EIOCBQUEUED;
+ }
+}
+
+int aio_wait_on_page_bit(struct page *page, int bit_nr)
+{
+ wait_queue_head_t *waitqueue = page_waitqueue(page);
+ DEFINE_WAIT(sync_wait);
+ wait_queue_t *wait = &sync_wait;
+ int state = TASK_UNINTERRUPTIBLE;
+
+ if (current->iocb) {
+ wait = &current->iocb->ki_wait;
+ state = TASK_RUNNING;
+ }
+
+ do {
+ prepare_to_wait(waitqueue, wait, state);
+ if (test_bit(bit_nr, &page->flags)) {
+ sync_page(page);
+ if (-EIOCBQUEUED == aio_schedule())
+ return -EIOCBQUEUED;
+ }
+ } while (test_bit(bit_nr, &page->flags));
+ finish_wait(waitqueue, wait);
+
+ return 0;
+}
+EXPORT_SYMBOL(aio_wait_on_page_bit);
+
/**
* unlock_page() - unlock a locked page
*
@@ -336,6 +373,31 @@
}
EXPORT_SYMBOL(__lock_page);
+int __aio_lock_page(struct page *page)
+{
+ wait_queue_head_t *wqh = page_waitqueue(page);
+ DEFINE_WAIT(sync_wait);
+ wait_queue_t *wait = &sync_wait;
+ int state = TASK_UNINTERRUPTIBLE;
+
+ if (current->iocb) {
+ wait = &current->iocb->ki_wait;
+ state = TASK_RUNNING;
+ }
+
+ while (TestSetPageLocked(page)) {
+ prepare_to_wait(wqh, wait, state);
+ if (PageLocked(page)) {
+ sync_page(page);
+ if (-EIOCBQUEUED == aio_schedule())
+ return -EIOCBQUEUED;
+ }
+ }
+ finish_wait(wqh, wait);
+ return 0;
+}
+EXPORT_SYMBOL(__aio_lock_page);
+
/*
* a rather lightweight function, finding and getting a reference to a
* hashed page atomically.
@@ -614,7 +676,13 @@
goto page_ok;
/* Get exclusive access to the page ... */
- lock_page(page);
+
+ if (aio_lock_page(page)) {
+ pr_debug("queued lock page \n");
+ error = -EIOCBQUEUED;
+ /* TBD: should we hold on to the cached page ? */
+ goto sync_error;
+ }
/* Did it get unhashed before we got the lock? */
if (!page->mapping) {
@@ -636,12 +704,18 @@
if (!error) {
if (PageUptodate(page))
goto page_ok;
- wait_on_page_locked(page);
+ if (aio_wait_on_page_locked(page)) {
+ pr_debug("queued wait_on_page \n");
+ error = -EIOCBQUEUED;
+ /*TBD:should we hold on to the cached page ?*/
+ goto sync_error;
+ }
if (PageUptodate(page))
goto page_ok;
error = -EIO;
}
+sync_error:
/* UHHUH! A synchronous read error occurred. Report it */
desc->error = error;
page_cache_release(page);
@@ -813,6 +887,7 @@
ssize_t ret;
init_sync_kiocb(&kiocb, filp);
+ BUG_ON(current->iocb != NULL);
ret = __generic_file_aio_read(&kiocb, &local_iov, 1, ppos);
if (-EIOCBQUEUED == ret)
ret = wait_on_sync_kiocb(&kiocb);
@@ -844,6 +919,7 @@
{
read_descriptor_t desc;
+ BUG_ON(current->iocb != NULL);
if (!count)
return 0;
* Re: [Patch 2/2] Retry based aio read - filesystem read changes
2003-03-05 9:30 ` [Patch 2/2] Retry based aio read - filesystem read changes Suparna Bhattacharya
@ 2003-03-05 10:42 ` Andrew Morton
2003-03-05 12:14 ` Suparna Bhattacharya
0 siblings, 1 reply; 15+ messages in thread
From: Andrew Morton @ 2003-03-05 10:42 UTC (permalink / raw)
To: suparna; +Cc: bcrl, linux-aio, linux-kernel
Suparna Bhattacharya <suparna@in.ibm.com> wrote:
>
> +extern int FASTCALL(aio_wait_on_page_bit(struct page *page, int bit_nr));
> +static inline int aio_wait_on_page_locked(struct page *page)
Oh boy.
There are soooo many more places where we can block:
- write() -> down(&inode->i_sem)
- read()/write() -> read indirect block -> wait_on_buffer()
- read()/write() -> read bitmap block -> wait_on_buffer()
- write() -> allocate block -> mark_buffer_dirty() ->
balance_dirty_pages() -> writer throttling
- write() -> allocate block -> journal_get_write_access()
- read()/write() -> update_a/b/ctime() -> journal_get_write_access()
- ditto for other journalling filesystems
- read()/write() -> anything -> get_request_wait()
(This one can be avoided by polling the backing_dev_info congestion
flags)
- read()/write() -> page allocator -> blk_congestion_wait()
- write() -> balance_dirty_pages() -> writer throttling
- probably others.
Now, I assume that what you're looking for here is an 80% solution, but it
seems that a lot more changes will be needed to get even that far.
And given that a single kernel thread per spindle can easily keep that
spindle saturated all the time, one does wonder "why try to do it this way at
all"?
* Re: [Patch 2/2] Retry based aio read - filesystem read changes
2003-03-05 10:42 ` Andrew Morton
@ 2003-03-05 12:14 ` Suparna Bhattacharya
2003-03-31 18:32 ` Janet Morgan
0 siblings, 1 reply; 15+ messages in thread
From: Suparna Bhattacharya @ 2003-03-05 12:14 UTC (permalink / raw)
To: Andrew Morton; +Cc: bcrl, linux-aio, linux-kernel
On Wed, Mar 05, 2003 at 02:42:54AM -0800, Andrew Morton wrote:
> Suparna Bhattacharya <suparna@in.ibm.com> wrote:
> >
> > +extern int FASTCALL(aio_wait_on_page_bit(struct page *page, int bit_nr));
> > +static inline int aio_wait_on_page_locked(struct page *page)
>
> Oh boy.
>
> There are soooo many more places where we can block:
Oh yes, there are lots (I'm simply shutting my eyes
to them till we've conquered at least a few :))
What I'm trying to do, is to look at them one at a time,
from the ones that have the greatest potential of benefits/
improvement based on impact and complexity.
Otherwise it'd be just too hard to get anywhere !
Actually that was one reason why I decided to post
this early. Want to catch the most important ones
and make sure we can deal with them at least.
Read is the easier case, which is why I started with
it. Will come back to the write case sometime later after
I've played with it a bit.
Even with read there is one more case besides your
list. The copy_to_user or fault_in_writable_pages can
block too ;) (that's what I'm running into) ..
However, the general idea is that any point of blocking
where it is possible for things to just work if we return
instead of waiting, and issue a retry that would
continue from the point where it left off, can be
handled within the basic framework. What we need to do
is to look at these on a case by case basis and see
if that is doable. My hunch is that for congestion and
throttling points it should be possible. And we already
have a lot of pipelining and restartability built
into the VFS now.
Mostly we just need to be able to make sure the
-EIOCBQUEUED returns can be propagated all the way up,
without breaking anything.
Its really the meta-data related waits that are a
black box for me, and wasn't planning on tackling yet ...
More so as I guess it could mean getting into very
filesystem specific territory so doing it consistently
may not be that easy.
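The "propagate -EIOCBQUEUED all the way up" requirement can be
sketched as a toy call chain (user-space illustration, not kernel
code; names and the error value are invented): every layer must
either finish, or return the code untouched while leaving its
state restartable for the retry.

```c
#include <assert.h>

#define SIM_EIOCBQUEUED (-529)  /* stand-in for -EIOCBQUEUED */

/* Lowest level: either proceeds, or reports "would have blocked" */
static long sim_wait_on_page(int page_ready)
{
	return page_ready ? 0 : SIM_EIOCBQUEUED;
}

/* Middle layer: must not swallow or translate the code, and must
 * leave its progress counter valid so the retry can continue. */
static long sim_do_readpage(int page_ready, long *bytes_done)
{
	long ret = sim_wait_on_page(page_ready);

	if (ret == SIM_EIOCBQUEUED)
		return ret;  /* push back up; *bytes_done stays valid */
	*bytes_done += 4096;
	return *bytes_done;
}
```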
>
> - write() -> down(&inode->i_sem)
>
> - read()/write() -> read indirect block -> wait_on_buffer()
>
> - read()/write() -> read bitmap block -> wait_on_buffer()
>
> - write() -> allocate block -> mark_buffer_dirty() ->
> balance_dirty_pages() -> writer throttling
>
> - write() -> allocate block -> journal_get_write_access()
>
> - read()/write() -> update_a/b/ctime() -> journal_get_write_access()
>
> - ditto for other journalling filesystems
>
> - read()/write() -> anything -> get_request_wait()
> (This one can be avoided by polling the backing_dev_info congestion
> flags)
I was thinking of a get_request_async_wait() that unplugs
the queue, and returns -EIOCBQUEUED after queueing the
async waiter to just retry the operation.
Of course this would work only if the caller is able to push
back -EIOCBQUEUED without breaking anything in a
non-restartable way.
>
> - read()/write() -> page allocator -> blk_congestion_wait()
>
> - write() -> balance_dirty_pages() -> writer throttling
>
> - probably others.
>
> Now, I assume that what you're looking for here is an 80% solution, but it
> seems that a lot more changes will be needed to get even that far.
Even if we don't bother about meta-data and indirect blocks
just yet, wouldn't the gains we get otherwise still be
worth it ?
>
> And given that a single kernel thread per spindle can easily keep that
> spindle saturated all the time, one does wonder "why try to do it this way at
> all"?
Need some more explanation on how/where you really break up
a generic_file_aio_read operation (with the page_cache_read,
readpage, copy_to_user) on a per-spindle basis. Aren't we at
a higher level of abstraction compared to disk at this stage ?
I can visualize delegating all readpage calls to a worker
thread per-backing device or something like that (forgetting
LVM/RAID for a while), but what about the rest of the parts ?
Are you suggesting restructuring the generic_file_aio_read
code to separate out these stages, so it can identify where
it is and handoff itself accordingly to the right worker ?
Regards
Suparna
--
Suparna Bhattacharya (suparna@in.ibm.com)
Linux Technology Center
IBM Software Labs, India
* Re: [RFC][Patch] Retry based aio read for filesystems
2003-03-05 9:17 [RFC][Patch] Retry based aio read for filesystems Suparna Bhattacharya
2003-03-05 9:26 ` [Patch 1/2] Retry based aio read - core aio changes Suparna Bhattacharya
2003-03-05 9:30 ` [Patch 2/2] Retry based aio read - filesystem read changes Suparna Bhattacharya
@ 2003-03-05 23:00 ` Janet Morgan
2 siblings, 0 replies; 15+ messages in thread
From: Janet Morgan @ 2003-03-05 23:00 UTC (permalink / raw)
To: suparna; +Cc: bcrl, akpm, linux-aio, linux-kernel, lse-tech
Suparna Bhattacharya wrote:
> For the last few days I've been playing with prototyping
> a particular flavour of a retry based implementation for
> filesystem aio read.
Hi Suparna,
I ran an aio-enabled version of fsx on your patches and no errors were
reported. I plan on using gcov to determine the test coverage I
obtained.
I also did a quick performance test using wall time to compare sync
and async read operations with your patches applied. For the sync
case I ran 10 processes in parallel, each reading 1GB in 1MB chunks
from a per-process dedicated device. I compared that to a single aio
process iteratively calling io_submit/io_getevents for up to 10
iocbs/events, where each iocb specified a 1MB read to its dedicated
device, until 1GB was read. Whew!
The result was that the wall times for the sync and async testcases
were consistently identical, i.e., 1m30s:
# sync_test
start time: Wed Mar 5 14:02:05 PST 2003
end time: Wed Mar 5 14:03:35 PST 2003
# aio_test
start time: Wed Mar 5 13:52:04 PST 2003
end time: Wed Mar 5 13:53:34 PST 2003
syncio vmstat:
   procs                   memory      swap           io       system        cpu
 r  b  w  swpd  free   buff    cache   si  so      bi  bo    in    cs  us  sy  id
 4  6  2  3324  3888  11356  3668848    0   0  151296   0  2216  2395   0 100   0
 4  6  1  3324  3888  11368  3668820    0   0  151744   5  2216  2368   0 100   0
 1  9  1  3324  3952  11368  3668860    0   0  151209   1  2213  2356   0 100   0
 5  5  1  3324  3940  11344  3668864    0   2  148948   3  2215  2387   0 100   0
 1  9  1  3348  3936  11264  3668968    0   6  150767   7  2209  2345   0 100   0
 6  4  1  3480  3920  11192  3669364    0  33  151456  33  2218  2340   0 100   0
 4  6  2  3568  3896  11316  3669352    0  21  151887  21  2218  2385   0 100   0
 7  3  1  3704  3820  11364  3669428   31  34  148687  35  2222  2344   1  99   0
aio vmstat:
   procs                   memory      swap           io       system        cpu
 r  b  w  swpd  free   buff    cache   si  so      bi  bo    in    cs  us  sy  id
 2  0  1  4016  4040  11192  3669644    0  17  133152  25  2073   502   0  40  60
 1  0  1  4016  4104  11196  3669716    0   0  132288   1  2073   537   0  40  60
 2  0  2  4016  4104  11196  3669764    0   0  132416   0  2067   511   0  40  60
 1  0  1  4016  4104  11200  3669788    0   0  133088   1  2075   523   0  41  59
 1  0  1  4016  5576  11200  3668240    0   0  132384   0  2066   526   0  40  60
 1  0  1  4036  4092  11200  3669756    0   5  135116   5  2094   492   0  46  54
 1  0  1  4180  4016  11192  3669944    0  36  135968  40  2111   499   0  46  54
 2  0  2  4180  4060  11176  3669832    0   0  137152   0  2119   463   1  46  53
 1  0  1  4180  7060  11180  3666980    0   0  136384   2  2107   498   0  44  56
-Janet
^ permalink raw reply [flat|nested] 15+ messages in thread
* Re: [Patch 1/2] Retry based aio read - core aio changes
2003-03-05 9:26 ` [Patch 1/2] Retry based aio read - core aio changes Suparna Bhattacharya
@ 2003-03-14 13:23 ` Suparna Bhattacharya
0 siblings, 0 replies; 15+ messages in thread
From: Suparna Bhattacharya @ 2003-03-14 13:23 UTC (permalink / raw)
To: bcrl, akpm; +Cc: linux-aio, linux-kernel, lse-tech
On Wed, Mar 05, 2003 at 02:56:33PM +0530, Suparna Bhattacharya wrote:
> On Wed, Mar 05, 2003 at 02:47:54PM +0530, Suparna Bhattacharya wrote:
> > For the last few days I've been playing with prototyping
> > a particular flavour of a retry based implementation for
> > filesystem aio read.
>
>
> # aioretry.patch : Core aio infrastructure modifications
> for high-level retry based aio
Ben pointed out that, at the least, we shouldn't be duplicating
the switch on every retry. So here's another take, which
cleans things up a bit as well.
It is still very high-level (I'd like to experiment with
it a little further to find out how it behaves/performs in
practice).
(I've checked that the patch applies to 2.5.64-bk8.)
Earlier today I posted separate patches for the
fixes listed below, so they are no longer part of this
patch.
Comments/feedback welcome.
>
> Includes a couple of fixes in the aio code
> -The kiocbClearXXX were doing a set_bit instead of
> clear_bit
> -Sync iocbs were not woken up when iocb->ki_users = 1
> (dio takes a different path for sync and async iocbs,
> so maybe that's why we weren't seeing the problem yet)
>
Regards
Suparna
diff -ur linux-2.5.62/fs/aio.c linux-2.5.62-aio/fs/aio.c
--- linux-2.5.62/fs/aio.c Tue Feb 18 04:26:14 2003
+++ linux-2.5.62-aio/fs/aio.c Tue Mar 11 21:07:35 2003
@@ -395,6 +396,7 @@
req->ki_cancel = NULL;
req->ki_retry = NULL;
req->ki_user_obj = NULL;
+ INIT_LIST_HEAD(&req->ki_run_list);
/* Check if the completion queue has enough free space to
* accept an event from this io.
@@ -558,46 +560,121 @@
enter_lazy_tlb(mm, current, smp_processor_id());
}
-/* Run on kevent's context. FIXME: needs to be per-cpu and warn if an
- * operation blocks.
- */
-static void aio_kick_handler(void *data)
+static inline int __queue_kicked_iocb(struct kiocb *iocb)
{
- struct kioctx *ctx = data;
+ struct kioctx *ctx = iocb->ki_ctx;
- use_mm(ctx->mm);
+ if (list_empty(&iocb->ki_run_list)) {
+ list_add_tail(&iocb->ki_run_list,
+ &ctx->run_list);
+ return 1;
+ }
+ return 0;
+}
- spin_lock_irq(&ctx->ctx_lock);
- while (!list_empty(&ctx->run_list)) {
- struct kiocb *iocb;
- long ret;
+/* Expects to be called with iocb->ki_ctx->lock held */
+static ssize_t aio_run_iocb(struct kiocb *iocb)
+{
+ struct kioctx *ctx = iocb->ki_ctx;
+ ssize_t (*retry)(struct kiocb *);
+ ssize_t ret;
- iocb = list_entry(ctx->run_list.next, struct kiocb,
- ki_run_list);
- list_del(&iocb->ki_run_list);
- iocb->ki_users ++;
- spin_unlock_irq(&ctx->ctx_lock);
+ if (iocb->ki_retried++ > 1024*1024) {
+ printk("Maximal retry count. Bytes done %d\n",
+ iocb->ki_nbytes - iocb->ki_left);
+ return -EAGAIN;
+ }
+
+ if (!(iocb->ki_retried & 0xff)) {
+ printk("%ld aio retries completed %d bytes of %d\n",
+ iocb->ki_retried,
+ iocb->ki_nbytes - iocb->ki_left, iocb->ki_nbytes);
+ }
+
+ if (!(retry = iocb->ki_retry))
+ return 0;
+
+ iocb->ki_users ++;
+ kiocbClearKicked(iocb);
+ iocb->ki_retry = NULL;
+ spin_unlock_irq(&ctx->ctx_lock);
+
+ BUG_ON(current->iocb != NULL);
+
+ current->iocb = iocb;
+ ret = retry(iocb);
+ current->iocb = NULL;
- kiocbClearKicked(iocb);
- ret = iocb->ki_retry(iocb);
- if (-EIOCBQUEUED != ret) {
+ if (-EIOCBQUEUED != ret) {
+ if (list_empty(&iocb->ki_wait.task_list))
aio_complete(iocb, ret, 0);
- iocb = NULL;
- }
+ else
+ printk("can't delete iocb in use\n");
+ } else {
+ if (list_empty(&iocb->ki_wait.task_list))
+ kiocbSetKicked(iocb);
+ }
+ spin_lock_irq(&ctx->ctx_lock);
- spin_lock_irq(&ctx->ctx_lock);
- if (NULL != iocb)
- __aio_put_req(ctx, iocb);
+ iocb->ki_retry = retry;
+ INIT_LIST_HEAD(&iocb->ki_run_list);
+ if (kiocbIsKicked(iocb)) {
+ BUG_ON(ret != -EIOCBQUEUED);
+ __queue_kicked_iocb(iocb);
+ }
+ __aio_put_req(ctx, iocb);
+ return ret;
+}
+
+static void aio_run_iocbs(struct kioctx *ctx)
+{
+ struct kiocb *iocb;
+ ssize_t ret;
+
+ spin_lock_irq(&ctx->ctx_lock);
+ while (!list_empty(&ctx->run_list)) {
+ iocb = list_entry(ctx->run_list.next, struct kiocb,
+ ki_run_list);
+ list_del(&iocb->ki_run_list);
+ ret = aio_run_iocb(iocb);
}
spin_unlock_irq(&ctx->ctx_lock);
+}
+
+/* Run on aiod/kevent's context. FIXME: needs to be per-cpu and warn if an
+ * operation blocks.
+ */
+static void aio_kick_handler(void *data)
+{
+ struct kioctx *ctx = data;
+ use_mm(ctx->mm);
+ aio_run_iocbs(ctx);
unuse_mm(ctx->mm);
}
-void kick_iocb(struct kiocb *iocb)
+
+void queue_kicked_iocb(struct kiocb *iocb)
{
struct kioctx *ctx = iocb->ki_ctx;
+ unsigned long flags;
+ int run = 0;
+
+ WARN_ON((!list_empty(&iocb->ki_wait.task_list)));
+
+ spin_lock_irqsave(&ctx->ctx_lock, flags);
+ run = __queue_kicked_iocb(iocb);
+ spin_unlock_irqrestore(&ctx->ctx_lock, flags);
+ if (run) {
+ if (waitqueue_active(&ctx->wait))
+ wake_up(&ctx->wait);
+ else
+ queue_work(aio_wq, &ctx->wq);
+ }
+}
+void kick_iocb(struct kiocb *iocb)
+{
/* sync iocbs are easy: they can only ever be executing from a
* single context. */
if (is_sync_kiocb(iocb)) {
@@ -607,11 +684,9 @@
}
if (!kiocbTryKick(iocb)) {
- unsigned long flags;
- spin_lock_irqsave(&ctx->ctx_lock, flags);
- list_add_tail(&iocb->ki_run_list, &ctx->run_list);
- spin_unlock_irqrestore(&ctx->ctx_lock, flags);
- schedule_work(&ctx->wq);
+ queue_kicked_iocb(iocb);
+ } else {
+ pr_debug("iocb already kicked or in progress\n");
}
}
@@ -664,6 +739,9 @@
*/
spin_lock_irqsave(&ctx->ctx_lock, flags);
+ if (!list_empty(&iocb->ki_run_list))
+ list_del_init(&iocb->ki_run_list);
+
ring = kmap_atomic(info->ring_pages[0], KM_IRQ1);
tail = info->tail;
@@ -865,6 +943,8 @@
ret = 0;
if (to.timed_out) /* Only check after read evt */
break;
+ /* accelerate kicked iocbs for this ctx */
+ aio_run_iocbs(ctx);
schedule();
if (signal_pending(tsk)) {
ret = -EINTR;
@@ -984,6 +1064,149 @@
return -EINVAL;
}
+ssize_t aio_pread(struct kiocb *iocb)
+{
+ struct file *file = iocb->ki_filp;
+ ssize_t ret = 0;
+
+ ret = file->f_op->aio_read(iocb, iocb->ki_buf,
+ iocb->ki_left, iocb->ki_pos);
+
+ pr_debug("aio_pread: fop ret %d\n", ret);
+
+ /*
+ * Can't just depend on iocb->ki_left to determine
+ * whether we are done. This may have been a short read.
+ */
+ if (ret > 0) {
+ iocb->ki_buf += ret;
+ iocb->ki_left -= ret;
+
+ ret = -EIOCBQUEUED;
+ }
+
+ /* This means we must have transferred all that we could */
+ /* No need to retry anymore */
+ if (ret == 0)
+ ret = iocb->ki_nbytes - iocb->ki_left;
+
+ return ret;
+}
+
+ssize_t aio_pwrite(struct kiocb *iocb)
+{
+ struct file *file = iocb->ki_filp;
+ ssize_t ret = 0;
+
+ ret = file->f_op->aio_write(iocb, iocb->ki_buf,
+ iocb->ki_left, iocb->ki_pos);
+
+ pr_debug("aio_pwrite: fop ret %d\n", ret);
+
+ /*
+ * TBD: Even if iocb->ki_left = 0, could we need to
+ * wait for data to be sync'd ? Or can we assume
+ * that aio_fdsync/aio_fsync would be called explicitly
+ * as required.
+ */
+ if (ret > 0) {
+ iocb->ki_buf += ret;
+ iocb->ki_left -= ret;
+
+ ret = -EIOCBQUEUED;
+ }
+
+ /* This means we must have transferred all that we could */
+ /* No need to retry anymore */
+ if (ret == 0)
+ ret = iocb->ki_nbytes - iocb->ki_left;
+
+ return ret;
+}
+
+ssize_t aio_fdsync(struct kiocb *iocb)
+{
+ struct file *file = iocb->ki_filp;
+ ssize_t ret = -EINVAL;
+
+ if (file->f_op->aio_fsync)
+ ret = file->f_op->aio_fsync(iocb, 1);
+ return ret;
+}
+
+ssize_t aio_fsync(struct kiocb *iocb)
+{
+ struct file *file = iocb->ki_filp;
+ ssize_t ret = -EINVAL;
+
+ if (file->f_op->aio_fsync)
+ ret = file->f_op->aio_fsync(iocb, 0);
+ return ret;
+}
+
+/* Called during initial submission and subsequent retry operations */
+ssize_t aio_setup_iocb(struct kiocb *iocb)
+{
+ struct file *file = iocb->ki_filp;
+ ssize_t ret = 0;
+
+ switch (iocb->ki_opcode) {
+ case IOCB_CMD_PREAD:
+ ret = -EBADF;
+ if (unlikely(!(file->f_mode & FMODE_READ)))
+ break;
+ ret = -EFAULT;
+ if (unlikely(!access_ok(VERIFY_WRITE, iocb->ki_buf,
+ iocb->ki_left)))
+ break;
+ ret = -EINVAL;
+ if (file->f_op->aio_read)
+ iocb->ki_retry = aio_pread;
+ break;
+ case IOCB_CMD_PWRITE:
+ ret = -EBADF;
+ if (unlikely(!(file->f_mode & FMODE_WRITE)))
+ break;
+ ret = -EFAULT;
+ if (unlikely(!access_ok(VERIFY_READ, iocb->ki_buf,
+ iocb->ki_left)))
+ break;
+ ret = -EINVAL;
+ if (file->f_op->aio_write)
+ iocb->ki_retry = aio_pwrite;
+ break;
+ case IOCB_CMD_FDSYNC:
+ ret = -EINVAL;
+ if (file->f_op->aio_fsync)
+ iocb->ki_retry = aio_fdsync;
+ break;
+ case IOCB_CMD_FSYNC:
+ ret = -EINVAL;
+ if (file->f_op->aio_fsync)
+ iocb->ki_retry = aio_fsync;
+ break;
+ default:
+ dprintk("EINVAL: io_submit: no operation provided\n");
+ ret = -EINVAL;
+ }
+
+ if (!iocb->ki_retry)
+ return ret;
+
+ pr_debug("ki_pos = %llu\n", iocb->ki_pos);
+
+ return 0;
+}
+
+int aio_wake_function(wait_queue_t *wait, unsigned mode, int sync)
+{
+ struct kiocb *iocb = container_of(wait, struct kiocb, ki_wait);
+
+ list_del_init(&wait->task_list);
+ kick_iocb(iocb);
+ return 1;
+}
+
static int FASTCALL(io_submit_one(struct kioctx *ctx, struct iocb *user_iocb,
struct iocb *iocb));
static int io_submit_one(struct kioctx *ctx, struct iocb *user_iocb,
@@ -992,7 +1215,6 @@
struct kiocb *req;
struct file *file;
ssize_t ret;
- char *buf;
/* enforce forwards compatibility on users */
if (unlikely(iocb->aio_reserved1 || iocb->aio_reserved2 ||
@@ -1033,51 +1255,27 @@
req->ki_user_data = iocb->aio_data;
req->ki_pos = iocb->aio_offset;
- buf = (char *)(unsigned long)iocb->aio_buf;
+ req->ki_buf = (char *)(unsigned long)iocb->aio_buf;
+ req->ki_left = req->ki_nbytes = iocb->aio_nbytes;
+ req->ki_opcode = iocb->aio_lio_opcode;
+ init_waitqueue_func_entry(&req->ki_wait, aio_wake_function);
+ INIT_LIST_HEAD(&req->ki_wait.task_list);
+ req->ki_run_list.next = req->ki_run_list.prev = NULL;
+ req->ki_retry = NULL;
+ req->ki_retried = 0;
- switch (iocb->aio_lio_opcode) {
- case IOCB_CMD_PREAD:
- ret = -EBADF;
- if (unlikely(!(file->f_mode & FMODE_READ)))
- goto out_put_req;
- ret = -EFAULT;
- if (unlikely(!access_ok(VERIFY_WRITE, buf, iocb->aio_nbytes)))
- goto out_put_req;
- ret = -EINVAL;
- if (file->f_op->aio_read)
- ret = file->f_op->aio_read(req, buf,
- iocb->aio_nbytes, req->ki_pos);
- break;
- case IOCB_CMD_PWRITE:
- ret = -EBADF;
- if (unlikely(!(file->f_mode & FMODE_WRITE)))
- goto out_put_req;
- ret = -EFAULT;
- if (unlikely(!access_ok(VERIFY_READ, buf, iocb->aio_nbytes)))
- goto out_put_req;
- ret = -EINVAL;
- if (file->f_op->aio_write)
- ret = file->f_op->aio_write(req, buf,
- iocb->aio_nbytes, req->ki_pos);
- break;
- case IOCB_CMD_FDSYNC:
- ret = -EINVAL;
- if (file->f_op->aio_fsync)
- ret = file->f_op->aio_fsync(req, 1);
- break;
- case IOCB_CMD_FSYNC:
- ret = -EINVAL;
- if (file->f_op->aio_fsync)
- ret = file->f_op->aio_fsync(req, 0);
- break;
- default:
- dprintk("EINVAL: io_submit: no operation provided\n");
- ret = -EINVAL;
- }
+ ret = aio_setup_iocb(req);
+
+ if ((-EBADF == ret) || (-EFAULT == ret))
+ goto out_put_req;
+
+ spin_lock_irq(&ctx->ctx_lock);
+ ret = aio_run_iocb(req);
+ spin_unlock_irq(&ctx->ctx_lock);
+
+ if (-EIOCBQUEUED == ret)
+ queue_work(aio_wq, &ctx->wq);
- if (likely(-EIOCBQUEUED == ret))
- return 0;
- aio_complete(req, ret, 0);
return 0;
out_put_req:
diff -ur linux-2.5.62/include/linux/aio.h linux-2.5.62-aio/include/linux/aio.h
--- linux-2.5.62/include/linux/aio.h Tue Feb 18 04:25:50 2003
+++ linux-2.5.62-aio/include/linux/aio.h Tue Mar 11 21:31:22 2003
@@ -54,7 +54,7 @@
struct file *ki_filp;
struct kioctx *ki_ctx; /* may be NULL for sync ops */
int (*ki_cancel)(struct kiocb *, struct io_event *);
- long (*ki_retry)(struct kiocb *);
+ ssize_t (*ki_retry)(struct kiocb *);
struct list_head ki_list; /* the aio core uses this
* for cancellation */
@@ -62,6 +62,14 @@
void *ki_user_obj; /* pointer to userland's iocb */
__u64 ki_user_data; /* user's data for completion */
loff_t ki_pos;
+
+ /* State that we remember to be able to restart/retry */
+ unsigned short ki_opcode;
+ size_t ki_nbytes; /* copy of iocb->aio_nbytes */
+ char *ki_buf; /* remaining iocb->aio_buf */
+ size_t ki_left; /* remaining bytes */
+ wait_queue_t ki_wait;
+ long ki_retried; /* just for testing */
char private[KIOCB_PRIVATE_SIZE];
};
@@ -77,6 +85,8 @@
(x)->ki_ctx = &tsk->active_mm->default_kioctx; \
(x)->ki_cancel = NULL; \
(x)->ki_user_obj = tsk; \
+ (x)->ki_user_data = 0; \
+ init_wait((&(x)->ki_wait)); \
} while (0)
#define AIO_RING_MAGIC 0xa10a10a1
diff -ur linux-2.5.62/include/linux/init_task.h linux-2.5.62-aio/include/linux/init_task.h
--- linux-2.5.62/include/linux/init_task.h Tue Feb 18 04:25:53 2003
+++ linux-2.5.62-aio/include/linux/init_task.h Thu Feb 27 19:01:39 2003
@@ -101,6 +101,7 @@
.alloc_lock = SPIN_LOCK_UNLOCKED, \
.switch_lock = SPIN_LOCK_UNLOCKED, \
.journal_info = NULL, \
+ .iocb = NULL, \
}
diff -ur linux-2.5.62/include/linux/sched.h linux-2.5.62-aio/include/linux/sched.h
--- linux-2.5.62/include/linux/sched.h Tue Feb 18 04:25:53 2003
+++ linux-2.5.62-aio/include/linux/sched.h Thu Feb 27 19:01:39 2003
@@ -418,6 +418,8 @@
unsigned long ptrace_message;
siginfo_t *last_siginfo; /* For ptrace use. */
+/* current aio handle */
+ struct kiocb *iocb;
};
extern void __put_task_struct(struct task_struct *tsk);
diff -ur linux-2.5.62/kernel/fork.c linux-2.5.62-aio/kernel/fork.c
--- linux-2.5.62/kernel/fork.c Tue Feb 18 04:25:53 2003
+++ linux-2.5.62-aio/kernel/fork.c Tue Mar 4 14:58:44 2003
@@ -834,6 +838,7 @@
p->lock_depth = -1; /* -1 = no lock */
p->start_time = get_jiffies_64();
p->security = NULL;
+ p->iocb = NULL;
retval = -ENOMEM;
if (security_task_alloc(p))
^ permalink raw reply [flat|nested] 15+ messages in thread
* Re: [Patch 2/2] Retry based aio read - filesystem read changes
2003-03-05 12:14 ` Suparna Bhattacharya
@ 2003-03-31 18:32 ` Janet Morgan
2003-03-31 19:11 ` William Lee Irwin III
0 siblings, 1 reply; 15+ messages in thread
From: Janet Morgan @ 2003-03-31 18:32 UTC (permalink / raw)
To: akpm; +Cc: suparna, bcrl, linux-aio, linux-kernel
> On Wed, Mar 05, 2003 at 02:42:54AM -0800, Andrew Morton wrote:
> > Suparna Bhattacharya <suparna@in.ibm.com> wrote:
> >
> > +extern int FASTCALL(aio_wait_on_page_bit(struct page *page, int bit_nr));
> > +static inline int aio_wait_on_page_locked(struct page *page)
>
> Oh boy.
>
> There are soooo many more places where we can block:
>
> - write() -> down(&inode->i_sem)
>
> - read()/write() -> read indirect block -> wait_on_buffer()
>
> - read()/write() -> read bitmap block -> wait_on_buffer()
>
> - write() -> allocate block -> mark_buffer_dirty() ->
> balance_dirty_pages() -> writer throttling
>
> - write() -> allocate block -> journal_get_write_access()
>
> - read()/write() -> update_a/b/ctime() -> journal_get_write_access()
>
> - ditto for other journalling filesystems
>
> - read()/write() -> anything -> get_request_wait()
> (This one can be avoided by polling the backing_dev_info congestion
> flags)
>
> - read()/write() -> page allocator -> blk_congestion_wait()
>
> - write() -> balance_dirty_pages() -> writer throttling
>
> - probably others.
>
> Now, I assume that what you're looking for here is an 80% solution, but it
> seems that a lot more changes will be needed to get even that far.
I'm trying to identify significant blocking points in the filesystem read
path. I'm not sure what the best approach is, so I figured I'd just track
callers of schedule() under a heavy dd workload.
I collected data while running 1000 processes, where each process was
using dd to sequentially read from a 1GB file. I used 10 target files
in all, so 100 processes read from file1, 100 processes read from file2,
etc. All these numbers were pretty much arbitrary. I used stock 2.5.62
to test.
The top 3 callers of schedule based on my dd read workload are listed
below. Together they accounted for 92% of all calls to schedule.
Suparna's filesystem aio patch already modifies 2 of these 3 callpoints
to be "retryable". So the remaining candidate is the call to cond_resched
from do_generic_mapping_read. The question is whether this qualifies as
the sort of thing that should cause a retry, i.e., is cond_resched a kind
of voluntary yield/preemption point which may not even result in a context
switch if there is nothing more preferable to run? And even if the call to
cond_resched is not cause for a retry, the profile data seems to indicate
that Suparna's patch is roughly an 80% solution (unless I'm missing
something here, which is entirely possible ;-).
So here are the call statistics:
70% of all calls to schedule were from __lock_page:
Based on the profile data, almost all calls to __lock_page were
from do_generic_mapping_read (see filemap.c/Line 101 below).
Suparna's patch already retries here.
15% of all calls to schedule were from __cond_resched:
do_generic_mapping_read -> cond_resched -> __cond_resched
(see filemap.c/Line 41 below).
Should this be made retryable?
7% of all calls to schedule were from wait_on_page_bit:
Based on the profile data, almost all calls to wait_on_page_bit were
from do_generic_mapping_read->wait_on_page_locked -> wait_on_page_bit
(see filemap.c/Line 123 below). Suparna's patch covers this
callpoint, too.
Here's some detail....
Callers of schedule()
7768210 Total calls to schedule
------------
5440195 __lock_page+176
1143862 do_generic_mapping_read+d6
569139 wait_on_page_bit+17e
488273 work_resched+5
43414 __wait_on_buffer+14a
23594 blk_congestion_wait+99
21171 worker_thread+14f
9838 cpu_idle+6d
6308 do_select+29b
5178 schedule_timeout+b8
4136 sys_sched_yield+d9
2096 kswapd+122
1796 sleep_on+55
1767 sys_nanosleep+f8
1655 do_exit+4eb
1643 wait_for_completion+d2
1365 interruptible_sleep_on+55
1034 pipe_wait+98
1004 balanced_irq+6d
975 ksoftirqd+108
971 journal_stop+107
884 sys_wait4+2ff
780
658 unix_wait_for_peer+d6
576 read_chan+52c
580 __pdflush+d2
105 do_poll+ec
102 __down+9e
71 tty_wait_until_sent+f2
59 ksoftirqd+b5
55 __down_interruptible+d8
41
16 migration_thread+cc
10 uart_wait_until_sent+d6
9 unix_stream_data_wait+109
8 tcp_data_wait+16c
4 wait_for_packet+140
3 wait_til_done+110
2 write_chan+280
2 generic_file_aio_write_nolock+bf4
2 journal_commit_transaction+97a
1 usb_hub_thread+c0
1 jfs_lazycommit+1c6
1 serio_thread+e0
1 jfs_sync+259
1 jfsIOWait+15d
1 _lock_fdc+12f
1 acpi_ex_system_do_suspend+3d
1 init_pcmcia_ds+25e
readprofile:
10058989 total 2.9520
5982060 __copy_to_user_ll 41542.0833
928000 do_generic_mapping_read 698.7952
447968 poll_idle 4666.3333
301263 file_read_actor 1176.8086
265660 radix_tree_lookup 1844.8611
209991 page_cache_readahead 410.1387
129655 vfs_read 311.6707
117666 kmap_atomic 919.2656
111874 fget 1165.3542
110996 __generic_file_aio_read 182.5592
109402 kunmap_atomic 3418.8125
95092 vfs_write 228.5865
76119 mark_page_accessed 679.6339
69564 current_kernel_time 724.6250
52569 fput 1095.1875
51824 update_atime 231.3571
51375 activate_page 267.5781
46577 scsi_request_fn 61.9375
41731 generic_file_read 237.1080
39300 shrink_cache 38.9881
37658 schedule 26.1514
37501 refill_inactive_zone 22.3220
37325 scsi_dispatch_cmd 70.6913
32762 unlock_page 292.5179
31924 buffered_rmqueue 86.7500
30454 __make_request 21.3862
28532 do_page_cache_readahead 54.0379
27783 delay_tsc 434.1094
27723 ext2_get_branch 69.3075
25726 shrink_list 13.1793
25246 __find_get_block 63.1150
24443 mpage_end_io_read 169.7431
23800 add_to_page_cache 87.5000
20669 do_softirq 71.7674
19895 system_call 452.1591
18906 do_mpage_readpage 14.4101
18783 page_referenced 73.3711
18137 free_hot_cold_page 59.6612
18015 __wake_up 225.1875
13790 radix_tree_insert 53.8672
11227 radix_tree_delete 31.8949
11188 __alloc_pages 11.4631
10885 __brelse 136.0625
10699 write_null 668.6875
10299 __pagevec_lru_add 35.7604
10100 page_address 42.0833
9968 ext2_get_block 7.5976
2.5.62 filemap.c/do_generic_mapping_read:
1 /*
2 * This is a generic file read routine, and uses the
3 * inode->i_op->readpage() function for the actual low-level
4 * stuff.
5 *
6 * This is really ugly. But the goto's actually try to clarify some
7 * of the logic when it comes to error handling etc.
8 * - note the struct file * is only passed for the use of readpage
9 */
10 void do_generic_mapping_read(struct address_space *mapping,
11 struct file_ra_state *ra,
12 struct file * filp,
13 loff_t *ppos,
14 read_descriptor_t * desc,
15 read_actor_t actor)
16 {
17 struct inode *inode = mapping->host;
18 unsigned long index, offset;
19 struct page *cached_page;
20 int error;
21
22 cached_page = NULL;
23 index = *ppos >> PAGE_CACHE_SHIFT;
24 offset = *ppos & ~PAGE_CACHE_MASK;
25
26 for (;;) {
27 struct page *page;
28 unsigned long end_index, nr, ret;
29
30 end_index = inode->i_size >> PAGE_CACHE_SHIFT;
31
32 if (index > end_index)
33 break;
34 nr = PAGE_CACHE_SIZE;
35 if (index == end_index) {
36 nr = inode->i_size & ~PAGE_CACHE_MASK;
37 if (nr <= offset)
38 break;
39 }
40
41 cond_resched();
42 page_cache_readahead(mapping, ra, filp, index);
43
44 nr = nr - offset;
45
46 /*
47 * Try to find the data in the page cache..
48 */
49 find_page:
50 read_lock(&mapping->page_lock);
51 page = radix_tree_lookup(&mapping->page_tree, index);
52 if (!page) {
53 read_unlock(&mapping->page_lock);
54 handle_ra_miss(mapping,ra);
55 goto no_cached_page;
56 }
57 page_cache_get(page);
58 read_unlock(&mapping->page_lock);
59
60 if (!PageUptodate(page))
61 goto page_not_up_to_date;
62 page_ok:
63 /* If users can be writing to this page using arbitrary
64 * virtual addresses, take care about potential aliasing
65 * before reading the page on the kernel side.
66 */
67 if (!list_empty(&mapping->i_mmap_shared))
68 flush_dcache_page(page);
69
70 /*
71 * Mark the page accessed if we read the beginning.
72 */
73 if (!offset)
74 mark_page_accessed(page);
75
76 /*
77 * Ok, we have the page, and it's up-to-date, so
78 * now we can copy it to user space...
79 *
80 * The actor routine returns how many bytes were actually used..
81 * NOTE! This may not be the same as how much of a user buffer
82 * we filled up (we may be padding etc), so we can only update
83 * "pos" here (the actor routine has to update the user buffer
84 * pointers and the remaining count).
85 */
86 ret = actor(desc, page, offset, nr);
87 offset += ret;
88 index += offset >> PAGE_CACHE_SHIFT;
89 offset &= ~PAGE_CACHE_MASK;
90
91 page_cache_release(page);
92 if (ret == nr && desc->count)
93 continue;
94 break;
95
96 page_not_up_to_date:
97 if (PageUptodate(page))
98 goto page_ok;
99
100 /* Get exclusive access to the page ... */
101 lock_page(page);
102
103 /* Did it get unhashed before we got the lock? */
104 if (!page->mapping) {
105 unlock_page(page);
106 page_cache_release(page);
107 continue;
108 }
109
110 /* Did somebody else fill it already? */
111 if (PageUptodate(page)) {
112 unlock_page(page);
113 goto page_ok;
114 }
115
116 readpage:
117 /* ... and start the actual read. The read will unlock the page. */
118 error = mapping->a_ops->readpage(filp, page);
119
120 if (!error) {
121 if (PageUptodate(page))
122 goto page_ok;
123 wait_on_page_locked(page);
124 if (PageUptodate(page))
125 goto page_ok;
126 error = -EIO;
127 }
128
129 /* UHHUH! A synchronous read error occurred. Report it */
130 desc->error = error;
131 page_cache_release(page);
132 break;
133
134 no_cached_page:
135 /*
136 * Ok, it wasn't cached, so we need to create a new
137 * page..
138 */
139 if (!cached_page) {
140 cached_page = page_cache_alloc_cold(mapping);
141 if (!cached_page) {
142 desc->error = -ENOMEM;
143 break;
144 }
145 }
146 error = add_to_page_cache_lru(cached_page, mapping,
147 index, GFP_KERNEL);
148 if (error) {
149 if (error == -EEXIST)
150 goto find_page;
151 desc->error = error;
152 break;
153 }
154 page = cached_page;
155 cached_page = NULL;
156 goto readpage;
157 }
158
159 *ppos = ((loff_t) index << PAGE_CACHE_SHIFT) + offset;
160 if (cached_page)
161 page_cache_release(cached_page);
162 UPDATE_ATIME(inode);
163 }
^ permalink raw reply [flat|nested] 15+ messages in thread
* Re: [Patch 2/2] Retry based aio read - filesystem read changes
2003-03-31 19:16 ` Benjamin LaHaise
@ 2003-03-31 19:07 ` Janet Morgan
2003-04-01 20:24 ` Benjamin LaHaise
2003-03-31 19:17 ` William Lee Irwin III
2003-04-07 3:51 ` Suparna Bhattacharya
2 siblings, 1 reply; 15+ messages in thread
From: Janet Morgan @ 2003-03-31 19:07 UTC (permalink / raw)
To: Benjamin LaHaise
Cc: William Lee Irwin III, akpm, suparna, linux-aio, linux-kernel
Benjamin LaHaise wrote:
> On Mon, Mar 31, 2003 at 11:11:23AM -0800, William Lee Irwin III wrote:
> > Can you tell whether these are due to hash collisions or contention on
> > the same page?
>
> No, they're most likely waiting for io to complete.
>
> To clean this up I've got a patch to move from aio_read/write with all the
> parameters to a single-parameter, rw-specific iocb. That makes the
> retry for read and write more amenable to sharing common logic akin to the
> wtd_ ops, which we need at the very least for the semaphore operations.
>
> -ben
>
Can you post the patch you're referring to?
Thanks,
-Janet
^ permalink raw reply [flat|nested] 15+ messages in thread
* Re: [Patch 2/2] Retry based aio read - filesystem read changes
2003-03-31 18:32 ` Janet Morgan
@ 2003-03-31 19:11 ` William Lee Irwin III
2003-03-31 19:16 ` Benjamin LaHaise
0 siblings, 1 reply; 15+ messages in thread
From: William Lee Irwin III @ 2003-03-31 19:11 UTC (permalink / raw)
To: Janet Morgan; +Cc: akpm, suparna, bcrl, linux-aio, linux-kernel
On Mon, Mar 31, 2003 at 10:32:20AM -0800, Janet Morgan wrote:
> 70% of all calls to schedule were from __lock_page:
> Based on the profile data, almost all calls to __lock_page were
> from do_generic_mapping_read (see filemap.c/Line 101 below).
> Suparna's patch already retries here.
Can you tell whether these are due to hash collisions or contention on
the same page?
If they're due to hash collisions, things could easily be done to help
(though they wouldn't guarantee not sleeping entirely they'd be good
for general performance).
-- wli
^ permalink raw reply [flat|nested] 15+ messages in thread
* Re: [Patch 2/2] Retry based aio read - filesystem read changes
2003-03-31 19:11 ` William Lee Irwin III
@ 2003-03-31 19:16 ` Benjamin LaHaise
2003-03-31 19:07 ` Janet Morgan
` (2 more replies)
0 siblings, 3 replies; 15+ messages in thread
From: Benjamin LaHaise @ 2003-03-31 19:16 UTC (permalink / raw)
To: William Lee Irwin III, Janet Morgan, akpm, suparna, linux-aio,
linux-kernel
On Mon, Mar 31, 2003 at 11:11:23AM -0800, William Lee Irwin III wrote:
> Can you tell whether these are due to hash collisions or contention on
> the same page?
No, they're most likely waiting for io to complete.
To clean this up I've got a patch to move from aio_read/write with all the
parameters to a single-parameter, rw-specific iocb. That makes the
retry for read and write more amenable to sharing common logic akin to the
wtd_ ops, which we need at the very least for the semaphore operations.
-ben
--
Junk email? <a href="mailto:aart@kvack.org">aart@kvack.org</a>
^ permalink raw reply [flat|nested] 15+ messages in thread
* Re: [Patch 2/2] Retry based aio read - filesystem read changes
2003-03-31 19:16 ` Benjamin LaHaise
2003-03-31 19:07 ` Janet Morgan
@ 2003-03-31 19:17 ` William Lee Irwin III
2003-03-31 19:25 ` Benjamin LaHaise
2003-04-07 3:51 ` Suparna Bhattacharya
2 siblings, 1 reply; 15+ messages in thread
From: William Lee Irwin III @ 2003-03-31 19:17 UTC (permalink / raw)
To: Benjamin LaHaise; +Cc: Janet Morgan, akpm, suparna, linux-aio, linux-kernel
On Mon, Mar 31, 2003 at 11:11:23AM -0800, William Lee Irwin III wrote:
>> Can you tell whether these are due to hash collisions or contention on
>> the same page?
On Mon, Mar 31, 2003 at 02:16:29PM -0500, Benjamin LaHaise wrote:
> No, they're most likely waiting for io to complete.
> To clean this up I've got a patch to move from aio_read/write with all the
> parameters to a single-parameter, rw-specific iocb. That makes the
> retry for read and write more amenable to sharing common logic akin to the
> wtd_ ops, which we need at the very least for the semaphore operations.
I won't get in the way then. I just watch for things related to what I've
touched to make sure it isn't going wrong for anyone.
-- wli
^ permalink raw reply [flat|nested] 15+ messages in thread
* Re: [Patch 2/2] Retry based aio read - filesystem read changes
2003-03-31 19:17 ` William Lee Irwin III
@ 2003-03-31 19:25 ` Benjamin LaHaise
0 siblings, 0 replies; 15+ messages in thread
From: Benjamin LaHaise @ 2003-03-31 19:25 UTC (permalink / raw)
To: William Lee Irwin III, Janet Morgan, akpm, suparna, linux-aio,
linux-kernel
On Mon, Mar 31, 2003 at 11:17:35AM -0800, William Lee Irwin III wrote:
> I won't get in the way then. I just watch for things related to what I've
> touched to make sure it isn't going wrong for anyone.
Longer term, I think you've got the right idea: we need to keep more
statistics on io waits, as right now from a profiling point of view, any
process that is blocked on io doesn't provide meaningful data to the
profiler.
-ben
--
Junk email? <a href="mailto:aart@kvack.org">aart@kvack.org</a>
^ permalink raw reply [flat|nested] 15+ messages in thread
* Re: [Patch 2/2] Retry based aio read - filesystem read changes
2003-03-31 19:07 ` Janet Morgan
@ 2003-04-01 20:24 ` Benjamin LaHaise
0 siblings, 0 replies; 15+ messages in thread
From: Benjamin LaHaise @ 2003-04-01 20:24 UTC (permalink / raw)
To: Janet Morgan
Cc: William Lee Irwin III, akpm, suparna, linux-aio, linux-kernel
On Mon, Mar 31, 2003 at 11:07:55AM -0800, Janet Morgan wrote:
> Can you post the patch you're referring to?
Something like this... It also converts the async rw ops into vectored
form.
-ben
drivers/char/raw.c | 23 ------
fs/adfs/file.c | 4 -
fs/affs/file.c | 19 -----
fs/afs/file.c | 2
fs/aio.c | 127 ++++++++++++++++++++++++-----------
fs/befs/linuxvfs.c | 4 -
fs/bfs/file.c | 4 -
fs/block_dev.c | 18 +----
fs/cifs/cifsfs.c | 4 -
fs/direct-io.c | 10 +-
fs/ext2/file.c | 4 -
fs/ext2/inode.c | 4 -
fs/ext3/file.c | 14 +--
fs/ext3/inode.c | 4 -
fs/fat/file.c | 15 +---
fs/freevxfs/vxfs_inode.c | 2
fs/hpfs/file.c | 14 ---
fs/hpfs/inode.c | 2
fs/jffs/inode-v23.c | 4 -
fs/jffs2/file.c | 4 -
fs/jfs/file.c | 4 -
fs/minix/file.c | 4 -
fs/nfs/file.c | 31 +++-----
fs/nfs/write.c | 4 -
fs/ntfs/aops.c | 2
fs/ntfs/file.c | 4 -
fs/qnx4/file.c | 4 -
fs/ramfs/inode.c | 4 -
fs/read_write.c | 166 +++++++++++++++++++++++------------------------
fs/reiserfs/file.c | 4 -
fs/smbfs/file.c | 20 ++---
fs/sysv/file.c | 4 -
fs/udf/file.c | 18 ++---
fs/ufs/file.c | 4 -
include/linux/aio.h | 67 ++++++++++++++++--
include/linux/fs.h | 32 +++------
include/net/sock.h | 14 ++-
kernel/ksyms.c | 5 -
mm/filemap.c | 145 ++++++++---------------------------------
net/socket.c | 67 ++++++++----------
40 files changed, 400 insertions(+), 485 deletions(-)
diff -purN linux-2.5/drivers/char/raw.c aio-2.5/drivers/char/raw.c
--- linux-2.5/drivers/char/raw.c Tue Apr 1 15:17:26 2003
+++ aio-2.5/drivers/char/raw.c Mon Mar 24 15:39:48 2003
@@ -220,33 +220,12 @@ out:
return err;
}
-static ssize_t raw_file_write(struct file *file, const char *buf,
- size_t count, loff_t *ppos)
-{
- struct iovec local_iov = { .iov_base = (void *)buf, .iov_len = count };
-
- return generic_file_write_nolock(file, &local_iov, 1, ppos);
-}
-
-static ssize_t raw_file_aio_write(struct kiocb *iocb, const char *buf,
- size_t count, loff_t pos)
-{
- struct iovec local_iov = { .iov_base = (void *)buf, .iov_len = count };
-
- return generic_file_aio_write_nolock(iocb, &local_iov, 1, &iocb->ki_pos);
-}
-
-
static struct file_operations raw_fops = {
- .read = generic_file_read,
.aio_read = generic_file_aio_read,
- .write = raw_file_write,
- .aio_write = raw_file_aio_write,
+ .aio_write = generic_file_aio_write_nolock,
.open = raw_open,
.release= raw_release,
.ioctl = raw_ioctl,
- .readv = generic_file_readv,
- .writev = generic_file_writev,
.owner = THIS_MODULE,
};
diff -purN linux-2.5/fs/adfs/file.c aio-2.5/fs/adfs/file.c
--- linux-2.5/fs/adfs/file.c Tue Apr 1 15:17:53 2003
+++ aio-2.5/fs/adfs/file.c Mon Mar 24 11:25:00 2003
@@ -31,11 +31,11 @@
#include "adfs.h"
struct file_operations adfs_file_operations = {
+ .aio_read = generic_file_aio_read,
+ .aio_write = generic_file_aio_write,
.llseek = generic_file_llseek,
- .read = generic_file_read,
.mmap = generic_file_mmap,
.fsync = file_fsync,
- .write = generic_file_write,
.sendfile = generic_file_sendfile,
};
diff -purN linux-2.5/fs/affs/file.c aio-2.5/fs/affs/file.c
--- linux-2.5/fs/affs/file.c Tue Apr 1 15:17:53 2003
+++ aio-2.5/fs/affs/file.c Mon Mar 24 11:25:00 2003
@@ -39,14 +39,13 @@ static int affs_grow_extcache(struct ino
static struct buffer_head *affs_alloc_extblock(struct inode *inode, struct buffer_head *bh, u32 ext);
static inline struct buffer_head *affs_get_extblock(struct inode *inode, u32 ext);
static struct buffer_head *affs_get_extblock_slow(struct inode *inode, u32 ext);
-static ssize_t affs_file_write(struct file *filp, const char *buf, size_t count, loff_t *ppos);
static int affs_file_open(struct inode *inode, struct file *filp);
static int affs_file_release(struct inode *inode, struct file *filp);
struct file_operations affs_file_operations = {
.llseek = generic_file_llseek,
- .read = generic_file_read,
- .write = affs_file_write,
+ .aio_read = generic_file_aio_read,
+ .aio_write = generic_file_aio_write,
.mmap = generic_file_mmap,
.open = affs_file_open,
.release = affs_file_release,
@@ -490,20 +489,6 @@ affs_getemptyblk_ino(struct inode *inode
return ERR_PTR(err);
}
-static ssize_t
-affs_file_write(struct file *file, const char *buf, size_t count, loff_t *ppos)
-{
- ssize_t retval;
-
- retval = generic_file_write (file, buf, count, ppos);
- if (retval >0) {
- struct inode *inode = file->f_dentry->d_inode;
- inode->i_ctime = inode->i_mtime = CURRENT_TIME;
- mark_inode_dirty(inode);
- }
- return retval;
-}
-
static int
affs_do_readpage_ofs(struct file *file, struct page *page, unsigned from, unsigned to)
{
diff -purN linux-2.5/fs/afs/file.c aio-2.5/fs/afs/file.c
--- linux-2.5/fs/afs/file.c Tue Apr 1 15:17:53 2003
+++ aio-2.5/fs/afs/file.c Mon Mar 24 11:25:00 2003
@@ -35,7 +35,7 @@ struct inode_operations afs_file_inode_o
};
struct file_operations afs_file_file_operations = {
- .read = generic_file_read,
+ .aio_read = generic_file_aio_read,
.write = afs_file_write,
.mmap = generic_file_readonly_mmap,
#if 0
diff -purN linux-2.5/fs/aio.c aio-2.5/fs/aio.c
--- linux-2.5/fs/aio.c Tue Mar 25 16:50:38 2003
+++ aio-2.5/fs/aio.c Mon Mar 31 14:41:41 2003
@@ -64,7 +64,7 @@ static void aio_kick_handler(void *);
*/
static int __init aio_setup(void)
{
- kiocb_cachep = kmem_cache_create("kiocb", sizeof(struct kiocb),
+ kiocb_cachep = kmem_cache_create("kiocb", sizeof(struct sync_iocb),
0, SLAB_HWCACHE_ALIGN, NULL, NULL);
if (!kiocb_cachep)
panic("unable to create kiocb cache\n");
@@ -148,7 +148,7 @@ static int aio_setup_ring(struct kioctx
dprintk("mmap address: 0x%08lx\n", info->mmap_base);
info->nr_pages = get_user_pages(current, ctx->mm,
- info->mmap_base, info->mmap_size,
+ info->mmap_base, nr_pages,
1, 0, info->ring_pages, NULL);
up_write(&ctx->mm->mmap_sem);
@@ -790,6 +790,20 @@ static inline void set_timeout(long star
add_timer(&to->timer);
}
+static inline void update_ts(struct timespec *ts, long jiffies)
+{
+ struct timespec tmp;
+ jiffies_to_timespec(jiffies, &tmp);
+ ts->tv_sec -= tmp.tv_sec;
+ ts->tv_nsec -= tmp.tv_nsec;
+ if (ts->tv_nsec < 0) {
+ ts->tv_nsec += 1000000000;
+ ts->tv_sec -= 1;
+ }
+ if (ts->tv_sec < 0)
+ ts->tv_sec = ts->tv_nsec = 0;
+}
+
static inline void clear_timeout(struct timeout *to)
{
del_timer_sync(&to->timer);
@@ -807,6 +821,7 @@ static int read_events(struct kioctx *ct
int i = 0;
struct io_event ent;
struct timeout to;
+ struct timespec ts;
/* needed to zero any padding within an entry (there shouldn't be
* any, but C is fun!
@@ -844,7 +859,6 @@ static int read_events(struct kioctx *ct
init_timeout(&to);
if (timeout) {
- struct timespec ts;
ret = -EFAULT;
if (unlikely(copy_from_user(&ts, timeout, sizeof(ts))))
goto out;
@@ -890,8 +904,12 @@ static int read_events(struct kioctx *ct
i ++;
}
- if (timeout)
+ if (timeout) {
clear_timeout(&to);
+ update_ts(&ts, jiffies - start_jiffies);
+ if (copy_to_user(timeout, &ts, sizeof(ts)))
+ ret = -EFAULT;
+ }
out:
return i ? i : ret;
}
@@ -984,6 +1002,63 @@ asmlinkage long sys_io_destroy(aio_conte
return -EINVAL;
}
+static ssize_t rw_issue(struct rw_iocb *rw, struct file *file,
+ struct iocb *iocb, ssize_t (*op)(struct rw_iocb *))
+{
+ ssize_t ret;
+
+ if (unlikely(NULL == op))
+ return -EINVAL;
+ if (unlikely(!(file->f_mode &
+ (rw->rw == WRITE ? FMODE_WRITE : FMODE_READ))))
+ return -EBADF;
+ if (unlikely(!access_ok((rw->rw == WRITE ? VERIFY_READ : VERIFY_WRITE),
+ rw->rw_local_iov.iov_base,
+ rw->rw_local_iov.iov_len)))
+ return -EFAULT;
+
+ rw->rw_local_iov.iov_base = (char *)(unsigned long)iocb->aio_buf;
+ rw->rw_local_iov.iov_len = iocb->aio_nbytes;
+ rw->rw_nsegs = 1;
+ rw->rw_iov = &rw->rw_local_iov;
+ rw->rw_pos = iocb->aio_offset;
+
+ ret = op(rw);
+ if (-EIOCBQUEUED != ret)
+ aio_complete(&rw->kiocb, ret, 0);
+ return 0;
+}
+
+static ssize_t io_submit_pread(struct kiocb *kiocb, struct file *file,
+ struct iocb *iocb)
+{
+ struct rw_iocb *rw = kiocb_to_rw_iocb(kiocb);
+ rw->rw = READ;
+ return rw_issue(rw, file, iocb, file->f_op->aio_read);
+}
+
+static ssize_t io_submit_pwrite(struct kiocb *kiocb, struct file *file,
+ struct iocb *iocb)
+{
+ struct rw_iocb *rw = (struct rw_iocb *)kiocb;
+ rw->rw = WRITE;
+ return rw_issue(rw, file, iocb, file->f_op->aio_write);
+}
+static ssize_t io_submit_fsync(struct kiocb *kiocb, struct file *file,
+ int dsync)
+{
+ struct fsync_iocb *fsync_iocb = kiocb_to_fsync_iocb(kiocb);
+ ssize_t ret = -EINVAL;
+ fsync_iocb->dsync = dsync;
+ if (NULL != file->f_op->aio_fsync) {
+ ret = file->f_op->aio_fsync(fsync_iocb);
+ if (-EIOCBQUEUED != ret)
+ aio_complete(&fsync_iocb->kiocb, ret, 0);
+ ret = 0;
+ }
+ return ret;
+}
+
static int FASTCALL(io_submit_one(struct kioctx *ctx, struct iocb *user_iocb,
struct iocb *iocb));
static int io_submit_one(struct kioctx *ctx, struct iocb *user_iocb,
@@ -992,7 +1067,6 @@ static int io_submit_one(struct kioctx *
struct kiocb *req;
struct file *file;
ssize_t ret;
- char *buf;
/* enforce forwards compatibility on users */
if (unlikely(iocb->aio_reserved1 || iocb->aio_reserved2 ||
@@ -1031,57 +1105,30 @@ static int io_submit_one(struct kioctx *
req->ki_user_obj = user_iocb;
req->ki_user_data = iocb->aio_data;
- req->ki_pos = iocb->aio_offset;
-
- buf = (char *)(unsigned long)iocb->aio_buf;
switch (iocb->aio_lio_opcode) {
case IOCB_CMD_PREAD:
- ret = -EBADF;
- if (unlikely(!(file->f_mode & FMODE_READ)))
- goto out_put_req;
- ret = -EFAULT;
- if (unlikely(!access_ok(VERIFY_WRITE, buf, iocb->aio_nbytes)))
- goto out_put_req;
- ret = -EINVAL;
- if (file->f_op->aio_read)
- ret = file->f_op->aio_read(req, buf,
- iocb->aio_nbytes, req->ki_pos);
+ ret = io_submit_pread(req, file, iocb);
break;
case IOCB_CMD_PWRITE:
- ret = -EBADF;
- if (unlikely(!(file->f_mode & FMODE_WRITE)))
- goto out_put_req;
- ret = -EFAULT;
- if (unlikely(!access_ok(VERIFY_READ, buf, iocb->aio_nbytes)))
- goto out_put_req;
- ret = -EINVAL;
- if (file->f_op->aio_write)
- ret = file->f_op->aio_write(req, buf,
- iocb->aio_nbytes, req->ki_pos);
+ ret = io_submit_pwrite(req, file, iocb);
break;
case IOCB_CMD_FDSYNC:
- ret = -EINVAL;
- if (file->f_op->aio_fsync)
- ret = file->f_op->aio_fsync(req, 1);
+ ret = io_submit_fsync(req, file, 1);
break;
case IOCB_CMD_FSYNC:
- ret = -EINVAL;
- if (file->f_op->aio_fsync)
- ret = file->f_op->aio_fsync(req, 0);
+ ret = io_submit_fsync(req, file, 0);
break;
default:
dprintk("EINVAL: io_submit: no operation provided\n");
ret = -EINVAL;
}
- if (likely(-EIOCBQUEUED == ret))
- return 0;
- aio_complete(req, ret, 0);
- return 0;
-
+ if (-EIOCBQUEUED == ret)
+ ret = 0;
out_put_req:
- aio_put_req(req);
+ if (ret)
+ aio_put_req(req);
return ret;
}
diff -purN linux-2.5/fs/befs/linuxvfs.c aio-2.5/fs/befs/linuxvfs.c
--- linux-2.5/fs/befs/linuxvfs.c Tue Apr 1 15:17:53 2003
+++ aio-2.5/fs/befs/linuxvfs.c Mon Mar 24 11:25:00 2003
@@ -73,7 +73,7 @@ struct inode_operations befs_dir_inode_o
struct file_operations befs_file_operations = {
.llseek = default_llseek,
- .read = generic_file_read,
+ .aio_read = generic_file_aio_read,
.mmap = generic_file_readonly_mmap,
};
@@ -89,7 +89,7 @@ static struct inode_operations befs_syml
};
/*
- * Called by generic_file_read() to read a page of data
+ * Called by generic_file_aio_read() to read a page of data
*
* In turn, simply calls a generic block read function and
* passes it the address of befs_get_block, for mapping file
diff -purN linux-2.5/fs/bfs/file.c aio-2.5/fs/bfs/file.c
--- linux-2.5/fs/bfs/file.c Tue Apr 1 15:17:53 2003
+++ aio-2.5/fs/bfs/file.c Mon Mar 24 11:25:00 2003
@@ -19,8 +19,8 @@
struct file_operations bfs_file_operations = {
.llseek = generic_file_llseek,
- .read = generic_file_read,
- .write = generic_file_write,
+ .aio_read = generic_file_aio_read,
+ .aio_write = generic_file_aio_write,
.mmap = generic_file_mmap,
.sendfile = generic_file_sendfile,
};
diff -purN linux-2.5/fs/block_dev.c aio-2.5/fs/block_dev.c
--- linux-2.5/fs/block_dev.c Tue Apr 1 15:17:52 2003
+++ aio-2.5/fs/block_dev.c Mon Mar 24 17:25:12 2003
@@ -118,10 +118,10 @@ blkdev_get_blocks(struct inode *inode, s
}
static int
-blkdev_direct_IO(int rw, struct kiocb *iocb, const struct iovec *iov,
+blkdev_direct_IO(int rw, struct rw_iocb *iocb, const struct iovec *iov,
loff_t offset, unsigned long nr_segs)
{
- struct file *file = iocb->ki_filp;
+ struct file *file = iocb->kiocb.ki_filp;
struct inode *inode = file->f_dentry->d_inode->i_mapping->host;
return blockdev_direct_IO(rw, iocb, inode, inode->i_bdev, iov, offset,
@@ -694,14 +694,6 @@ int blkdev_close(struct inode * inode, s
return blkdev_put(inode->i_bdev, BDEV_FILE);
}
-static ssize_t blkdev_file_write(struct file *file, const char *buf,
- size_t count, loff_t *ppos)
-{
- struct iovec local_iov = { .iov_base = (void *)buf, .iov_len = count };
-
- return generic_file_write_nolock(file, &local_iov, 1, ppos);
-}
-
struct address_space_operations def_blk_aops = {
.readpage = blkdev_readpage,
.writepage = blkdev_writepage,
@@ -716,13 +708,11 @@ struct file_operations def_blk_fops = {
.open = blkdev_open,
.release = blkdev_close,
.llseek = block_llseek,
- .read = generic_file_read,
- .write = blkdev_file_write,
+ .aio_read = generic_file_aio_read,
+ .aio_write = generic_file_aio_write_nolock,
.mmap = generic_file_mmap,
.fsync = block_fsync,
.ioctl = blkdev_ioctl,
- .readv = generic_file_readv,
- .writev = generic_file_writev,
.sendfile = generic_file_sendfile,
};
diff -purN linux-2.5/fs/cifs/cifsfs.c aio-2.5/fs/cifs/cifsfs.c
--- linux-2.5/fs/cifs/cifsfs.c Tue Apr 1 15:17:53 2003
+++ aio-2.5/fs/cifs/cifsfs.c Mon Mar 24 11:25:00 2003
@@ -316,8 +316,8 @@ struct inode_operations cifs_symlink_ino
};
struct file_operations cifs_file_ops = {
- .read = generic_file_read,
- .write = generic_file_write,
+ .aio_read = generic_file_aio_read,
+ .aio_write = generic_file_aio_write,
.open = cifs_open,
.release = cifs_close,
.lock = cifs_lock,
diff -purN linux-2.5/fs/direct-io.c aio-2.5/fs/direct-io.c
--- linux-2.5/fs/direct-io.c Tue Apr 1 15:17:52 2003
+++ aio-2.5/fs/direct-io.c Mon Mar 24 17:26:02 2003
@@ -113,7 +113,7 @@ struct dio {
struct task_struct *waiter; /* waiting task (NULL if none) */
/* AIO related stuff */
- struct kiocb *iocb; /* kiocb */
+ struct rw_iocb *iocb; /* kiocb */
int is_async; /* is IO async ? */
int result; /* IO result */
};
@@ -200,7 +200,7 @@ static void finished_one_bio(struct dio
{
if (atomic_dec_and_test(&dio->bio_count)) {
if(dio->is_async) {
- aio_complete(dio->iocb, dio->result, 0);
+ aio_complete(&dio->iocb->kiocb, dio->result, 0);
kfree(dio);
}
}
@@ -822,7 +822,7 @@ out:
}
static int
-direct_io_worker(int rw, struct kiocb *iocb, struct inode *inode,
+direct_io_worker(int rw, struct rw_iocb *iocb, struct inode *inode,
const struct iovec *iov, loff_t offset, unsigned long nr_segs,
unsigned blkbits, get_blocks_t get_blocks)
{
@@ -836,7 +836,7 @@ direct_io_worker(int rw, struct kiocb *i
dio = kmalloc(sizeof(*dio), GFP_KERNEL);
if (!dio)
return -ENOMEM;
- dio->is_async = !is_sync_kiocb(iocb);
+ dio->is_async = !is_sync_kiocb(&iocb->kiocb);
dio->bio = NULL;
dio->inode = inode;
@@ -960,7 +960,7 @@ direct_io_worker(int rw, struct kiocb *i
* This is a library function for use by filesystem drivers.
*/
int
-blockdev_direct_IO(int rw, struct kiocb *iocb, struct inode *inode,
+blockdev_direct_IO(int rw, struct rw_iocb *iocb, struct inode *inode,
struct block_device *bdev, const struct iovec *iov, loff_t offset,
unsigned long nr_segs, get_blocks_t get_blocks)
{
diff -purN linux-2.5/fs/ext2/file.c aio-2.5/fs/ext2/file.c
--- linux-2.5/fs/ext2/file.c Tue Apr 1 15:17:53 2003
+++ aio-2.5/fs/ext2/file.c Mon Mar 24 15:39:52 2003
@@ -41,8 +41,6 @@ static int ext2_release_file (struct ino
*/
struct file_operations ext2_file_operations = {
.llseek = generic_file_llseek,
- .read = generic_file_read,
- .write = generic_file_write,
.aio_read = generic_file_aio_read,
.aio_write = generic_file_aio_write,
.ioctl = ext2_ioctl,
@@ -50,8 +48,6 @@ struct file_operations ext2_file_operati
.open = generic_file_open,
.release = ext2_release_file,
.fsync = ext2_sync_file,
- .readv = generic_file_readv,
- .writev = generic_file_writev,
.sendfile = generic_file_sendfile,
};
diff -purN linux-2.5/fs/ext2/inode.c aio-2.5/fs/ext2/inode.c
--- linux-2.5/fs/ext2/inode.c Tue Apr 1 15:17:53 2003
+++ aio-2.5/fs/ext2/inode.c Mon Mar 24 17:26:06 2003
@@ -650,10 +650,10 @@ ext2_get_blocks(struct inode *inode, sec
}
static int
-ext2_direct_IO(int rw, struct kiocb *iocb, const struct iovec *iov,
+ext2_direct_IO(int rw, struct rw_iocb *iocb, const struct iovec *iov,
loff_t offset, unsigned long nr_segs)
{
- struct file *file = iocb->ki_filp;
+ struct file *file = iocb->kiocb.ki_filp;
struct inode *inode = file->f_dentry->d_inode->i_mapping->host;
return blockdev_direct_IO(rw, iocb, inode, inode->i_sb->s_bdev, iov,
diff -purN linux-2.5/fs/ext3/file.c aio-2.5/fs/ext3/file.c
--- linux-2.5/fs/ext3/file.c Tue Apr 1 15:17:53 2003
+++ aio-2.5/fs/ext3/file.c Mon Mar 24 17:26:09 2003
@@ -56,13 +56,13 @@ static int ext3_open_file (struct inode
}
static ssize_t
-ext3_file_write(struct kiocb *iocb, const char *buf, size_t count, loff_t pos)
+ext3_file_write(struct rw_iocb *iocb)
{
- struct file *file = iocb->ki_filp;
+ struct file *file = iocb->kiocb.ki_filp;
struct inode *inode = file->f_dentry->d_inode;
int ret, err;
- ret = generic_file_aio_write(iocb, buf, count, pos);
+ ret = generic_file_aio_write(iocb);
/*
* Skip flushing if there was an error, or if nothing was written.
@@ -114,12 +114,8 @@ force_commit:
struct file_operations ext3_file_operations = {
.llseek = generic_file_llseek,
- .read = do_sync_read,
- .write = do_sync_write,
- .aio_read = generic_file_aio_read,
- .aio_write = ext3_file_write,
- .readv = generic_file_readv,
- .writev = generic_file_writev,
+ .aio_read = generic_file_aio_read,
+ .aio_write = ext3_file_write,
.ioctl = ext3_ioctl,
.mmap = generic_file_mmap,
.open = ext3_open_file,
diff -purN linux-2.5/fs/ext3/inode.c aio-2.5/fs/ext3/inode.c
--- linux-2.5/fs/ext3/inode.c Tue Apr 1 15:17:53 2003
+++ aio-2.5/fs/ext3/inode.c Mon Mar 24 17:26:13 2003
@@ -1426,11 +1426,11 @@ static int ext3_releasepage(struct page
* If the O_DIRECT write is intantiating holes inside i_size and the machine
* crashes then stale disk data _may_ be exposed inside the file.
*/
-static int ext3_direct_IO(int rw, struct kiocb *iocb,
+static int ext3_direct_IO(int rw, struct rw_iocb *iocb,
const struct iovec *iov, loff_t offset,
unsigned long nr_segs)
{
- struct file *file = iocb->ki_filp;
+ struct file *file = iocb->kiocb.ki_filp;
struct inode *inode = file->f_dentry->d_inode->i_mapping->host;
struct ext3_inode_info *ei = EXT3_I(inode);
handle_t *handle = NULL;
diff -purN linux-2.5/fs/fat/file.c aio-2.5/fs/fat/file.c
--- linux-2.5/fs/fat/file.c Tue Apr 1 15:17:54 2003
+++ aio-2.5/fs/fat/file.c Mon Mar 24 17:54:55 2003
@@ -11,13 +11,12 @@
#include <linux/smp_lock.h>
#include <linux/buffer_head.h>
-static ssize_t fat_file_write(struct file *filp, const char *buf, size_t count,
- loff_t *ppos);
+static ssize_t fat_file_write(struct rw_iocb *iocb);
struct file_operations fat_file_operations = {
.llseek = generic_file_llseek,
- .read = generic_file_read,
- .write = fat_file_write,
+ .aio_read = generic_file_aio_read,
+ .aio_write = fat_file_write,
.mmap = generic_file_mmap,
.fsync = file_fsync,
.sendfile = generic_file_sendfile,
@@ -65,14 +64,14 @@ int fat_get_block(struct inode *inode, s
return 0;
}
-static ssize_t fat_file_write(struct file *filp, const char *buf, size_t count,
- loff_t *ppos)
+static ssize_t fat_file_write(struct rw_iocb *iocb)
{
- struct inode *inode = filp->f_dentry->d_inode;
+ struct file *filp = iocb->kiocb.ki_filp;
int retval;
- retval = generic_file_write(filp, buf, count, ppos);
+ retval = generic_file_aio_write(iocb);
if (retval > 0) {
+ struct inode *inode = filp->f_dentry->d_inode;
inode->i_mtime = inode->i_ctime = CURRENT_TIME;
MSDOS_I(inode)->i_attrs |= ATTR_ARCH;
mark_inode_dirty(inode);
diff -purN linux-2.5/fs/freevxfs/vxfs_inode.c aio-2.5/fs/freevxfs/vxfs_inode.c
--- linux-2.5/fs/freevxfs/vxfs_inode.c Tue Apr 1 15:17:55 2003
+++ aio-2.5/fs/freevxfs/vxfs_inode.c Mon Mar 24 11:25:00 2003
@@ -51,7 +51,7 @@ extern struct inode_operations vxfs_imme
static struct file_operations vxfs_file_operations = {
.open = generic_file_open,
.llseek = generic_file_llseek,
- .read = generic_file_read,
+ .aio_read = generic_file_aio_read,
.mmap = generic_file_mmap,
.sendfile = generic_file_sendfile,
};
diff -purN linux-2.5/fs/hpfs/file.c aio-2.5/fs/hpfs/file.c
--- linux-2.5/fs/hpfs/file.c Tue Apr 1 15:17:55 2003
+++ aio-2.5/fs/hpfs/file.c Mon Mar 24 11:25:00 2003
@@ -123,17 +123,3 @@ struct address_space_operations hpfs_aop
.commit_write = generic_commit_write,
.bmap = _hpfs_bmap
};
-
-ssize_t hpfs_file_write(struct file *file, const char *buf, size_t count, loff_t *ppos)
-{
- ssize_t retval;
-
- retval = generic_file_write(file, buf, count, ppos);
- if (retval > 0) {
- struct inode *inode = file->f_dentry->d_inode;
- inode->i_mtime = CURRENT_TIME;
- hpfs_i(inode)->i_dirty = 1;
- }
- return retval;
-}
-
diff -purN linux-2.5/fs/hpfs/inode.c aio-2.5/fs/hpfs/inode.c
--- linux-2.5/fs/hpfs/inode.c Tue Apr 1 15:17:55 2003
+++ aio-2.5/fs/hpfs/inode.c Mon Mar 24 11:25:00 2003
@@ -15,7 +15,7 @@
static struct file_operations hpfs_file_ops =
{
.llseek = generic_file_llseek,
- .read = generic_file_read,
+ .aio_read = generic_file_aio_read,
.write = hpfs_file_write,
.mmap = generic_file_mmap,
.open = hpfs_open,
diff -purN linux-2.5/fs/jffs/inode-v23.c aio-2.5/fs/jffs/inode-v23.c
--- linux-2.5/fs/jffs/inode-v23.c Tue Apr 1 15:17:55 2003
+++ aio-2.5/fs/jffs/inode-v23.c Mon Mar 24 11:25:00 2003
@@ -1639,8 +1639,8 @@ static struct file_operations jffs_file_
{
.open = generic_file_open,
.llseek = generic_file_llseek,
- .read = generic_file_read,
- .write = generic_file_write,
+ .aio_read = generic_file_aio_read,
+ .aio_write = generic_file_aio_write,
.ioctl = jffs_ioctl,
.mmap = generic_file_readonly_mmap,
.fsync = jffs_fsync,
diff -purN linux-2.5/fs/jffs2/file.c aio-2.5/fs/jffs2/file.c
--- linux-2.5/fs/jffs2/file.c Tue Apr 1 15:17:55 2003
+++ aio-2.5/fs/jffs2/file.c Mon Mar 24 11:46:18 2003
@@ -55,8 +55,8 @@ struct file_operations jffs2_file_operat
{
.llseek = generic_file_llseek,
.open = generic_file_open,
- .read = generic_file_read,
- .write = generic_file_write,
+ .aio_read = generic_file_aio_read,
+ .aio_write = generic_file_aio_write,
.ioctl = jffs2_ioctl,
.mmap = generic_file_readonly_mmap,
.fsync = jffs2_fsync,
diff -purN linux-2.5/fs/jfs/file.c aio-2.5/fs/jfs/file.c
--- linux-2.5/fs/jfs/file.c Tue Apr 1 15:17:55 2003
+++ aio-2.5/fs/jfs/file.c Mon Mar 24 15:40:01 2003
@@ -100,13 +100,9 @@ struct inode_operations jfs_file_inode_o
struct file_operations jfs_file_operations = {
.open = jfs_open,
.llseek = generic_file_llseek,
- .write = generic_file_write,
- .read = generic_file_read,
.aio_read = generic_file_aio_read,
.aio_write = generic_file_aio_write,
.mmap = generic_file_mmap,
- .readv = generic_file_readv,
- .writev = generic_file_writev,
.sendfile = generic_file_sendfile,
.fsync = jfs_fsync,
.release = jfs_release,
diff -purN linux-2.5/fs/minix/file.c aio-2.5/fs/minix/file.c
--- linux-2.5/fs/minix/file.c Tue Apr 1 15:17:55 2003
+++ aio-2.5/fs/minix/file.c Mon Mar 24 11:25:00 2003
@@ -17,8 +17,8 @@ int minix_sync_file(struct file *, struc
struct file_operations minix_file_operations = {
.llseek = generic_file_llseek,
- .read = generic_file_read,
- .write = generic_file_write,
+ .aio_read = generic_file_aio_read,
+ .aio_write = generic_file_aio_write,
.mmap = generic_file_mmap,
.fsync = minix_sync_file,
.sendfile = generic_file_sendfile,
diff -purN linux-2.5/fs/nfs/file.c aio-2.5/fs/nfs/file.c
--- linux-2.5/fs/nfs/file.c Tue Apr 1 15:17:56 2003
+++ aio-2.5/fs/nfs/file.c Mon Mar 24 17:26:18 2003
@@ -36,17 +36,15 @@
static int nfs_file_mmap(struct file *, struct vm_area_struct *);
static ssize_t nfs_file_sendfile(struct file *, loff_t *, size_t, read_actor_t, void *);
-static ssize_t nfs_file_read(struct kiocb *, char *, size_t, loff_t);
-static ssize_t nfs_file_write(struct kiocb *, const char *, size_t, loff_t);
+static ssize_t nfs_file_read(struct rw_iocb *);
+static ssize_t nfs_file_write(struct rw_iocb *);
static int nfs_file_flush(struct file *);
static int nfs_fsync(struct file *, struct dentry *dentry, int datasync);
struct file_operations nfs_file_operations = {
.llseek = remote_llseek,
- .read = do_sync_read,
- .write = do_sync_write,
- .aio_read = nfs_file_read,
- .aio_write = nfs_file_write,
+ .aio_read = nfs_file_read,
+ .aio_write = nfs_file_write,
.mmap = nfs_file_mmap,
.open = nfs_open,
.flush = nfs_file_flush,
@@ -88,19 +86,19 @@ nfs_file_flush(struct file *file)
}
static ssize_t
-nfs_file_read(struct kiocb *iocb, char * buf, size_t count, loff_t pos)
+nfs_file_read(struct rw_iocb *iocb)
{
- struct dentry * dentry = iocb->ki_filp->f_dentry;
+ struct dentry * dentry = iocb->kiocb.ki_filp->f_dentry;
struct inode * inode = dentry->d_inode;
ssize_t result;
dfprintk(VFS, "nfs: read(%s/%s, %lu@%lu)\n",
dentry->d_parent->d_name.name, dentry->d_name.name,
- (unsigned long) count, (unsigned long) pos);
+ (unsigned long)iocb->rw_iov->iov_len, (unsigned long)iocb->rw_pos);
result = nfs_revalidate_inode(NFS_SERVER(inode), inode);
if (!result)
- result = generic_file_aio_read(iocb, buf, count, pos);
+ result = generic_file_aio_read(iocb);
return result;
}
@@ -202,15 +200,16 @@ struct address_space_operations nfs_file
* Write to a file (through the page cache).
*/
static ssize_t
-nfs_file_write(struct kiocb *iocb, const char *buf, size_t count, loff_t pos)
+nfs_file_write(struct rw_iocb *iocb)
{
- struct dentry * dentry = iocb->ki_filp->f_dentry;
+ struct dentry * dentry = iocb->kiocb.ki_filp->f_dentry;
struct inode * inode = dentry->d_inode;
ssize_t result;
dfprintk(VFS, "nfs: write(%s/%s(%ld), %lu@%lu)\n",
dentry->d_parent->d_name.name, dentry->d_name.name,
- inode->i_ino, (unsigned long) count, (unsigned long) pos);
+ inode->i_ino, (unsigned long)iocb->rw_iov->iov_len,
+ (unsigned long)iocb->rw_pos);
result = -EBUSY;
if (IS_SWAPFILE(inode))
@@ -219,11 +218,7 @@ nfs_file_write(struct kiocb *iocb, const
if (result)
goto out;
- result = count;
- if (!count)
- goto out;
-
- result = generic_file_aio_write(iocb, buf, count, pos);
+ result = generic_file_aio_write(iocb);
out:
return result;
diff -purN linux-2.5/fs/nfs/write.c aio-2.5/fs/nfs/write.c
--- linux-2.5/fs/nfs/write.c Tue Apr 1 15:17:56 2003
+++ aio-2.5/fs/nfs/write.c Mon Mar 24 11:25:01 2003
@@ -28,7 +28,7 @@
*
* - A write request is in progress.
* - A user process is in generic_file_write/nfs_update_page
- * - A user process is in generic_file_read
+ * - A user process is in generic_file_aio_read
*
* Also note that because of the way pages are invalidated in
* nfs_revalidate_inode, the following assertions hold:
@@ -645,7 +645,7 @@ nfs_flush_incompatible(struct file *file
/*
* Update and possibly write a cached page of an NFS file.
*
- * XXX: Keep an eye on generic_file_read to make sure it doesn't do bad
+ * XXX: Keep an eye on generic_file_aio_read to make sure it doesn't do bad
* things with a page scheduled for an RPC call (e.g. invalidate it).
*/
int
diff -purN linux-2.5/fs/ntfs/aops.c aio-2.5/fs/ntfs/aops.c
--- linux-2.5/fs/ntfs/aops.c Tue Apr 1 15:17:56 2003
+++ aio-2.5/fs/ntfs/aops.c Mon Mar 24 11:25:01 2003
@@ -157,7 +157,7 @@ still_busy:
* unlocking it.
*
* We only enforce allocated_size limit because i_size is checked for in
- * generic_file_read().
+ * generic_file_aio_read().
*
* Return 0 on success and -errno on error.
*
diff -purN linux-2.5/fs/ntfs/file.c aio-2.5/fs/ntfs/file.c
--- linux-2.5/fs/ntfs/file.c Tue Apr 1 15:17:56 2003
+++ aio-2.5/fs/ntfs/file.c Mon Mar 24 11:46:18 2003
@@ -50,9 +50,9 @@ static int ntfs_file_open(struct inode *
struct file_operations ntfs_file_ops = {
.llseek = generic_file_llseek, /* Seek inside file. */
- .read = generic_file_read, /* Read from file. */
+ .aio_read = generic_file_aio_read,/* Read from file. */
#ifdef NTFS_RW
- .write = generic_file_write, /* Write to a file. */
+ .aio_write = generic_file_aio_write,/* Write to a file. */
#endif
.mmap = generic_file_mmap, /* Mmap file. */
.sendfile = generic_file_sendfile,/* Zero-copy data send with the
diff -purN linux-2.5/fs/qnx4/file.c aio-2.5/fs/qnx4/file.c
--- linux-2.5/fs/qnx4/file.c Tue Apr 1 15:17:56 2003
+++ aio-2.5/fs/qnx4/file.c Mon Mar 24 11:25:01 2003
@@ -25,11 +25,11 @@
struct file_operations qnx4_file_operations =
{
.llseek = generic_file_llseek,
- .read = generic_file_read,
+ .aio_read = generic_file_aio_read,
.mmap = generic_file_mmap,
.sendfile = generic_file_sendfile,
#ifdef CONFIG_QNX4FS_RW
- .write = generic_file_write,
+ .aio_write = generic_file_aio_write,
.fsync = qnx4_sync_file,
#endif
};
diff -purN linux-2.5/fs/ramfs/inode.c aio-2.5/fs/ramfs/inode.c
--- linux-2.5/fs/ramfs/inode.c Tue Apr 1 15:17:56 2003
+++ aio-2.5/fs/ramfs/inode.c Mon Mar 24 11:25:01 2003
@@ -141,8 +141,8 @@ static struct address_space_operations r
};
static struct file_operations ramfs_file_operations = {
- .read = generic_file_read,
- .write = generic_file_write,
+ .aio_read = generic_file_aio_read,
+ .aio_write = generic_file_aio_write,
.mmap = generic_file_mmap,
.fsync = simple_sync_file,
.sendfile = generic_file_sendfile,
diff -purN linux-2.5/fs/read_write.c aio-2.5/fs/read_write.c
--- linux-2.5/fs/read_write.c Tue Mar 25 16:50:38 2003
+++ aio-2.5/fs/read_write.c Tue Mar 25 16:17:41 2003
@@ -18,7 +18,7 @@
struct file_operations generic_ro_fops = {
.llseek = generic_file_llseek,
- .read = generic_file_read,
+ .aio_read = generic_file_aio_read,
.mmap = generic_file_readonly_mmap,
.sendfile = generic_file_sendfile,
};
@@ -167,28 +167,44 @@ bad:
}
#endif
-ssize_t do_sync_read(struct file *filp, char *buf, size_t len, loff_t *ppos)
+static ssize_t do_sync_rwv(struct file *filp, char *buf, size_t tot_len,
+ struct iovec *iov, unsigned nr_segs, loff_t *ppos,
+ ssize_t (*op)(struct rw_iocb *), int rw)
{
- struct kiocb kiocb;
+ struct sync_iocb sync_iocb;
+ struct rw_iocb *iocb = kiocb_to_rw_iocb(&sync_iocb.kiocb);
ssize_t ret;
- init_sync_kiocb(&kiocb, filp);
- kiocb.ki_pos = *ppos;
- ret = filp->f_op->aio_read(&kiocb, buf, len, kiocb.ki_pos);
+ init_sync_kiocb(&iocb->kiocb, filp);
+ iocb->rw = rw;
+ iocb->rw_pos = *ppos;
+ iocb->rw_nsegs = nr_segs;
+ iocb->rw_iov = (NULL == iov) ? &iocb->rw_local_iov : iov;
+ iocb->rw_local_iov.iov_base = buf;
+ iocb->rw_local_iov.iov_len = tot_len;
+
+ ret = op(iocb);
if (-EIOCBQUEUED == ret)
- ret = wait_on_sync_kiocb(&kiocb);
- *ppos = kiocb.ki_pos;
+ ret = wait_on_sync_kiocb(&iocb->kiocb);
+ *ppos = iocb->rw_pos;
return ret;
}
+ssize_t do_sync_rw(struct file *filp, char *buf, size_t count, loff_t *ppos,
+ ssize_t (*op)(struct rw_iocb *), int rw)
+{
+ return do_sync_rwv(filp, buf, count, NULL, 1, ppos, op, rw);
+}
+
ssize_t vfs_read(struct file *file, char *buf, size_t count, loff_t *pos)
{
struct inode *inode = file->f_dentry->d_inode;
ssize_t ret;
- if (!(file->f_mode & FMODE_READ))
+ if (unlikely(!(file->f_mode & FMODE_READ)))
return -EBADF;
- if (!file->f_op || (!file->f_op->read && !file->f_op->aio_read))
+ if (unlikely(!file->f_op ||
+ (!file->f_op->read && !file->f_op->aio_read)))
return -EINVAL;
ret = locks_verify_area(FLOCK_VERIFY_READ, inode, file, *pos, count);
@@ -198,7 +214,8 @@ ssize_t vfs_read(struct file *file, char
if (file->f_op->read)
ret = file->f_op->read(file, buf, count, pos);
else
- ret = do_sync_read(file, buf, count, pos);
+ ret = do_sync_rw(file, buf, count, pos,
+ file->f_op->aio_read, READ);
if (ret > 0)
dnotify_parent(file->f_dentry, DN_ACCESS);
}
@@ -207,28 +224,15 @@ ssize_t vfs_read(struct file *file, char
return ret;
}
-ssize_t do_sync_write(struct file *filp, const char *buf, size_t len, loff_t *ppos)
-{
- struct kiocb kiocb;
- ssize_t ret;
-
- init_sync_kiocb(&kiocb, filp);
- kiocb.ki_pos = *ppos;
- ret = filp->f_op->aio_write(&kiocb, buf, len, kiocb.ki_pos);
- if (-EIOCBQUEUED == ret)
- ret = wait_on_sync_kiocb(&kiocb);
- *ppos = kiocb.ki_pos;
- return ret;
-}
-
ssize_t vfs_write(struct file *file, const char *buf, size_t count, loff_t *pos)
{
struct inode *inode = file->f_dentry->d_inode;
ssize_t ret;
- if (!(file->f_mode & FMODE_WRITE))
+ if (unlikely(!(file->f_mode & FMODE_WRITE)))
return -EBADF;
- if (!file->f_op || (!file->f_op->write && !file->f_op->aio_write))
+ if (unlikely(!file->f_op ||
+ (!file->f_op->write && !file->f_op->aio_write)))
return -EINVAL;
ret = locks_verify_area(FLOCK_VERIFY_WRITE, inode, file, *pos, count);
@@ -238,7 +242,8 @@ ssize_t vfs_write(struct file *file, con
if (file->f_op->write)
ret = file->f_op->write(file, buf, count, pos);
else
- ret = do_sync_write(file, buf, count, pos);
+ ret = do_sync_rw(file, (char *)buf, count, pos,
+ file->f_op->aio_write, WRITE);
if (ret > 0)
dnotify_parent(file->f_dentry, DN_MODIFY);
}
@@ -331,29 +336,58 @@ unsigned long iov_shorten(struct iovec *
return seg;
}
-static ssize_t do_readv_writev(int type, struct file *file,
+ssize_t compat_rwv(struct file *file, const struct iovec *vector,
+ unsigned long nr_segs, loff_t *ppos,
+ ssize_t (*fn)(struct file *, char *, size_t, loff_t *))
+{
+ ssize_t ret = 0;
+
+ if (NULL == fn)
+ return -EINVAL;
+
+ /* Do it by hand, with file-ops */
+ for (; nr_segs > 0; vector++, nr_segs--) {
+ void *base = vector->iov_base;
+ size_t len = vector->iov_len;
+ ssize_t nr = fn(file, base, len, ppos);
+
+ if (nr < 0) {
+ if (!ret)
+ ret = nr;
+ break;
+ }
+ ret += nr;
+ if (nr != len)
+ break;
+ }
+ return ret;
+}
+
+static ssize_t do_readv_writev(int fmode, int type, struct file *file,
const struct iovec * vector,
unsigned long nr_segs, loff_t *pos)
{
typedef ssize_t (*io_fn_t)(struct file *, char *, size_t, loff_t *);
typedef ssize_t (*iov_fn_t)(struct file *, const struct iovec *, unsigned long, loff_t *);
+ typedef ssize_t (*ioa_fn_t)(struct rw_iocb *);
- size_t tot_len;
+ size_t tot_len = 0;
struct iovec iovstack[UIO_FASTIOV];
- struct iovec *iov=iovstack;
- ssize_t ret;
+ struct iovec *iov = iovstack;
+ ssize_t ret = 0;
int seg;
io_fn_t fn;
iov_fn_t fnv;
- struct inode *inode;
+ ioa_fn_t fna;
+ if (!(file->f_mode & fmode))
+ return -EBADF;
/*
* SuS says "The readv() function *may* fail if the iovcnt argument
* was less than or equal to 0, or greater than {IOV_MAX}. Linux has
* traditionally returned zero for zero segments, so...
*/
- ret = 0;
- if (nr_segs == 0)
+ if (unlikely(nr_segs == 0))
goto out;
/*
@@ -365,6 +399,7 @@ static ssize_t do_readv_writev(int type,
goto out;
if (!file->f_op)
goto out;
+
if (nr_segs > UIO_FASTIOV) {
ret = -ENOMEM;
iov = kmalloc(nr_segs*sizeof(struct iovec), GFP_KERNEL);
@@ -382,7 +417,6 @@ static ssize_t do_readv_writev(int type,
*
* Be careful here because iov_len is a size_t not an ssize_t
*/
- tot_len = 0;
ret = -EINVAL;
for (seg = 0 ; seg < nr_segs; seg++) {
ssize_t tmp = tot_len;
@@ -393,55 +427,32 @@ static ssize_t do_readv_writev(int type,
if (tot_len < tmp) /* maths overflow on the ssize_t */
goto out;
}
- if (tot_len == 0) {
- ret = 0;
+ ret = 0;
+ if (tot_len == 0)
goto out;
- }
- inode = file->f_dentry->d_inode;
/* VERIFY_WRITE actually means a read, as we write to user space */
ret = locks_verify_area((type == READ
? FLOCK_VERIFY_READ : FLOCK_VERIFY_WRITE),
- inode, file, *pos, tot_len);
+ file->f_dentry->d_inode, file, *pos, tot_len);
if (ret)
goto out;
- fnv = NULL;
if (type == READ) {
fn = file->f_op->read;
fnv = file->f_op->readv;
+ fna = file->f_op->aio_read;
} else {
fn = (io_fn_t)file->f_op->write;
fnv = file->f_op->writev;
+ fna = file->f_op->aio_write;
}
- if (fnv) {
+ if (fnv)
ret = fnv(file, iov, nr_segs, pos);
- goto out;
- }
-
- /* Do it by hand, with file-ops */
- ret = 0;
- vector = iov;
- while (nr_segs > 0) {
- void * base;
- size_t len;
- ssize_t nr;
-
- base = vector->iov_base;
- len = vector->iov_len;
- vector++;
- nr_segs--;
-
- nr = fn(file, base, len, pos);
-
- if (nr < 0) {
- if (!ret) ret = nr;
- break;
- }
- ret += nr;
- if (nr != len)
- break;
- }
+ else if (fna)
+ ret = do_sync_rwv(file, NULL, tot_len, iov, nr_segs, pos, fna, type);
+ else
+ ret = compat_rwv(file, iov, nr_segs, pos, fn);
out:
if (iov != iovstack)
kfree(iov);
@@ -454,23 +465,13 @@ out:
ssize_t vfs_readv(struct file *file, const struct iovec *vec,
unsigned long vlen, loff_t *pos)
{
- if (!(file->f_mode & FMODE_READ))
- return -EBADF;
- if (!file->f_op || (!file->f_op->readv && !file->f_op->read))
- return -EINVAL;
-
- return do_readv_writev(READ, file, vec, vlen, pos);
+ return do_readv_writev(FMODE_READ, READ, file, vec, vlen, pos);
}
ssize_t vfs_writev(struct file *file, const struct iovec *vec,
unsigned long vlen, loff_t *pos)
{
- if (!(file->f_mode & FMODE_WRITE))
- return -EBADF;
- if (!file->f_op || (!file->f_op->writev && !file->f_op->write))
- return -EINVAL;
-
- return do_readv_writev(WRITE, file, vec, vlen, pos);
+ return do_readv_writev(FMODE_WRITE, WRITE, file, vec, vlen, pos);
}
@@ -622,5 +623,4 @@ asmlinkage ssize_t sys_sendfile64(int ou
return do_sendfile(out_fd, in_fd, NULL, count, 0);
}
-EXPORT_SYMBOL(do_sync_read);
-EXPORT_SYMBOL(do_sync_write);
+EXPORT_SYMBOL(do_sync_rw);
diff -purN linux-2.5/fs/reiserfs/file.c aio-2.5/fs/reiserfs/file.c
--- linux-2.5/fs/reiserfs/file.c Tue Apr 1 15:17:56 2003
+++ aio-2.5/fs/reiserfs/file.c Mon Mar 24 11:25:01 2003
@@ -141,8 +141,8 @@ out:
}
struct file_operations reiserfs_file_operations = {
- .read = generic_file_read,
- .write = generic_file_write,
+ .aio_read = generic_file_aio_read,
+ .aio_write = generic_file_aio_write,
.ioctl = reiserfs_ioctl,
.mmap = generic_file_mmap,
.release = reiserfs_file_release,
diff -purN linux-2.5/fs/smbfs/file.c aio-2.5/fs/smbfs/file.c
--- linux-2.5/fs/smbfs/file.c Tue Apr 1 15:17:56 2003
+++ aio-2.5/fs/smbfs/file.c Mon Mar 24 11:46:18 2003
@@ -216,13 +216,13 @@ smb_updatepage(struct file *file, struct
}
static ssize_t
-smb_file_read(struct file * file, char * buf, size_t count, loff_t *ppos)
+smb_file_read(struct kiocb *iocb, char *buf, size_t count, loff_t pos)
{
- struct dentry * dentry = file->f_dentry;
+ struct dentry * dentry = iocb->ki_filp->f_dentry;
ssize_t status;
VERBOSE("file %s/%s, count=%lu@%lu\n", DENTRY_PATH(dentry),
- (unsigned long) count, (unsigned long) *ppos);
+ (unsigned long) count, (unsigned long) pos);
status = smb_revalidate_inode(dentry);
if (status) {
@@ -235,7 +235,7 @@ smb_file_read(struct file * file, char *
(long)dentry->d_inode->i_size,
dentry->d_inode->i_flags, dentry->d_inode->i_atime);
- status = generic_file_read(file, buf, count, ppos);
+ status = generic_file_aio_read(iocb, buf, count, pos);
out:
return status;
}
@@ -298,14 +298,14 @@ struct address_space_operations smb_file
* Write to a file (through the page cache).
*/
static ssize_t
-smb_file_write(struct file *file, const char *buf, size_t count, loff_t *ppos)
+smb_file_write(struct kiocb *iocb, const char *buf, size_t count, loff_t pos)
{
- struct dentry * dentry = file->f_dentry;
+ struct dentry * dentry = iocb->ki_filp->f_dentry;
ssize_t result;
VERBOSE("file %s/%s, count=%lu@%lu\n",
DENTRY_PATH(dentry),
- (unsigned long) count, (unsigned long) *ppos);
+ (unsigned long) count, (unsigned long) pos);
result = smb_revalidate_inode(dentry);
if (result) {
@@ -319,7 +319,7 @@ smb_file_write(struct file *file, const
goto out;
if (count > 0) {
- result = generic_file_write(file, buf, count, ppos);
+ result = generic_file_aio_write(iocb, buf, count, pos);
VERBOSE("pos=%ld, size=%ld, mtime=%ld, atime=%ld\n",
(long) file->f_pos, (long) dentry->d_inode->i_size,
dentry->d_inode->i_mtime, dentry->d_inode->i_atime);
@@ -384,8 +384,8 @@ smb_file_permission(struct inode *inode,
struct file_operations smb_file_operations =
{
.llseek = remote_llseek,
- .read = smb_file_read,
- .write = smb_file_write,
+ .aio_read = smb_file_read,
+ .aio_write = smb_file_write,
.ioctl = smb_ioctl,
.mmap = smb_file_mmap,
.open = smb_file_open,
diff -purN linux-2.5/fs/sysv/file.c aio-2.5/fs/sysv/file.c
--- linux-2.5/fs/sysv/file.c Tue Apr 1 15:17:57 2003
+++ aio-2.5/fs/sysv/file.c Mon Mar 24 11:25:05 2003
@@ -21,8 +21,8 @@
*/
struct file_operations sysv_file_operations = {
.llseek = generic_file_llseek,
- .read = generic_file_read,
- .write = generic_file_write,
+ .aio_read = generic_file_aio_read,
+ .aio_write = generic_file_aio_write,
.mmap = generic_file_mmap,
.fsync = sysv_sync_file,
.sendfile = generic_file_sendfile,
diff -purN linux-2.5/fs/udf/file.c aio-2.5/fs/udf/file.c
--- linux-2.5/fs/udf/file.c Tue Apr 1 15:17:57 2003
+++ aio-2.5/fs/udf/file.c Mon Mar 24 11:25:05 2003
@@ -109,19 +109,17 @@ struct address_space_operations udf_adin
.commit_write = udf_adinicb_commit_write,
};
-static ssize_t udf_file_write(struct file * file, const char * buf,
- size_t count, loff_t *ppos)
+static ssize_t udf_file_write(struct kiocb *iocb, const char * buf,
+ size_t count, loff_t pos)
{
ssize_t retval;
- struct inode *inode = file->f_dentry->d_inode;
- int err, pos;
+ struct inode *inode = iocb->ki_filp->f_dentry->d_inode;
+ int err;
if (UDF_I_ALLOCTYPE(inode) == ICBTAG_FLAG_AD_IN_ICB)
{
- if (file->f_flags & O_APPEND)
+ if (iocb->ki_filp->f_flags & O_APPEND)
pos = inode->i_size;
- else
- pos = *ppos;
if (inode->i_sb->s_blocksize < (udf_file_entry_alloc_offset(inode) +
pos + count))
@@ -142,7 +140,7 @@ static ssize_t udf_file_write(struct fil
}
}
- retval = generic_file_write(file, buf, count, ppos);
+ retval = generic_file_aio_write(iocb, buf, count, pos);
if (retval > 0)
mark_inode_dirty(inode);
@@ -275,11 +273,11 @@ static int udf_open_file(struct inode *
}
struct file_operations udf_file_operations = {
- .read = generic_file_read,
+ .aio_read = generic_file_aio_read,
+ .aio_write = udf_file_write,
.ioctl = udf_ioctl,
.open = udf_open_file,
.mmap = generic_file_mmap,
- .write = udf_file_write,
.release = udf_release_file,
.fsync = udf_fsync_file,
.sendfile = generic_file_sendfile,
diff -purN linux-2.5/fs/ufs/file.c aio-2.5/fs/ufs/file.c
--- linux-2.5/fs/ufs/file.c Tue Apr 1 15:17:57 2003
+++ aio-2.5/fs/ufs/file.c Mon Mar 24 11:25:05 2003
@@ -43,8 +43,8 @@
struct file_operations ufs_file_operations = {
.llseek = generic_file_llseek,
- .read = generic_file_read,
- .write = generic_file_write,
+ .aio_read = generic_file_aio_read,
+ .aio_write = generic_file_aio_write,
.mmap = generic_file_mmap,
.open = generic_file_open,
.sendfile = generic_file_sendfile,
diff -purN linux-2.5/include/linux/aio.h aio-2.5/include/linux/aio.h
--- linux-2.5/include/linux/aio.h Tue Apr 1 15:18:08 2003
+++ aio-2.5/include/linux/aio.h Tue Mar 25 15:35:56 2003
@@ -4,6 +4,7 @@
#include <linux/list.h>
#include <linux/workqueue.h>
#include <linux/aio_abi.h>
+#include <linux/uio.h>
#include <asm/atomic.h>
@@ -61,22 +62,68 @@ struct kiocb {
void *ki_user_obj; /* pointer to userland's iocb */
__u64 ki_user_data; /* user's data for completion */
- loff_t ki_pos;
- char private[KIOCB_PRIVATE_SIZE];
+ long private[0]; /* KIOCB_PRIVATE_SIZE alloc'd */
};
-#define is_sync_kiocb(iocb) ((iocb)->ki_key == KIOCB_SYNC_KEY)
+struct rw_iocb {
+ struct kiocb kiocb;
+ loff_t rw_pos;
+ unsigned long rw_nsegs;
+
+ struct iovec *rw_iov;
+ struct iovec rw_local_iov;
+
+ unsigned rw : 2, /* READ or WRITE */
+ rw_have_i_sem : 1; /* true if we hold i_sem */
+};
+
+struct fsync_iocb {
+ struct kiocb kiocb;
+ unsigned dsync;
+};
+
+struct sync_iocb {
+ union {
+ struct kiocb kiocb;
+ struct rw_iocb rw_iocb;
+ struct fsync_iocb fsync_iocb;
+ };
+ char private[KIOCB_PRIVATE_SIZE];
+};
+
+typedef union {
+ struct kiocb kiocb;
+ struct rw_iocb rw_iocb;
+ struct sync_iocb sync_iocb;
+} iocb_t;
+
+static inline struct rw_iocb *kiocb_to_rw_iocb(struct kiocb *iocb)
+{
+ return (struct rw_iocb *)iocb;
+}
+
+static inline struct fsync_iocb *kiocb_to_fsync_iocb(struct kiocb *iocb)
+{
+ return (struct fsync_iocb *)iocb;
+}
+
+static inline int is_sync_kiocb(struct kiocb *iocb)
+{
+ return iocb->ki_key == KIOCB_SYNC_KEY;
+}
+
#define init_sync_kiocb(x, filp) \
do { \
+ struct kiocb *__iocb = (x); \
struct task_struct *tsk = current; \
- (x)->ki_flags = 0; \
- (x)->ki_users = 1; \
- (x)->ki_key = KIOCB_SYNC_KEY; \
- (x)->ki_filp = (filp); \
- (x)->ki_ctx = &tsk->active_mm->default_kioctx; \
- (x)->ki_cancel = NULL; \
- (x)->ki_user_obj = tsk; \
+ __iocb->ki_flags = 0; \
+ __iocb->ki_users = 1; \
+ __iocb->ki_key = KIOCB_SYNC_KEY; \
+ __iocb->ki_filp = (filp); \
+ __iocb->ki_ctx = &tsk->active_mm->default_kioctx;\
+ __iocb->ki_cancel = NULL; \
+ __iocb->ki_user_obj = tsk; \
} while (0)
#define AIO_RING_MAGIC 0xa10a10a1
diff -purN linux-2.5/include/linux/fs.h aio-2.5/include/linux/fs.h
--- linux-2.5/include/linux/fs.h Tue Mar 25 16:38:50 2003
+++ aio-2.5/include/linux/fs.h Tue Mar 25 15:29:13 2003
@@ -21,6 +21,8 @@
#include <linux/kobject.h>
#include <asm/atomic.h>
+struct rw_iocb;
+struct fsync_iocb;
struct iovec;
struct nameidata;
struct pipe_inode_info;
@@ -305,7 +307,7 @@ struct address_space_operations {
sector_t (*bmap)(struct address_space *, sector_t);
int (*invalidatepage) (struct page *, unsigned long);
int (*releasepage) (struct page *, int);
- int (*direct_IO)(int, struct kiocb *, const struct iovec *iov,
+ int (*direct_IO)(int, struct rw_iocb *, const struct iovec *iov,
loff_t offset, unsigned long nr_segs);
};
@@ -706,9 +708,9 @@ struct file_operations {
struct module *owner;
loff_t (*llseek) (struct file *, loff_t, int);
ssize_t (*read) (struct file *, char *, size_t, loff_t *);
- ssize_t (*aio_read) (struct kiocb *, char *, size_t, loff_t);
+ ssize_t (*aio_read) (struct rw_iocb *);
ssize_t (*write) (struct file *, const char *, size_t, loff_t *);
- ssize_t (*aio_write) (struct kiocb *, const char *, size_t, loff_t);
+ ssize_t (*aio_write) (struct rw_iocb *);
int (*readdir) (struct file *, void *, filldir_t);
unsigned int (*poll) (struct file *, struct poll_table_struct *);
int (*ioctl) (struct inode *, struct file *, unsigned int, unsigned long);
@@ -717,7 +719,7 @@ struct file_operations {
int (*flush) (struct file *);
int (*release) (struct inode *, struct file *);
int (*fsync) (struct file *, struct dentry *, int datasync);
- int (*aio_fsync) (struct kiocb *, int datasync);
+ int (*aio_fsync) (struct fsync_iocb *);
int (*fasync) (int, struct file *, int);
int (*lock) (struct file *, int, struct file_lock *);
ssize_t (*readv) (struct file *, const struct iovec *, unsigned long, loff_t *);
@@ -1200,32 +1202,22 @@ extern int generic_file_mmap(struct file
extern int generic_file_readonly_mmap(struct file *, struct vm_area_struct *);
extern int file_read_actor(read_descriptor_t * desc, struct page *page, unsigned long offset, unsigned long size);
extern int file_send_actor(read_descriptor_t * desc, struct page *page, unsigned long offset, unsigned long size);
-extern ssize_t generic_file_read(struct file *, char *, size_t, loff_t *);
int generic_write_checks(struct inode *inode, struct file *file,
loff_t *pos, size_t *count, int isblk);
-extern ssize_t generic_file_write(struct file *, const char *, size_t, loff_t *);
-extern ssize_t generic_file_aio_read(struct kiocb *, char *, size_t, loff_t);
-extern ssize_t generic_file_aio_write(struct kiocb *, const char *, size_t, loff_t);
-extern ssize_t generic_file_aio_write_nolock(struct kiocb *, const struct iovec *,
- unsigned long, loff_t *);
-extern ssize_t do_sync_read(struct file *filp, char *buf, size_t len, loff_t *ppos);
-extern ssize_t do_sync_write(struct file *filp, const char *buf, size_t len, loff_t *ppos);
-ssize_t generic_file_write_nolock(struct file *file, const struct iovec *iov,
- unsigned long nr_segs, loff_t *ppos);
+extern ssize_t generic_file_aio_read(struct rw_iocb *);
+extern ssize_t generic_file_aio_write(struct rw_iocb *);
+extern ssize_t generic_file_aio_write_nolock(struct rw_iocb *);
+extern ssize_t FASTCALL(do_sync_rw(struct file *filp, char *buf, size_t len, loff_t *ppos, ssize_t (*op)(struct rw_iocb *), int type));
extern ssize_t generic_file_sendfile(struct file *, loff_t *, size_t, read_actor_t, void *);
extern void do_generic_mapping_read(struct address_space *, struct file_ra_state *, struct file *,
loff_t *, read_descriptor_t *, read_actor_t);
extern void
file_ra_state_init(struct file_ra_state *ra, struct address_space *mapping);
-extern ssize_t generic_file_direct_IO(int rw, struct kiocb *iocb,
+extern ssize_t generic_file_direct_IO(int rw, struct rw_iocb *iocb,
const struct iovec *iov, loff_t offset, unsigned long nr_segs);
-extern int blockdev_direct_IO(int rw, struct kiocb *iocb, struct inode *inode,
+extern int blockdev_direct_IO(int rw, struct rw_iocb *iocb, struct inode *inode,
struct block_device *bdev, const struct iovec *iov, loff_t offset,
unsigned long nr_segs, get_blocks_t *get_blocks);
-extern ssize_t generic_file_readv(struct file *filp, const struct iovec *iov,
- unsigned long nr_segs, loff_t *ppos);
-ssize_t generic_file_writev(struct file *filp, const struct iovec *iov,
- unsigned long nr_segs, loff_t *ppos);
extern loff_t no_llseek(struct file *file, loff_t offset, int origin);
extern loff_t generic_file_llseek(struct file *file, loff_t offset, int origin);
extern loff_t remote_llseek(struct file *file, loff_t offset, int origin);
diff -purN linux-2.5/include/net/sock.h aio-2.5/include/net/sock.h
--- linux-2.5/include/net/sock.h Tue Apr 1 15:18:12 2003
+++ aio-2.5/include/net/sock.h Mon Mar 24 17:26:21 2003
@@ -302,6 +302,7 @@ static __inline__ void sock_prot_dec_use
/* sock_iocb: used to kick off async processing of socket ios */
struct sock_iocb {
+ struct rw_iocb iocb;
struct list_head list;
int flags;
@@ -310,18 +311,23 @@ struct sock_iocb {
struct sock *sk;
struct scm_cookie *scm;
struct msghdr *msg, async_msg;
- struct iovec async_iov;
};
static inline struct sock_iocb *kiocb_to_siocb(struct kiocb *iocb)
{
- BUG_ON(sizeof(struct sock_iocb) > KIOCB_PRIVATE_SIZE);
- return (struct sock_iocb *)iocb->private;
+ BUG_ON(sizeof(struct sock_iocb) > sizeof(iocb_t));
+ return (struct sock_iocb *)iocb;
+}
+
+static inline struct sock_iocb *rw_iocb_to_siocb(struct rw_iocb *iocb)
+{
+ BUG_ON(sizeof(struct sock_iocb) > sizeof(iocb_t));
+ return (struct sock_iocb *)iocb;
}
static inline struct kiocb *siocb_to_kiocb(struct sock_iocb *si)
{
- return container_of((void *)si, struct kiocb, private);
+ return &si->iocb.kiocb;
}
struct socket_alloc {
diff -purN linux-2.5/kernel/ksyms.c aio-2.5/kernel/ksyms.c
--- linux-2.5/kernel/ksyms.c Tue Apr 1 15:18:13 2003
+++ aio-2.5/kernel/ksyms.c Mon Mar 24 12:13:45 2003
@@ -226,12 +226,9 @@ EXPORT_SYMBOL(cont_prepare_write);
EXPORT_SYMBOL(generic_commit_write);
EXPORT_SYMBOL(block_truncate_page);
EXPORT_SYMBOL(generic_block_bmap);
-EXPORT_SYMBOL(generic_file_read);
EXPORT_SYMBOL(generic_file_sendfile);
EXPORT_SYMBOL(do_generic_mapping_read);
EXPORT_SYMBOL(file_ra_state_init);
-EXPORT_SYMBOL(generic_file_write);
-EXPORT_SYMBOL(generic_file_write_nolock);
EXPORT_SYMBOL(generic_file_mmap);
EXPORT_SYMBOL(generic_file_readonly_mmap);
EXPORT_SYMBOL(generic_ro_fops);
@@ -356,8 +353,6 @@ EXPORT_SYMBOL(ioctl_by_bdev);
EXPORT_SYMBOL(read_dev_sector);
EXPORT_SYMBOL(init_buffer);
EXPORT_SYMBOL_GPL(generic_file_direct_IO);
-EXPORT_SYMBOL(generic_file_readv);
-EXPORT_SYMBOL(generic_file_writev);
EXPORT_SYMBOL(iov_shorten);
EXPORT_SYMBOL_GPL(default_backing_dev_info);
diff -purN linux-2.5/mm/filemap.c aio-2.5/mm/filemap.c
--- linux-2.5/mm/filemap.c Tue Apr 1 15:18:14 2003
+++ aio-2.5/mm/filemap.c Mon Mar 24 17:24:19 2003
@@ -711,18 +711,16 @@ success:
* This is the "read()" routine for all filesystems
* that can use the page cache directly.
*/
-static ssize_t
-__generic_file_aio_read(struct kiocb *iocb, const struct iovec *iov,
- unsigned long nr_segs, loff_t *ppos)
+ssize_t generic_file_aio_read(struct rw_iocb *iocb)
{
- struct file *filp = iocb->ki_filp;
+ struct file *filp = iocb->kiocb.ki_filp;
ssize_t retval;
- unsigned long seg;
+ unsigned long seg, nr_segs = iocb->rw_nsegs;
size_t count;
count = 0;
for (seg = 0; seg < nr_segs; seg++) {
- const struct iovec *iv = &iov[seg];
+ const struct iovec *iv = &iocb->rw_iov[seg];
/*
* If any segment has a negative length, or the cumulative
@@ -742,7 +740,7 @@ __generic_file_aio_read(struct kiocb *io
/* coalesce the iovecs and go direct-to-BIO for O_DIRECT */
if (filp->f_flags & O_DIRECT) {
- loff_t pos = *ppos, size;
+ loff_t pos = iocb->rw_pos, size;
struct address_space *mapping;
struct inode *inode;
@@ -754,11 +752,11 @@ __generic_file_aio_read(struct kiocb *io
size = inode->i_size;
if (pos < size) {
retval = generic_file_direct_IO(READ, iocb,
- iov, pos, nr_segs);
- if (retval >= 0 && !is_sync_kiocb(iocb))
+ iocb->rw_iov, pos, nr_segs);
+ if (retval >= 0 && !is_sync_kiocb(&iocb->kiocb))
retval = -EIOCBQUEUED;
if (retval > 0)
- *ppos = pos + retval;
+ iocb->rw_pos = pos + retval;
}
UPDATE_ATIME(filp->f_dentry->d_inode);
goto out;
@@ -770,12 +768,12 @@ __generic_file_aio_read(struct kiocb *io
read_descriptor_t desc;
desc.written = 0;
- desc.buf = iov[seg].iov_base;
- desc.count = iov[seg].iov_len;
+ desc.buf = iocb->rw_iov[seg].iov_base;
+ desc.count = iocb->rw_iov[seg].iov_len;
if (desc.count == 0)
continue;
desc.error = 0;
- do_generic_file_read(filp,ppos,&desc,file_read_actor);
+ do_generic_file_read(filp,&iocb->rw_pos,&desc,file_read_actor);
retval += desc.written;
if (!retval) {
retval = desc.error;
@@ -787,30 +785,8 @@ out:
return retval;
}
-ssize_t
-generic_file_aio_read(struct kiocb *iocb, char *buf, size_t count, loff_t pos)
-{
- struct iovec local_iov = { .iov_base = buf, .iov_len = count };
-
- BUG_ON(iocb->ki_pos != pos);
- return __generic_file_aio_read(iocb, &local_iov, 1, &iocb->ki_pos);
-}
EXPORT_SYMBOL(generic_file_aio_read);
-ssize_t
-generic_file_read(struct file *filp, char *buf, size_t count, loff_t *ppos)
-{
- struct iovec local_iov = { .iov_base = buf, .iov_len = count };
- struct kiocb kiocb;
- ssize_t ret;
-
- init_sync_kiocb(&kiocb, filp);
- ret = __generic_file_aio_read(&kiocb, &local_iov, 1, ppos);
- if (-EIOCBQUEUED == ret)
- ret = wait_on_sync_kiocb(&kiocb);
- return ret;
-}
-
int file_send_actor(read_descriptor_t * desc, struct page *page, unsigned long offset, unsigned long size)
{
ssize_t written;
@@ -1572,11 +1548,9 @@ EXPORT_SYMBOL(generic_write_checks);
* it for writing by marking it dirty.
* okir@monad.swb.de
*/
-ssize_t
-generic_file_aio_write_nolock(struct kiocb *iocb, const struct iovec *iov,
- unsigned long nr_segs, loff_t *ppos)
+ssize_t generic_file_aio_write_nolock(struct rw_iocb *iocb)
{
- struct file *file = iocb->ki_filp;
+ struct file *file = iocb->kiocb.ki_filp;
struct address_space * mapping = file->f_dentry->d_inode->i_mapping;
struct address_space_operations *a_ops = mapping->a_ops;
size_t ocount; /* original count */
@@ -1591,14 +1565,14 @@ generic_file_aio_write_nolock(struct kio
ssize_t err;
size_t bytes;
struct pagevec lru_pvec;
- const struct iovec *cur_iov = iov; /* current iovec */
+ const struct iovec *cur_iov = iocb->rw_iov; /* current iovec */
size_t iov_base = 0; /* offset in the current iovec */
- unsigned long seg;
+ unsigned long seg, nr_segs = iocb->rw_nsegs;
char *buf;
ocount = 0;
for (seg = 0; seg < nr_segs; seg++) {
- const struct iovec *iv = &iov[seg];
+ const struct iovec *iv = &iocb->rw_iov[seg];
/*
* If any segment has a negative length, or the cumulative
@@ -1617,7 +1591,7 @@ generic_file_aio_write_nolock(struct kio
}
count = ocount;
- pos = *ppos;
+ pos = iocb->rw_pos;
pagevec_init(&lru_pvec, 0);
/* We can write back this queue in page reclaim */
@@ -1638,17 +1612,17 @@ generic_file_aio_write_nolock(struct kio
/* coalesce the iovecs and go direct-to-BIO for O_DIRECT */
if (unlikely(file->f_flags & O_DIRECT)) {
if (count != ocount)
- nr_segs = iov_shorten((struct iovec *)iov,
+ nr_segs = iov_shorten(iocb->rw_iov,
nr_segs, count);
written = generic_file_direct_IO(WRITE, iocb,
- iov, pos, nr_segs);
+ iocb->rw_iov, pos, nr_segs);
if (written > 0) {
loff_t end = pos + written;
if (end > inode->i_size && !isblk) {
inode->i_size = end;
mark_inode_dirty(inode);
}
- *ppos = end;
+ iocb->rw_pos = end;
}
/*
* Sync the fs metadata but not the minor inode changes and
@@ -1656,12 +1630,12 @@ generic_file_aio_write_nolock(struct kio
*/
if (written >= 0 && file->f_flags & O_SYNC)
status = generic_osync_inode(inode, OSYNC_METADATA);
- if (written >= 0 && !is_sync_kiocb(iocb))
+ if (written >= 0 && !is_sync_kiocb(&iocb->kiocb))
written = -EIOCBQUEUED;
goto out_status;
}
- buf = iov->iov_base;
+ buf = iocb->rw_iov->iov_base;
do {
unsigned long index;
unsigned long offset;
@@ -1732,7 +1706,7 @@ generic_file_aio_write_nolock(struct kio
balance_dirty_pages_ratelimited(mapping);
cond_resched();
} while (count);
- *ppos = pos;
+ iocb->rw_pos = pos;
if (cached_page)
page_cache_release(cached_page);
@@ -1754,84 +1728,23 @@ out:
return err;
}
-ssize_t
-generic_file_write_nolock(struct file *file, const struct iovec *iov,
- unsigned long nr_segs, loff_t *ppos)
-{
- struct kiocb kiocb;
- ssize_t ret;
-
- init_sync_kiocb(&kiocb, file);
- ret = generic_file_aio_write_nolock(&kiocb, iov, nr_segs, ppos);
- if (-EIOCBQUEUED == ret)
- ret = wait_on_sync_kiocb(&kiocb);
- return ret;
-}
-
-ssize_t generic_file_aio_write(struct kiocb *iocb, const char *buf,
- size_t count, loff_t pos)
+ssize_t generic_file_aio_write(struct rw_iocb *iocb)
{
- struct file *file = iocb->ki_filp;
- struct inode *inode = file->f_dentry->d_inode->i_mapping->host;
ssize_t err;
- struct iovec local_iov = { .iov_base = (void *)buf, .iov_len = count };
-
- BUG_ON(iocb->ki_pos != pos);
-
- down(&inode->i_sem);
- err = generic_file_aio_write_nolock(iocb, &local_iov, 1,
- &iocb->ki_pos);
- up(&inode->i_sem);
+ down(&iocb->kiocb.ki_filp->f_dentry->d_inode->i_sem);
+ err = generic_file_aio_write_nolock(iocb);
+ up(&iocb->kiocb.ki_filp->f_dentry->d_inode->i_sem);
return err;
}
EXPORT_SYMBOL(generic_file_aio_write);
EXPORT_SYMBOL(generic_file_aio_write_nolock);
-ssize_t generic_file_write(struct file *file, const char *buf,
- size_t count, loff_t *ppos)
-{
- struct inode *inode = file->f_dentry->d_inode->i_mapping->host;
- ssize_t err;
- struct iovec local_iov = { .iov_base = (void *)buf, .iov_len = count };
-
- down(&inode->i_sem);
- err = generic_file_write_nolock(file, &local_iov, 1, ppos);
- up(&inode->i_sem);
-
- return err;
-}
-
-ssize_t generic_file_readv(struct file *filp, const struct iovec *iov,
- unsigned long nr_segs, loff_t *ppos)
-{
- struct kiocb kiocb;
- ssize_t ret;
-
- init_sync_kiocb(&kiocb, filp);
- ret = __generic_file_aio_read(&kiocb, iov, nr_segs, ppos);
- if (-EIOCBQUEUED == ret)
- ret = wait_on_sync_kiocb(&kiocb);
- return ret;
-}
-
-ssize_t generic_file_writev(struct file *file, const struct iovec *iov,
- unsigned long nr_segs, loff_t * ppos)
-{
- struct inode *inode = file->f_dentry->d_inode;
- ssize_t ret;
-
- down(&inode->i_sem);
- ret = generic_file_write_nolock(file, iov, nr_segs, ppos);
- up(&inode->i_sem);
- return ret;
-}
-
ssize_t
-generic_file_direct_IO(int rw, struct kiocb *iocb, const struct iovec *iov,
+generic_file_direct_IO(int rw, struct rw_iocb *iocb, const struct iovec *iov,
loff_t offset, unsigned long nr_segs)
{
- struct file *file = iocb->ki_filp;
+ struct file *file = iocb->kiocb.ki_filp;
struct address_space *mapping = file->f_dentry->d_inode->i_mapping;
ssize_t retval;
diff -purN linux-2.5/net/socket.c aio-2.5/net/socket.c
--- linux-2.5/net/socket.c Tue Apr 1 15:18:19 2003
+++ aio-2.5/net/socket.c Tue Mar 25 14:29:33 2003
@@ -95,10 +95,8 @@
#include <linux/netfilter.h>
static int sock_no_open(struct inode *irrelevant, struct file *dontcare);
-static ssize_t sock_aio_read(struct kiocb *iocb, char *buf,
- size_t size, loff_t pos);
-static ssize_t sock_aio_write(struct kiocb *iocb, const char *buf,
- size_t size, loff_t pos);
+static ssize_t sock_aio_read(struct rw_iocb *iocb);
+static ssize_t sock_aio_write(struct rw_iocb *iocb);
static int sock_mmap(struct file *file, struct vm_area_struct * vma);
static int sock_close(struct inode *inode, struct file *file);
@@ -528,6 +526,7 @@ static inline int __sock_sendmsg(struct
si->scm = NULL;
si->msg = msg;
si->size = size;
+ si->flags = 0;
err = security_socket_sendmsg(sock, msg, size);
if (err)
@@ -538,13 +537,14 @@ static inline int __sock_sendmsg(struct
int sock_sendmsg(struct socket *sock, struct msghdr *msg, int size)
{
- struct kiocb iocb;
+ struct sync_iocb iocb;
int ret;
- init_sync_kiocb(&iocb, NULL);
- ret = __sock_sendmsg(&iocb, sock, msg, size);
+ init_sync_kiocb(&iocb.kiocb, NULL);
+
+ ret = __sock_sendmsg(&iocb.kiocb, sock, msg, size);
if (-EIOCBQUEUED == ret)
- ret = wait_on_sync_kiocb(&iocb);
+ ret = wait_on_sync_kiocb(&iocb.kiocb);
return ret;
}
@@ -569,13 +569,13 @@ static inline int __sock_recvmsg(struct
int sock_recvmsg(struct socket *sock, struct msghdr *msg, int size, int flags)
{
- struct kiocb iocb;
+ struct sync_iocb iocb;
int ret;
- init_sync_kiocb(&iocb, NULL);
- ret = __sock_recvmsg(&iocb, sock, msg, size, flags);
+ init_sync_kiocb(&iocb.kiocb, NULL);
+ ret = __sock_recvmsg(&iocb.kiocb, sock, msg, size, flags);
if (-EIOCBQUEUED == ret)
- ret = wait_on_sync_kiocb(&iocb);
+ ret = wait_on_sync_kiocb(&iocb.kiocb);
return ret;
}
@@ -584,31 +584,29 @@ int sock_recvmsg(struct socket *sock, st
* area ubuf...ubuf+size-1 is writable before asking the protocol.
*/
-static ssize_t sock_aio_read(struct kiocb *iocb, char *ubuf,
- size_t size, loff_t pos)
+static ssize_t sock_aio_read(struct rw_iocb *iocb)
{
- struct sock_iocb *x = kiocb_to_siocb(iocb);
+ struct sock_iocb *x = rw_iocb_to_siocb(iocb);
struct socket *sock;
int flags;
- if (pos != 0)
+ if (x->iocb.rw_pos != 0)
return -ESPIPE;
- if (size==0) /* Match SYS5 behaviour */
+ if (0 == x->iocb.rw_local_iov.iov_len)
return 0;
- sock = SOCKET_I(iocb->ki_filp->f_dentry->d_inode);
+ sock = SOCKET_I(x->iocb.kiocb.ki_filp->f_dentry->d_inode);
x->async_msg.msg_name = NULL;
x->async_msg.msg_namelen = 0;
- x->async_msg.msg_iov = &x->async_iov;
- x->async_msg.msg_iovlen = 1;
+ x->async_msg.msg_iov = x->iocb.rw_iov;
+ x->async_msg.msg_iovlen = x->iocb.rw_nsegs;
x->async_msg.msg_control = NULL;
x->async_msg.msg_controllen = 0;
- x->async_iov.iov_base = ubuf;
- x->async_iov.iov_len = size;
- flags = !(iocb->ki_filp->f_flags & O_NONBLOCK) ? 0 : MSG_DONTWAIT;
+ flags = !(x->iocb.kiocb.ki_filp->f_flags & O_NONBLOCK) ? 0 : MSG_DONTWAIT;
- return __sock_recvmsg(iocb, sock, &x->async_msg, size, flags);
+ return __sock_recvmsg(&x->iocb.kiocb, sock, &x->async_msg,
+ x->iocb.rw_local_iov.iov_len, flags);
}
@@ -617,32 +615,29 @@ static ssize_t sock_aio_read(struct kioc
* is readable by the user process.
*/
-static ssize_t sock_aio_write(struct kiocb *iocb, const char *ubuf,
- size_t size, loff_t pos)
+static ssize_t sock_aio_write(struct rw_iocb *iocb)
{
- struct sock_iocb *x = kiocb_to_siocb(iocb);
+ struct sock_iocb *x = rw_iocb_to_siocb(iocb);
struct socket *sock;
- if (pos != 0)
+ if (x->iocb.rw_pos != 0)
return -ESPIPE;
- if(size==0) /* Match SYS5 behaviour */
+ if (0 == x->iocb.rw_local_iov.iov_len)
return 0;
- sock = SOCKET_I(iocb->ki_filp->f_dentry->d_inode);
+ sock = SOCKET_I(x->iocb.kiocb.ki_filp->f_dentry->d_inode);
x->async_msg.msg_name = NULL;
x->async_msg.msg_namelen = 0;
- x->async_msg.msg_iov = &x->async_iov;
- x->async_msg.msg_iovlen = 1;
+ x->async_msg.msg_iov = x->iocb.rw_iov;
+ x->async_msg.msg_iovlen = x->iocb.rw_nsegs;
x->async_msg.msg_control = NULL;
x->async_msg.msg_controllen = 0;
- x->async_msg.msg_flags = !(iocb->ki_filp->f_flags & O_NONBLOCK) ? 0 : MSG_DONTWAIT;
+ x->async_msg.msg_flags = !(x->iocb.kiocb.ki_filp->f_flags & O_NONBLOCK) ? 0 : MSG_DONTWAIT;
if (sock->type == SOCK_SEQPACKET)
x->async_msg.msg_flags |= MSG_EOR;
- x->async_iov.iov_base = (void *)ubuf;
- x->async_iov.iov_len = size;
- return __sock_sendmsg(iocb, sock, &x->async_msg, size);
+ return __sock_sendmsg(&iocb->kiocb, sock, &x->async_msg, x->iocb.rw_local_iov.iov_len);
}
ssize_t sock_sendpage(struct file *file, struct page *page,
* Re: [Patch 2/2] Retry based aio read - filesystem read changes
2003-03-31 19:16 ` Benjamin LaHaise
2003-03-31 19:07 ` Janet Morgan
2003-03-31 19:17 ` William Lee Irwin III
@ 2003-04-07 3:51 ` Suparna Bhattacharya
2 siblings, 0 replies; 15+ messages in thread
From: Suparna Bhattacharya @ 2003-04-07 3:51 UTC (permalink / raw)
To: Benjamin LaHaise
Cc: William Lee Irwin III, Janet Morgan, akpm, linux-aio, linux-kernel
On Mon, Mar 31, 2003 at 02:16:29PM -0500, Benjamin LaHaise wrote:
> On Mon, Mar 31, 2003 at 11:11:23AM -0800, William Lee Irwin III wrote:
> > Can you tell whether these are due to hash collisions or contention on
> > the same page?
>
> No, they're most likely waiting for io to complete.
>
> To clean this up I've got a patch to move from aio_read/write with all the
> parameters to a single parameter based rw-specific iocb. That makes the
> retry for read and write more ameniable to sharing common logic akin to the
> wtd_ ops, which we need at the very least for the semaphore operations.
Do you also have a patch for handling semaphore operations?
Regards
Suparna
--
Suparna Bhattacharya (suparna@in.ibm.com)
Linux Technology Center
IBM Software Labs, India
Thread overview: 15+ messages (newest: 2003-04-07)
2003-03-05 9:17 [RFC][Patch] Retry based aio read for filesystems Suparna Bhattacharya
2003-03-05 9:26 ` [Patch 1/2] Retry based aio read - core aio changes Suparna Bhattacharya
2003-03-14 13:23 ` Suparna Bhattacharya
2003-03-05 9:30 ` [Patch 2/2] Retry based aio read - filesystem read changes Suparna Bhattacharya
2003-03-05 10:42 ` Andrew Morton
2003-03-05 12:14 ` Suparna Bhattacharya
2003-03-31 18:32 ` Janet Morgan
2003-03-31 19:11 ` William Lee Irwin III
2003-03-31 19:16 ` Benjamin LaHaise
2003-03-31 19:07 ` Janet Morgan
2003-04-01 20:24 ` Benjamin LaHaise
2003-03-31 19:17 ` William Lee Irwin III
2003-03-31 19:25 ` Benjamin LaHaise
2003-04-07 3:51 ` Suparna Bhattacharya
2003-03-05 23:00 ` [RFC][Patch] Retry based aio read for filesystems Janet Morgan