* [patch] aio: remove aio-max-nr and instead use the memlock rlimit to limit the number of pages pinned for the aio completion ring
@ 2009-03-09 15:49 Jeff Moyer
  2009-03-09 15:54 ` [patch] factor out checks against the memlock rlimit Jeff Moyer
                   ` (3 more replies)
  0 siblings, 4 replies; 49+ messages in thread
From: Jeff Moyer @ 2009-03-09 15:49 UTC (permalink / raw)
  To: linux-aio, zach.brown; +Cc: bcrl, Andrew Morton, linux-kernel

Hi,

Believe it or not, I get numerous questions from customers about the
suggested tuning value of aio-max-nr.  aio-max-nr limits the total
number of io events that can be reserved, system wide, for aio
completions.  Each time io_setup is called, a ring buffer is allocated
that can hold nr_events I/O completions.  That ring buffer is then
mapped into the process' address space, and the pages are pinned in
memory.  So, the reason for this upper limit (I believe) is to keep a
malicious user from pinning all of kernel memory.  Now, this sounds like
a much better job for the memlock rlimit to me, hence the following
patch.

It's worth noting that, by default, aio-max-nr was set to 65536
requests.  By default, the memlock rlimit is set to 64 KB per process.
With this patch in place, a single process can specify at most 2045 for
the nr_events parameter of io_setup.  Or, in the worst case, a process
can create only 16 io contexts, each with a single event (since the
minimum ring size is one page).
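
As a rough cross-check of those numbers (assuming 4 KiB pages and the
usual 32-byte struct aio_ring header and 32-byte struct io_event
entries):

  64 KB memlock limit                = 16 pages = 65536 bytes
  (65536 - 32) / 32                  = 2047 ring slots
  2047 - 2 (head/tail overlap slots) = 2045 usable events
  a one-event ring still rounds up to a full page, so 16 pages
  allow at most 16 contexts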

I created a simple program that sets the process' memlock rlimit and
then drops the CAP_IPC_LOCK capability (the memlock rlimit is a command
line parameter, so you can see how the nr_events allowable changes with
different limits in place).  Then, it calls io_setup/io_destroy in a
loop, increasing nr_events until it fails.  Finally, it calls io_setup
in a loop with a single event to see how many iterations it can perform
before receiving -EAGAIN.  I ran the test on a patched kernel and the
results are in line with expectations.
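
For reference, the core of that test looks roughly like the sketch below
(a simplified illustration against libaio, not the actual program linked
below; the rlimit value is hard-coded and the CAP_IPC_LOCK drop is
omitted).  Build with -laio.

#include <libaio.h>
#include <stdio.h>
#include <sys/resource.h>

int main(void)
{
	struct rlimit rl = { .rlim_cur = 64 * 1024, .rlim_max = 64 * 1024 };
	io_context_t ctxs[1024];
	io_context_t ctx;
	unsigned nr;
	int i, n;

	if (setrlimit(RLIMIT_MEMLOCK, &rl) != 0) {
		perror("setrlimit");
		return 1;
	}

	/* grow nr_events until io_setup() starts failing */
	for (nr = 1; ; nr++) {
		ctx = 0;
		if (io_setup(nr, &ctx) < 0)
			break;
		io_destroy(ctx);
	}
	printf("largest single ring: %u events\n", nr - 1);

	/* then count how many one-event contexts can be created */
	for (n = 0; n < 1024; n++) {
		ctxs[n] = 0;
		if (io_setup(1, &ctxs[n]) < 0)
			break;
	}
	printf("one-event contexts created: %d\n", n);

	for (i = 0; i < n; i++)
		io_destroy(ctxs[i]);
	return 0;
}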

I also ran the aio-dio-regress regression tests, and all passed but
one.  The one failure occurred only because the test expected a return
code of -ENOMEM and instead got -EAGAIN.  Part of the patch below returns
a proper error code from aio_setup_ring, so I'm calling this a test
issue, but we can debate that nuance if it is deemed important.

Further, as a result of this exercise, I noticed that there are numerous
places in the kernel that test to see if locking memory will exceed the
maximum amount of allowed locked memory.  I've factored out that bit of
code in a follow-on patch that I will post.

Finally, I updated the aio man pages to reflect this change (and fix
references to aio_context_t that should be io_context_t).

You can find a copy of the test program I used at:
  http://people.redhat.com/jmoyer/aio/ioctx_alloc.c
I'll apologize in advance for the crudity of the test code.

For those reviewing the below patch, it may be worth looking at:

commit d55b5fdaf40846221d543937b786956e27837fda
Author: Zach Brown <zach.brown@oracle.com>
Date:   Mon Nov 7 00:59:31 2005 -0800

    [PATCH] aio: remove aio_max_nr accounting race

The below patch basically reverts the above commit.  There is no
accounting race now, because we are charging per-process rlimits instead
of a system-wide maximum number of aio requests.  This has the added
benefit of reducing some code complexity.

Comments and suggestions are encouraged.  I'll follow up with the other
two patches I mentioned shortly.

Thanks!
Jeff

diff --git a/fs/aio.c b/fs/aio.c
index 8fa77e2..3bbda9d 100644
--- a/fs/aio.c
+++ b/fs/aio.c
@@ -43,9 +43,8 @@
 #endif
 
 /*------ sysctl variables----*/
-static DEFINE_SPINLOCK(aio_nr_lock);
-unsigned long aio_nr;		/* current system wide number of aio requests */
-unsigned long aio_max_nr = 0x10000; /* system wide maximum number of aio requests */
+/* current system wide number of aio requests */
+atomic_t aio_nr = ATOMIC_INIT(0);
 /*----end sysctl variables---*/
 
 static struct kmem_cache	*kiocb_cachep;
@@ -90,6 +89,7 @@ static void aio_free_ring(struct kioctx *ctx)
 	if (info->mmap_size) {
 		down_write(&ctx->mm->mmap_sem);
 		do_munmap(ctx->mm, info->mmap_base, info->mmap_size);
+		ctx->mm->locked_vm -= info->nr_pages;
 		up_write(&ctx->mm->mmap_sem);
 	}
 
@@ -106,6 +106,8 @@ static int aio_setup_ring(struct kioctx *ctx)
 	unsigned nr_events = ctx->max_reqs;
 	unsigned long size;
 	int nr_pages;
+	unsigned long locked;
+	unsigned long lock_limit;
 
 	/* Compensate for the ring buffer's head/tail overlap entry */
 	nr_events += 2;	/* 1 is required, 2 for good luck */
@@ -130,6 +132,20 @@ static int aio_setup_ring(struct kioctx *ctx)
 	info->mmap_size = nr_pages * PAGE_SIZE;
 	dprintk("attempting mmap of %lu bytes\n", info->mmap_size);
 	down_write(&ctx->mm->mmap_sem);
+	/*
+	 * Check that the memory reserved for the completion ring does
+	 * not exceed the memlock memory limit.
+	 */
+	locked = nr_pages + ctx->mm->locked_vm;
+	lock_limit = current->signal->rlim[RLIMIT_MEMLOCK].rlim_cur;
+	lock_limit >>= PAGE_SHIFT;
+	if ((locked > lock_limit) && !capable(CAP_IPC_LOCK)) {
+		up_write(&ctx->mm->mmap_sem);
+		info->mmap_size = 0;
+		aio_free_ring(ctx);
+		return -EAGAIN;
+	}
+
 	info->mmap_base = do_mmap(NULL, 0, info->mmap_size, 
 				  PROT_READ|PROT_WRITE, MAP_ANONYMOUS|MAP_PRIVATE,
 				  0);
@@ -144,6 +160,7 @@ static int aio_setup_ring(struct kioctx *ctx)
 	info->nr_pages = get_user_pages(current, ctx->mm,
 					info->mmap_base, nr_pages, 
 					1, 0, info->ring_pages, NULL);
+	ctx->mm->locked_vm += info->nr_pages;
 	up_write(&ctx->mm->mmap_sem);
 
 	if (unlikely(info->nr_pages != nr_pages)) {
@@ -194,16 +211,8 @@ static int aio_setup_ring(struct kioctx *ctx)
 static void ctx_rcu_free(struct rcu_head *head)
 {
 	struct kioctx *ctx = container_of(head, struct kioctx, rcu_head);
-	unsigned nr_events = ctx->max_reqs;
 
 	kmem_cache_free(kioctx_cachep, ctx);
-
-	if (nr_events) {
-		spin_lock(&aio_nr_lock);
-		BUG_ON(aio_nr - nr_events > aio_nr);
-		aio_nr -= nr_events;
-		spin_unlock(&aio_nr_lock);
-	}
 }
 
 /* __put_ioctx
@@ -217,6 +226,7 @@ static void __put_ioctx(struct kioctx *ctx)
 	cancel_delayed_work(&ctx->wq);
 	cancel_work_sync(&ctx->wq.work);
 	aio_free_ring(ctx);
+	atomic_sub(ctx->max_reqs, &aio_nr);
 	mmdrop(ctx->mm);
 	ctx->mm = NULL;
 	pr_debug("__put_ioctx: freeing %p\n", ctx);
@@ -240,7 +250,7 @@ static struct kioctx *ioctx_alloc(unsigned nr_events)
 {
 	struct mm_struct *mm;
 	struct kioctx *ctx;
-	int did_sync = 0;
+	int ret;
 
 	/* Prevent overflows */
 	if ((nr_events > (0x10000000U / sizeof(struct io_event))) ||
@@ -249,9 +259,6 @@ static struct kioctx *ioctx_alloc(unsigned nr_events)
 		return ERR_PTR(-EINVAL);
 	}
 
-	if ((unsigned long)nr_events > aio_max_nr)
-		return ERR_PTR(-EAGAIN);
-
 	ctx = kmem_cache_zalloc(kioctx_cachep, GFP_KERNEL);
 	if (!ctx)
 		return ERR_PTR(-ENOMEM);
@@ -269,26 +276,11 @@ static struct kioctx *ioctx_alloc(unsigned nr_events)
 	INIT_LIST_HEAD(&ctx->run_list);
 	INIT_DELAYED_WORK(&ctx->wq, aio_kick_handler);
 
-	if (aio_setup_ring(ctx) < 0)
+	if ((ret = aio_setup_ring(ctx)) < 0)
 		goto out_freectx;
 
-	/* limit the number of system wide aios */
-	do {
-		spin_lock_bh(&aio_nr_lock);
-		if (aio_nr + nr_events > aio_max_nr ||
-		    aio_nr + nr_events < aio_nr)
-			ctx->max_reqs = 0;
-		else
-			aio_nr += ctx->max_reqs;
-		spin_unlock_bh(&aio_nr_lock);
-		if (ctx->max_reqs || did_sync)
-			break;
-
-		/* wait for rcu callbacks to have completed before giving up */
-		synchronize_rcu();
-		did_sync = 1;
-		ctx->max_reqs = nr_events;
-	} while (1);
+	/* update the number of system wide aios */
+	atomic_add(ctx->max_reqs, &aio_nr);  /* undone by __put_ioctx */
 
 	if (ctx->max_reqs == 0)
 		goto out_cleanup;
@@ -309,7 +301,7 @@ out_cleanup:
 out_freectx:
 	mmdrop(mm);
 	kmem_cache_free(kioctx_cachep, ctx);
-	ctx = ERR_PTR(-ENOMEM);
+	ctx = ERR_PTR(ret);
 
 	dprintk("aio: error allocating ioctx %p\n", ctx);
 	return ctx;
diff --git a/include/linux/aio.h b/include/linux/aio.h
index b16a957..09ec393 100644
--- a/include/linux/aio.h
+++ b/include/linux/aio.h
@@ -233,7 +233,6 @@ static inline struct kiocb *list_kiocb(struct list_head *h)
 }
 
 /* for sysctl: */
-extern unsigned long aio_nr;
-extern unsigned long aio_max_nr;
+extern atomic_t aio_nr;
 
 #endif /* __LINUX__AIO_H */
diff --git a/kernel/sysctl.c b/kernel/sysctl.c
index c5ef44f..32ced38 100644
--- a/kernel/sysctl.c
+++ b/kernel/sysctl.c
@@ -1385,14 +1385,7 @@ static struct ctl_table fs_table[] = {
 		.data		= &aio_nr,
 		.maxlen		= sizeof(aio_nr),
 		.mode		= 0444,
-		.proc_handler	= &proc_doulongvec_minmax,
-	},
-	{
-		.procname	= "aio-max-nr",
-		.data		= &aio_max_nr,
-		.maxlen		= sizeof(aio_max_nr),
-		.mode		= 0644,
-		.proc_handler	= &proc_doulongvec_minmax,
+		.proc_handler	= &proc_dointvec,
 	},
 #endif /* CONFIG_AIO */
 #ifdef CONFIG_INOTIFY_USER


* [patch] factor out checks against the memlock rlimit
  2009-03-09 15:49 [patch] aio: remove aio-max-nr and instead use the memlock rlimit to limit the number of pages pinned for the aio completion ring Jeff Moyer
@ 2009-03-09 15:54 ` Jeff Moyer
  2009-03-09 15:59 ` [patch] man-pages: add documentation about the memlock implications of io_setup Jeff Moyer
                   ` (2 subsequent siblings)
  3 siblings, 0 replies; 49+ messages in thread
From: Jeff Moyer @ 2009-03-09 15:54 UTC (permalink / raw)
  To: linux-aio, bos; +Cc: zach.brown, bcrl, Andrew Morton, linux-kernel

Hi,

There are several places in the kernel where the memlock rlimit is
checked, all duplicating code.  I added another in a recent patch and it
made me feel dirty, so this patch factors all of that into a single
function, can_mlock_pages.

The infiniband implementation of the rlimit check was actually broken.
Weeding through changelogs showed that the initial implementation was
wrong, code was #if 0'd out because it didn't work, and instead of
fixing things, they just removed the code altogether.  If my
assessment of that is wrong, please let me know!  ;-)

Comments, as always, are encouraged.

Cheers,
Jeff

diff --git a/drivers/infiniband/core/umem.c b/drivers/infiniband/core/umem.c
index 6f7c096..c4d4f1b 100644
--- a/drivers/infiniband/core/umem.c
+++ b/drivers/infiniband/core/umem.c
@@ -135,14 +135,11 @@ struct ib_umem *ib_umem_get(struct ib_ucontext *context, unsigned long addr,
 
 	down_write(&current->mm->mmap_sem);
 
-	locked     = npages + current->mm->locked_vm;
-	lock_limit = current->signal->rlim[RLIMIT_MEMLOCK].rlim_cur >> PAGE_SHIFT;
-
-	if ((locked > lock_limit) && !capable(CAP_IPC_LOCK)) {
+	if (!can_mlock_pages(npages)) {
 		ret = -ENOMEM;
 		goto out;
 	}
-
+	locked = npages + current->mm->locked_vm;
 	cur_base = addr & PAGE_MASK;
 
 	ret = 0;
diff --git a/drivers/infiniband/hw/ipath/ipath_user_pages.c b/drivers/infiniband/hw/ipath/ipath_user_pages.c
index 0190edc..9ab17e3 100644
--- a/drivers/infiniband/hw/ipath/ipath_user_pages.c
+++ b/drivers/infiniband/hw/ipath/ipath_user_pages.c
@@ -58,10 +58,7 @@ static int __get_user_pages(unsigned long start_page, size_t num_pages,
 	size_t got;
 	int ret;
 
-	lock_limit = current->signal->rlim[RLIMIT_MEMLOCK].rlim_cur >>
-		PAGE_SHIFT;
-
-	if (num_pages > lock_limit) {
+	if (!can_mlock_pages(num_pages)) {
 		ret = -ENOMEM;
 		goto bail;
 	}
diff --git a/fs/aio.c b/fs/aio.c
index 3bbda9d..0ca64eb 100644
--- a/fs/aio.c
+++ b/fs/aio.c
@@ -136,10 +136,7 @@ static int aio_setup_ring(struct kioctx *ctx)
 	 * Check that the memory reserved for the completion ring does
 	 * not exceed the memlock memory limit.
 	 */
-	locked = nr_pages + ctx->mm->locked_vm;
-	lock_limit = current->signal->rlim[RLIMIT_MEMLOCK].rlim_cur;
-	lock_limit >>= PAGE_SHIFT;
-	if ((locked > lock_limit) && !capable(CAP_IPC_LOCK)) {
+	if (!can_mlock_pages(nr_pages)) {
 		up_write(&ctx->mm->mmap_sem);
 		info->mmap_size = 0;
 		aio_free_ring(ctx);
diff --git a/include/linux/mm.h b/include/linux/mm.h
index 065cdf8..79c1127 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -737,6 +737,7 @@ extern unsigned long shmem_get_unmapped_area(struct file *file,
 #endif
 
 extern int can_do_mlock(void);
+extern int can_mlock_pages(unsigned long);
 extern int user_shm_lock(size_t, struct user_struct *);
 extern void user_shm_unlock(size_t, struct user_struct *);
 
diff --git a/mm/mlock.c b/mm/mlock.c
index cbe9e05..4ec96c2 100644
--- a/mm/mlock.c
+++ b/mm/mlock.c
@@ -31,6 +31,33 @@ int can_do_mlock(void)
 }
 EXPORT_SYMBOL(can_do_mlock);
 
+/**
+ * can_mlock_pages() - tell whether mlocking nr_pages is possible
+ * @nr_pages:	Number of pages to be locked in memory
+ *
+ * Boolean function which tells whether the MEMLOCK rlimit will prevent
+ * locking nr_pages pages.
+ */
+int can_mlock_pages(unsigned long nr_pages)
+{
+	unsigned long locked, lock_limit;
+
+	if (capable(CAP_IPC_LOCK))
+		return 1;
+
+	locked = nr_pages;
+	locked += current->mm->locked_vm;
+
+	lock_limit = current->signal->rlim[RLIMIT_MEMLOCK].rlim_cur;
+	lock_limit >>= PAGE_SHIFT;
+
+	/* check against resource limits */
+	if (locked <= lock_limit)
+		return 1;
+	return 0;
+}
+EXPORT_SYMBOL(can_mlock_pages);
+
 #ifdef CONFIG_UNEVICTABLE_LRU
 /*
  * Mlocked pages are marked with PageMlocked() flag for efficient testing
@@ -505,14 +532,8 @@ SYSCALL_DEFINE2(mlock, unsigned long, start, size_t, len)
 	len = PAGE_ALIGN(len + (start & ~PAGE_MASK));
 	start &= PAGE_MASK;
 
-	locked = len >> PAGE_SHIFT;
-	locked += current->mm->locked_vm;
-
-	lock_limit = current->signal->rlim[RLIMIT_MEMLOCK].rlim_cur;
-	lock_limit >>= PAGE_SHIFT;
-
 	/* check against resource limits */
-	if ((locked <= lock_limit) || capable(CAP_IPC_LOCK))
+	if (can_mlock_pages(len >> PAGE_SHIFT))
 		error = do_mlock(start, len, 1);
 	up_write(&current->mm->mmap_sem);
 	return error;
diff --git a/mm/mmap.c b/mm/mmap.c
index 00ced3e..52d1c32 100644
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -975,12 +975,7 @@ unsigned long do_mmap_pgoff(struct file *file, unsigned long addr,
 
 	/* mlock MCL_FUTURE? */
 	if (vm_flags & VM_LOCKED) {
-		unsigned long locked, lock_limit;
-		locked = len >> PAGE_SHIFT;
-		locked += mm->locked_vm;
-		lock_limit = current->signal->rlim[RLIMIT_MEMLOCK].rlim_cur;
-		lock_limit >>= PAGE_SHIFT;
-		if (locked > lock_limit && !capable(CAP_IPC_LOCK))
+		if (!can_mlock_pages(len >> PAGE_SHIFT))
 			return -EAGAIN;
 	}
 
@@ -2006,12 +2001,7 @@ unsigned long do_brk(unsigned long addr, unsigned long len)
 	 * mlock MCL_FUTURE?
 	 */
 	if (mm->def_flags & VM_LOCKED) {
-		unsigned long locked, lock_limit;
-		locked = len >> PAGE_SHIFT;
-		locked += mm->locked_vm;
-		lock_limit = current->signal->rlim[RLIMIT_MEMLOCK].rlim_cur;
-		lock_limit >>= PAGE_SHIFT;
-		if (locked > lock_limit && !capable(CAP_IPC_LOCK))
+		if (!can_mlock_pages(len>>PAGE_SHIFT))
 			return -EAGAIN;
 	}
 
diff --git a/mm/mremap.c b/mm/mremap.c
index a39b7b9..6fb50e2 100644
--- a/mm/mremap.c
+++ b/mm/mremap.c
@@ -342,12 +342,8 @@ unsigned long do_mremap(unsigned long addr,
 			goto out;
 	}
 	if (vma->vm_flags & VM_LOCKED) {
-		unsigned long locked, lock_limit;
-		locked = mm->locked_vm << PAGE_SHIFT;
-		lock_limit = current->signal->rlim[RLIMIT_MEMLOCK].rlim_cur;
-		locked += new_len - old_len;
 		ret = -EAGAIN;
-		if (locked > lock_limit && !capable(CAP_IPC_LOCK))
+		if (!can_mlock_pages((new_len - old_len)>>PAGE_SHIFT))
 			goto out;
 	}
 	if (!may_expand_vm(mm, (new_len - old_len) >> PAGE_SHIFT)) {


* [patch] man-pages: add documentation about the memlock implications of io_setup
  2009-03-09 15:49 [patch] aio: remove aio-max-nr and instead use the memlock rlimit to limit the number of pages pinned for the aio completion ring Jeff Moyer
  2009-03-09 15:54 ` [patch] factor out checks against the memlock rlimit Jeff Moyer
@ 2009-03-09 15:59 ` Jeff Moyer
  2009-03-09 16:45   ` Michael Kerrisk
  2009-03-09 16:48   ` Michael Kerrisk
  2009-03-09 16:18 ` [patch] aio: remove aio-max-nr and instead use the memlock rlimit to limit the number of pages pinned for the aio completion ring Avi Kivity
  2009-03-09 22:36 ` [patch] aio: remove aio-max-nr and instead use the memlock rlimit to limit the number of pages pinned for the aio completion ring Andrew Morton
  3 siblings, 2 replies; 49+ messages in thread
From: Jeff Moyer @ 2009-03-09 15:59 UTC (permalink / raw)
  To: linux-aio, michael.kerrisk; +Cc: zach.brown, bcrl, Andrew Morton, linux-kernel

Hi,

This patch, against man-pages-3.19, adds verbiage surrounding the newly
proposed accounting of pinned aio completion ring pages against the
memlock rlimit.  It also fixes up a few references to aio_context_t
(which should be io_context_t).  Sorry for conflating the two, I can
break it apart if needed.

Cheers,
Jeff

diff -up man-pages-3.19/man2/io_cancel.2.orig man-pages-3.19/man2/io_cancel.2
--- man-pages-3.19/man2/io_cancel.2.orig	2009-03-08 11:57:50.000000000 -0400
+++ man-pages-3.19/man2/io_cancel.2	2009-03-08 11:58:01.000000000 -0400
@@ -32,7 +32,7 @@ io_cancel \- cancel an outstanding async
 .\"#include <linux/aio.h>
 .sp
 .\" .HP 16
-.BI "int io_cancel(aio_context_t " ctx_id ", struct iocb *" iocb ,
+.BI "int io_cancel(io_context_t " ctx_id ", struct iocb *" iocb ,
 .BI "              struct io_event *" result );
 .\" .ad
 .\" .hy
diff -up man-pages-3.19/man2/io_destroy.2.orig man-pages-3.19/man2/io_destroy.2
--- man-pages-3.19/man2/io_destroy.2.orig	2009-03-08 11:58:08.000000000 -0400
+++ man-pages-3.19/man2/io_destroy.2	2009-03-08 11:58:16.000000000 -0400
@@ -31,7 +31,7 @@ io_destroy \- destroy an asynchronous I/
 .\" #include <linux/aio.h>
 .sp
 .\" .HP 17
-.BI "int io_destroy(aio_context_t " ctx );
+.BI "int io_destroy(io_context_t " ctx );
 .\" .ad
 .\" .hy
 .sp
diff -up man-pages-3.19/man2/io_setup.2.orig man-pages-3.19/man2/io_setup.2
--- man-pages-3.19/man2/io_setup.2.orig	2009-03-08 11:53:39.000000000 -0400
+++ man-pages-3.19/man2/io_setup.2	2009-03-08 11:57:09.000000000 -0400
@@ -31,7 +31,7 @@ io_setup \- create an asynchronous I/O c
 .\" #include <linux/aio.h>
 .sp
 .\" .HP 15
-.BI "int io_setup(unsigned " nr_events ", aio_context_t *" ctxp );
+.BI "int io_setup(unsigned " nr_events ", io_context_t *" ctxp );
 .\" .ad
 .\" .hy
 .sp
@@ -45,7 +45,9 @@ at least \fInr_events\fP.
 \fIctxp\fP must not point to an AIO context that already exists, and must
 be initialized to 0 prior to the call.
 On successful creation of the AIO context, \fI*ctxp\fP is filled in
-with the resulting handle.
+with the resulting handle.  Memory is reserved by the kernel for the completion
+ring and, because this memory is pinned, it will be charged against the
+process's memlock rlimit.
 .SH "RETURN VALUE"
 On success,
 .BR io_setup ()
@@ -54,7 +56,8 @@ For the failure return, see NOTES.
 .SH "ERRORS"
 .TP
 .B EAGAIN
-The specified \fInr_events\fP exceeds the user's limit of available events.
+The specified \fInr_events\fP exceeds the process's limit of available events.
+Try increasing the memlock rlimit for the process.
 .TP
 .B EFAULT
 An invalid pointer is passed for \fIctxp\fP.
diff -up man-pages-3.19/man2/io_submit.2.orig man-pages-3.19/man2/io_submit.2
--- man-pages-3.19/man2/io_submit.2.orig	2009-03-08 11:58:26.000000000 -0400
+++ man-pages-3.19/man2/io_submit.2	2009-03-08 11:58:34.000000000 -0400
@@ -31,7 +31,7 @@ io_submit \- submit asynchronous I/O blo
 .\" #include <linux/aio.h>
 .sp
 .\" .HP 16
-.BI "int io_submit(aio_context_t " ctx_id ", long " nr \
+.BI "int io_submit(io_context_t " ctx_id ", long " nr \
 ", struct iocb **" iocbpp );
 .\" .ad
 .\" .hy
@@ -63,7 +63,7 @@ The file descriptor specified in the fir
 One of the data structures points to invalid data.
 .TP
 .B EINVAL
-The \fIaio_context\fP specified by \fIctx_id\fP is invalid.
+The \fIio_context\fP specified by \fIctx_id\fP is invalid.
 \fInr\fP is less than 0.
 The \fIiocb\fP at *iocbpp[0] is not properly initialized,
 or the operation specified is invalid for the file descriptor
diff -up man-pages-3.19/man7/capabilities.7.orig man-pages-3.19/man7/capabilities.7
--- man-pages-3.19/man7/capabilities.7.orig	2009-03-08 11:52:56.000000000 -0400
+++ man-pages-3.19/man7/capabilities.7	2009-03-08 11:53:14.000000000 -0400
@@ -129,7 +129,8 @@ Lock memory
 .RB ( mlock (2),
 .BR mlockall (2),
 .BR mmap (2),
-.BR shmctl (2)).
+.BR shmctl (2),
+.BR io_setup (2)).
 .TP
 .B CAP_IPC_OWNER
 Bypass permission checks for operations on System V IPC objects.


* Re: [patch] aio: remove aio-max-nr and instead use the memlock rlimit to limit the number of pages pinned for the aio completion ring
  2009-03-09 15:49 [patch] aio: remove aio-max-nr and instead use the memlock rlimit to limit the number of pages pinned for the aio completion ring Jeff Moyer
  2009-03-09 15:54 ` [patch] factor out checks against the memlock rlimit Jeff Moyer
  2009-03-09 15:59 ` [patch] man-pages: add documentation about the memlock implications of io_setup Jeff Moyer
@ 2009-03-09 16:18 ` Avi Kivity
  2009-03-09 17:57   ` Jeff Moyer
  2009-03-09 22:36 ` [patch] aio: remove aio-max-nr and instead use the memlock rlimit to limit the number of pages pinned for the aio completion ring Andrew Morton
  3 siblings, 1 reply; 49+ messages in thread
From: Avi Kivity @ 2009-03-09 16:18 UTC (permalink / raw)
  To: Jeff Moyer; +Cc: linux-aio, zach.brown, bcrl, Andrew Morton, linux-kernel

Jeff Moyer wrote:
> Hi,
>
> Believe it or not, I get numerous questions from customers about the
> suggested tuning value of aio-max-nr.  aio-max-nr limits the total
> number of io events that can be reserved, system wide, for aio
> completions.  Each time io_setup is called, a ring buffer is allocated
> that can hold nr_events I/O completions.  That ring buffer is then
> mapped into the process' address space, and the pages are pinned in
> memory.  So, the reason for this upper limit (I believe) is to keep a
> malicious user from pinning all of kernel memory.  Now, this sounds like
> a much better job for the memlock rlimit to me, hence the following
> patch.
>   

Is it not possible to get rid of the pinning entirely?  Pinning 
interferes with page migration which is important for NUMA, among other 
issues.

-- 
error compiling committee.c: too many arguments to function



* Re: [patch] man-pages: add documentation about the memlock  implications of io_setup
  2009-03-09 15:59 ` [patch] man-pages: add documentation about the memlock implications of io_setup Jeff Moyer
@ 2009-03-09 16:45   ` Michael Kerrisk
  2009-03-09 16:48   ` Michael Kerrisk
  1 sibling, 0 replies; 49+ messages in thread
From: Michael Kerrisk @ 2009-03-09 16:45 UTC (permalink / raw)
  To: Jeff Moyer
  Cc: linux-aio, michael.kerrisk, zach.brown, bcrl, Andrew Morton,
	linux-kernel

[CC+=Vegard Nossum]

Hi Jeff,

Note that my address for man-pages is mtk.manpages@gmail.com.

On Tue, Mar 10, 2009 at 4:59 AM, Jeff Moyer <jmoyer@redhat.com> wrote:
> Hi,
>
> This patch, against man-pages-3.19, adds verbiage surrounding the newly
> proposed accounting of pinned aio completion ring pages against the
> memlock rlimit.  It also fixes up a few references to aio_context_t
> (which should be io_context_t).  Sorry for conflating the two, I can
> break it apart if needed.

Breaking apart really would be better.  Could you resubmit (to
mtk.manpages@, and CC linux-man@vger) please?

When you resubmit, could you please clarify the following:

1. For the "proposed accounting of pinned aio completion ring pages
against the memlock rlimit", can you tell me which mainline kernel
version this change (will) appear(s) in.  (You say "proposed", so that
it's not clear when/if the change will be merged into mainline.  A git
commit reference could be useful to me here.)

2. Regarding  aio_context_t vs io_context_t, Vegard Nossum also raised
this issue a while back, and I replied that the aio_context_t type is
used because:

[[
These pages attempt to document the bare syscall API.
Take a look at the kernel sources: there the type *is*
aio_context_t.
]]

Perhaps I am/was being muddle-headed here, since, after all, the pages
do say "Link with -laio."  (Once upon a time the pages actually *did*
use "io_context_t", but I changed that during a big clean-up of the
io_*.2 pages a year or two back.)  Your thoughts?  (Why do we even
have aio_context_t in the kernel versus io_context_t in libaio?)

Cheers,

Michael


> diff -up man-pages-3.19/man2/io_cancel.2.orig man-pages-3.19/man2/io_cancel.2
> --- man-pages-3.19/man2/io_cancel.2.orig        2009-03-08 11:57:50.000000000 -0400
> +++ man-pages-3.19/man2/io_cancel.2     2009-03-08 11:58:01.000000000 -0400
> @@ -32,7 +32,7 @@ io_cancel \- cancel an outstanding async
>  .\"#include <linux/aio.h>
>  .sp
>  .\" .HP 16
> -.BI "int io_cancel(aio_context_t " ctx_id ", struct iocb *" iocb ,
> +.BI "int io_cancel(io_context_t " ctx_id ", struct iocb *" iocb ,
>  .BI "              struct io_event *" result );
>  .\" .ad
>  .\" .hy
> diff -up man-pages-3.19/man2/io_destroy.2.orig man-pages-3.19/man2/io_destroy.2
> --- man-pages-3.19/man2/io_destroy.2.orig       2009-03-08 11:58:08.000000000 -0400
> +++ man-pages-3.19/man2/io_destroy.2    2009-03-08 11:58:16.000000000 -0400
> @@ -31,7 +31,7 @@ io_destroy \- destroy an asynchronous I/
>  .\" #include <linux/aio.h>
>  .sp
>  .\" .HP 17
> -.BI "int io_destroy(aio_context_t " ctx );
> +.BI "int io_destroy(io_context_t " ctx );
>  .\" .ad
>  .\" .hy
>  .sp
> diff -up man-pages-3.19/man2/io_setup.2.orig man-pages-3.19/man2/io_setup.2
> --- man-pages-3.19/man2/io_setup.2.orig 2009-03-08 11:53:39.000000000 -0400
> +++ man-pages-3.19/man2/io_setup.2      2009-03-08 11:57:09.000000000 -0400
> @@ -31,7 +31,7 @@ io_setup \- create an asynchronous I/O c
>  .\" #include <linux/aio.h>
>  .sp
>  .\" .HP 15
> -.BI "int io_setup(unsigned " nr_events ", aio_context_t *" ctxp );
> +.BI "int io_setup(unsigned " nr_events ", io_context_t *" ctxp );
>  .\" .ad
>  .\" .hy
>  .sp
> @@ -45,7 +45,9 @@ at least \fInr_events\fP.
>  \fIctxp\fP must not point to an AIO context that already exists, and must
>  be initialized to 0 prior to the call.
>  On successful creation of the AIO context, \fI*ctxp\fP is filled in
> -with the resulting handle.
> +with the resulting handle.  Memory is reserved by the kernel for the completion
> +ring and, because this memory is pinned, it will be charged against the
> +process's memlock rlimit.
>  .SH "RETURN VALUE"
>  On success,
>  .BR io_setup ()
> @@ -54,7 +56,8 @@ For the failure return, see NOTES.
>  .SH "ERRORS"
>  .TP
>  .B EAGAIN
> -The specified \fInr_events\fP exceeds the user's limit of available events.
> +The specified \fInr_events\fP exceeds the process's limit of available events.
> +Try increasing the memlock rlimit for the process.
>  .TP
>  .B EFAULT
>  An invalid pointer is passed for \fIctxp\fP.
> diff -up man-pages-3.19/man2/io_submit.2.orig man-pages-3.19/man2/io_submit.2
> --- man-pages-3.19/man2/io_submit.2.orig        2009-03-08 11:58:26.000000000 -0400
> +++ man-pages-3.19/man2/io_submit.2     2009-03-08 11:58:34.000000000 -0400
> @@ -31,7 +31,7 @@ io_submit \- submit asynchronous I/O blo
>  .\" #include <linux/aio.h>
>  .sp
>  .\" .HP 16
> -.BI "int io_submit(aio_context_t " ctx_id ", long " nr \
> +.BI "int io_submit(io_context_t " ctx_id ", long " nr \
>  ", struct iocb **" iocbpp );
>  .\" .ad
>  .\" .hy
> @@ -63,7 +63,7 @@ The file descriptor specified in the fir
>  One of the data structures points to invalid data.
>  .TP
>  .B EINVAL
> -The \fIaio_context\fP specified by \fIctx_id\fP is invalid.
> +The \fIio_context\fP specified by \fIctx_id\fP is invalid.
>  \fInr\fP is less than 0.
>  The \fIiocb\fP at *iocbpp[0] is not properly initialized,
>  or the operation specified is invalid for the file descriptor
> diff -up man-pages-3.19/man7/capabilities.7.orig man-pages-3.19/man7/capabilities.7
> --- man-pages-3.19/man7/capabilities.7.orig     2009-03-08 11:52:56.000000000 -0400
> +++ man-pages-3.19/man7/capabilities.7  2009-03-08 11:53:14.000000000 -0400
> @@ -129,7 +129,8 @@ Lock memory
>  .RB ( mlock (2),
>  .BR mlockall (2),
>  .BR mmap (2),
> -.BR shmctl (2)).
> +.BR shmctl (2),
> +.BR io_setup (2)).
>  .TP
>  .B CAP_IPC_OWNER
>  Bypass permission checks for operations on System V IPC objects.
>


* Re: [patch] man-pages: add documentation about the memlock  implications of io_setup
  2009-03-09 15:59 ` [patch] man-pages: add documentation about the memlock implications of io_setup Jeff Moyer
  2009-03-09 16:45   ` Michael Kerrisk
@ 2009-03-09 16:48   ` Michael Kerrisk
  2009-03-09 20:44     ` Jeff Moyer
  1 sibling, 1 reply; 49+ messages in thread
From: Michael Kerrisk @ 2009-03-09 16:48 UTC (permalink / raw)
  To: Jeff Moyer
  Cc: linux-aio, michael.kerrisk, zach.brown, bcrl, Andrew Morton,
	linux-kernel, Vegard Nossum, Michael Kerrisk

[RESEND: Really CC+=Vegard Nossum, this time, and += mtk.manpages@, as well]

Hi Jeff,

Note that my address for man-pages is mtk.manpages@gmail.com.

On Tue, Mar 10, 2009 at 4:59 AM, Jeff Moyer <jmoyer@redhat.com> wrote:
> Hi,
>
> This patch, against man-pages-3.19, adds verbiage surrounding the newly
> proposed accounting of pinned aio completion ring pages against the
> memlock rlimit.  It also fixes up a few references to aio_context_t
> (which should be io_context_t).  Sorry for conflating the two, I can
> break it apart if needed.

Breaking apart really would be better.  Could you resubmit (to
mtk.manpages@, and CC linux-man@vger) please?

When you resubmit, could you please clarify the following:

1. For the "proposed accounting of pinned aio completion ring pages
against the memlock rlimit", can you tell me which mainline kernel
version this change (will) appear(s) in.  (You say "proposed", so that
it's not clear when/if the change will be merged into mainline.  A git
commit reference could be useful to me here.)

2. Regarding  aio_context_t vs io_context_t, Vegard Nossum also raised
this issue a while back, and I replied that the aio_context_t type is
used because:

[[
These pages attempt to document the bare syscall API.
Take a look at the kernel sources: there the type *is*
aio_context_t.
]]

Perhaps I am/was being muddle-headed here, since, after all, the pages
do say "Link with -laio."  (Once upon a time the pages actually *did*
use "io_context_t", but I changed that during a big clean-up of the
io_*.2 pages a year or two back.)  Your thoughts?  (Why do we even
have aio_context_t in the kernel versus io_context_t in libaio?)

Cheers,

Michael

> diff -up man-pages-3.19/man2/io_cancel.2.orig man-pages-3.19/man2/io_cancel.2
> --- man-pages-3.19/man2/io_cancel.2.orig        2009-03-08 11:57:50.000000000 -0400
> +++ man-pages-3.19/man2/io_cancel.2     2009-03-08 11:58:01.000000000 -0400
> @@ -32,7 +32,7 @@ io_cancel \- cancel an outstanding async
>  .\"#include <linux/aio.h>
>  .sp
>  .\" .HP 16
> -.BI "int io_cancel(aio_context_t " ctx_id ", struct iocb *" iocb ,
> +.BI "int io_cancel(io_context_t " ctx_id ", struct iocb *" iocb ,
>  .BI "              struct io_event *" result );
>  .\" .ad
>  .\" .hy
> diff -up man-pages-3.19/man2/io_destroy.2.orig man-pages-3.19/man2/io_destroy.2
> --- man-pages-3.19/man2/io_destroy.2.orig       2009-03-08 11:58:08.000000000 -0400
> +++ man-pages-3.19/man2/io_destroy.2    2009-03-08 11:58:16.000000000 -0400
> @@ -31,7 +31,7 @@ io_destroy \- destroy an asynchronous I/
>  .\" #include <linux/aio.h>
>  .sp
>  .\" .HP 17
> -.BI "int io_destroy(aio_context_t " ctx );
> +.BI "int io_destroy(io_context_t " ctx );
>  .\" .ad
>  .\" .hy
>  .sp
> diff -up man-pages-3.19/man2/io_setup.2.orig man-pages-3.19/man2/io_setup.2
> --- man-pages-3.19/man2/io_setup.2.orig 2009-03-08 11:53:39.000000000 -0400
> +++ man-pages-3.19/man2/io_setup.2      2009-03-08 11:57:09.000000000 -0400
> @@ -31,7 +31,7 @@ io_setup \- create an asynchronous I/O c
>  .\" #include <linux/aio.h>
>  .sp
>  .\" .HP 15
> -.BI "int io_setup(unsigned " nr_events ", aio_context_t *" ctxp );
> +.BI "int io_setup(unsigned " nr_events ", io_context_t *" ctxp );
>  .\" .ad
>  .\" .hy
>  .sp
> @@ -45,7 +45,9 @@ at least \fInr_events\fP.
>  \fIctxp\fP must not point to an AIO context that already exists, and must
>  be initialized to 0 prior to the call.
>  On successful creation of the AIO context, \fI*ctxp\fP is filled in
> -with the resulting handle.
> +with the resulting handle.  Memory is reserved by the kernel for the completion
> +ring and, because this memory is pinned, it will be charged against the
> +process's memlock rlimit.
>  .SH "RETURN VALUE"
>  On success,
>  .BR io_setup ()
> @@ -54,7 +56,8 @@ For the failure return, see NOTES.
>  .SH "ERRORS"
>  .TP
>  .B EAGAIN
> -The specified \fInr_events\fP exceeds the user's limit of available events.
> +The specified \fInr_events\fP exceeds the process's limit of available events.
> +Try increasing the memlock rlimit for the process.
>  .TP
>  .B EFAULT
>  An invalid pointer is passed for \fIctxp\fP.
> diff -up man-pages-3.19/man2/io_submit.2.orig man-pages-3.19/man2/io_submit.2
> --- man-pages-3.19/man2/io_submit.2.orig        2009-03-08 11:58:26.000000000 -0400
> +++ man-pages-3.19/man2/io_submit.2     2009-03-08 11:58:34.000000000 -0400
> @@ -31,7 +31,7 @@ io_submit \- submit asynchronous I/O blo
>  .\" #include <linux/aio.h>
>  .sp
>  .\" .HP 16
> -.BI "int io_submit(aio_context_t " ctx_id ", long " nr \
> +.BI "int io_submit(io_context_t " ctx_id ", long " nr \
>  ", struct iocb **" iocbpp );
>  .\" .ad
>  .\" .hy
> @@ -63,7 +63,7 @@ The file descriptor specified in the fir
>  One of the data structures points to invalid data.
>  .TP
>  .B EINVAL
> -The \fIaio_context\fP specified by \fIctx_id\fP is invalid.
> +The \fIio_context\fP specified by \fIctx_id\fP is invalid.
>  \fInr\fP is less than 0.
>  The \fIiocb\fP at *iocbpp[0] is not properly initialized,
>  or the operation specified is invalid for the file descriptor
> diff -up man-pages-3.19/man7/capabilities.7.orig man-pages-3.19/man7/capabilities.7
> --- man-pages-3.19/man7/capabilities.7.orig     2009-03-08 11:52:56.000000000 -0400
> +++ man-pages-3.19/man7/capabilities.7  2009-03-08 11:53:14.000000000 -0400
> @@ -129,7 +129,8 @@ Lock memory
>  .RB ( mlock (2),
>  .BR mlockall (2),
>  .BR mmap (2),
> -.BR shmctl (2)).
> +.BR shmctl (2),
> +.BR io_setup (2)).
>  .TP
>  .B CAP_IPC_OWNER
>  Bypass permission checks for operations on System V IPC objects.
>


* Re: [patch] aio: remove aio-max-nr and instead use the memlock rlimit to limit the number of pages pinned for the aio completion ring
  2009-03-09 16:18 ` [patch] aio: remove aio-max-nr and instead use the memlock rlimit to limit the number of pages pinned for the aio completion ring Avi Kivity
@ 2009-03-09 17:57   ` Jeff Moyer
  2009-03-09 19:45     ` Avi Kivity
  2009-03-09 20:31     ` Eric Dumazet
  0 siblings, 2 replies; 49+ messages in thread
From: Jeff Moyer @ 2009-03-09 17:57 UTC (permalink / raw)
  To: Avi Kivity; +Cc: linux-aio, zach.brown, bcrl, Andrew Morton, linux-kernel

Avi Kivity <avi@redhat.com> writes:

> Jeff Moyer wrote:
>> Hi,
>>
>> Believe it or not, I get numerous questions from customers about the
>> suggested tuning value of aio-max-nr.  aio-max-nr limits the total
>> number of io events that can be reserved, system wide, for aio
>> completions.  Each time io_setup is called, a ring buffer is allocated
>> that can hold nr_events I/O completions.  That ring buffer is then
>> mapped into the process' address space, and the pages are pinned in
>> memory.  So, the reason for this upper limit (I believe) is to keep a
>> malicious user from pinning all of kernel memory.  Now, this sounds like
>> a much better job for the memlock rlimit to me, hence the following
>> patch.
>>   
>
> Is it not possible to get rid of the pinning entirely?  Pinning
> interferes with page migration which is important for NUMA, among
> other issues.

aio_complete is called from interrupt handlers, so it can't block to fault
in a page.  Zach mentions there is a possibility of handing completions
off to a kernel thread, with all of the performance worries and extra
bookkeeping that go along with such a scheme (to help frame my concerns,
I often get lambasted over .5% performance regressions).

I'm happy to look into such a scheme, should anyone show me data that
points to this NUMA issue as an actual performance problem today.  In
the absence of such data, I simply can't justify the work at the moment.

Thanks for taking a look!

-Jeff


* Re: [patch] aio: remove aio-max-nr and instead use the memlock rlimit to limit the number of pages pinned for the aio completion ring
  2009-03-09 17:57   ` Jeff Moyer
@ 2009-03-09 19:45     ` Avi Kivity
  2009-03-09 20:36       ` Jamie Lokier
  2009-03-09 20:31     ` Eric Dumazet
  1 sibling, 1 reply; 49+ messages in thread
From: Avi Kivity @ 2009-03-09 19:45 UTC (permalink / raw)
  To: Jeff Moyer; +Cc: linux-aio, zach.brown, bcrl, Andrew Morton, linux-kernel

Jeff Moyer wrote:
>> Is it not possible to get rid of the pinning entirely?  Pinning
>> interferes with page migration which is important for NUMA, among
>> other issues.
>>     
>
> aio_complete is called from interrupt handlers, so can't block faulting
> in a page.  Zach mentions there is a possibility of handing completions
> off to a kernel thread, with all of the performance worries and extra
> bookkeeping that go along with such a scheme (to help frame my concerns,
> I often get lambasted over .5% performance regressions).
>   

Or you could queue the completions somewhere, and only copy them to user 
memory when io_getevents() is called.  I think the plan was once to 
allow events to be consumed opportunistically even without 
io_getevents(), though.


> I'm happy to look into such a scheme, should anyone show me data that
> points to this NUMA issue as an actual performance problem today.  In
> the absence of such data, I simply can't justify the work at the moment.
>   

Right now page migration is a dead duck.  Outside HPC, there is no 
support for triggering it or for getting the scheduler to prefer a 
process's memory node.  Only a minority of hosts are NUMA.

I think that will/should change in the near future.  Nehalem-based 
servers mean that NUMA will be commonplace.  The larger core counts will 
mean that hosts will run several unrelated applications (often through 
virtualization); such partitioning can easily benefit from page migration.

> Thanks for taking a look!
>   

Sorry, I didn't actually take a look at the patches.  I only reacted to 
the description - I am allergic to pinned memory.

-- 
Do not meddle in the internals of kernels, for they are subtle and quick to panic.



* Re: [patch] aio: remove aio-max-nr and instead use the memlock rlimit to limit the number of pages pinned for the aio completion ring
  2009-03-09 17:57   ` Jeff Moyer
  2009-03-09 19:45     ` Avi Kivity
@ 2009-03-09 20:31     ` Eric Dumazet
  2009-03-12  2:39       ` Eric Dumazet
  1 sibling, 1 reply; 49+ messages in thread
From: Eric Dumazet @ 2009-03-09 20:31 UTC (permalink / raw)
  To: Jeff Moyer
  Cc: Avi Kivity, linux-aio, zach.brown, bcrl, Andrew Morton, linux-kernel

Jeff Moyer wrote:
> Avi Kivity <avi@redhat.com> writes:
> 
>> Jeff Moyer wrote:
>>> Hi,
>>>
>>> Believe it or not, I get numerous questions from customers about the
>>> suggested tuning value of aio-max-nr.  aio-max-nr limits the total
>>> number of io events that can be reserved, system wide, for aio
>>> completions.  Each time io_setup is called, a ring buffer is allocated
>>> that can hold nr_events I/O completions.  That ring buffer is then
>>> mapped into the process' address space, and the pages are pinned in
>>> memory.  So, the reason for this upper limit (I believe) is to keep a
>>> malicious user from pinning all of kernel memory.  Now, this sounds like
>>> a much better job for the memlock rlimit to me, hence the following
>>> patch.
>>>   
>> Is it not possible to get rid of the pinning entirely?  Pinning
>> interferes with page migration which is important for NUMA, among
>> other issues.
> 
> aio_complete is called from interrupt handlers, so can't block faulting
> in a page.  Zach mentions there is a possibility of handing completions
> off to a kernel thread, with all of the performance worries and extra
> bookkeeping that go along with such a scheme (to help frame my concerns,
> I often get lambasted over .5% performance regressions).

This aio completion from interrupt handlers keeps us from using SLAB_DESTROY_BY_RCU
instead of call_rcu() for "struct file" freeing.

http://lkml.org/lkml/2008/12/17/364

I would love it if we could get rid of this mess...




* Re: [patch] aio: remove aio-max-nr and instead use the memlock rlimit to limit the number of pages pinned for the aio completion ring
  2009-03-09 19:45     ` Avi Kivity
@ 2009-03-09 20:36       ` Jamie Lokier
  2009-03-10  8:36         ` Avi Kivity
  0 siblings, 1 reply; 49+ messages in thread
From: Jamie Lokier @ 2009-03-09 20:36 UTC (permalink / raw)
  To: Avi Kivity
  Cc: Jeff Moyer, linux-aio, zach.brown, bcrl, Andrew Morton, linux-kernel

Avi Kivity wrote:
> Or you could queue the completions somewhere, and only copy them to
> user memory when io_getevents() is called.  I think the plan was
> once to allow events to be consumed opportunistically even without
> io_getevents(), though.

Isn't that integrated (or to be integrated) with the vring buffer
thingy used by virtio?

If not, should it be?

> Sorry, I didn't actually take a look at the patches.  I only reacted to 
> the description - I am allergic to pinned memory.

While we're at it I'm allergic to fixed size ring buffers :-)

-- Jamie


* Re: [patch] man-pages: add documentation about the memlock  implications of io_setup
  2009-03-09 16:48   ` Michael Kerrisk
@ 2009-03-09 20:44     ` Jeff Moyer
  0 siblings, 0 replies; 49+ messages in thread
From: Jeff Moyer @ 2009-03-09 20:44 UTC (permalink / raw)
  To: Michael Kerrisk
  Cc: linux-aio, michael.kerrisk, zach.brown, bcrl, Andrew Morton,
	linux-kernel, Vegard Nossum, Michael Kerrisk

Michael Kerrisk <michael.kerrisk@googlemail.com> writes:

> [RESEND: Really CC+=Vegard Nossum, this time, and += mtk.manpages@, as well]
>
> Hi Jeff,
>
> Note that my address for man-pages is mtk.manpages@gmail.com.

Sorry about that!

> On Tue, Mar 10, 2009 at 4:59 AM, Jeff Moyer <jmoyer@redhat.com> wrote:
>> Hi,
>>
>> This patch, against man-pages-3.19, adds verbiage surrounding the newly
>> proposed accounting of pinned aio completion ring pages against the
>> memlock rlimit.  It also fixes up a few references to aio_context_t
>> (which should be io_context_t).  Sorry for conflating the two, I can
>> break it apart if needed.
>
> Breaking apart really would be better.  Could you resubmit (to
> mtk.manpages@, and CC linux-man@vger) please?

Sure thing.  I'll address your question when I resubmit.

Thanks!

Jeff


* Re: [patch] aio: remove aio-max-nr and instead use the memlock rlimit to limit the number of pages pinned for the aio completion ring
  2009-03-09 15:49 [patch] aio: remove aio-max-nr and instead use the memlock rlimit to limit the number of pages pinned for the aio completion ring Jeff Moyer
                   ` (2 preceding siblings ...)
  2009-03-09 16:18 ` [patch] aio: remove aio-max-nr and instead use the memlock rlimit to limit the number of pages pinned for the aio completion ring Avi Kivity
@ 2009-03-09 22:36 ` Andrew Morton
  2009-03-10 13:43   ` Jeff Moyer
  3 siblings, 1 reply; 49+ messages in thread
From: Andrew Morton @ 2009-03-09 22:36 UTC (permalink / raw)
  To: Jeff Moyer; +Cc: linux-aio, zach.brown, bcrl, linux-kernel

On Mon, 09 Mar 2009 11:49:57 -0400
Jeff Moyer <jmoyer@redhat.com> wrote:

> Hi,
> 
> Believe it or not, I get numerous questions from customers about the
> suggested tuning value of aio-max-nr.  aio-max-nr limits the total
> number of io events that can be reserved, system wide, for aio
> completions.  Each time io_setup is called, a ring buffer is allocated
> that can hold nr_events I/O completions.  That ring buffer is then
> mapped into the process' address space, and the pages are pinned in
> memory.  So, the reason for this upper limit (I believe) is to keep a
> malicious user from pinning all of kernel memory.  Now, this sounds like
> a much better job for the memlock rlimit to me, hence the following
> patch.
> 
> It's worth noting that, by default, aio-max-nr was set to 65536
> requests.  By default, the memlock rlimit is set to 64kb per process.
> With this patch in place, a single process can specify 2045 for the
> nr_events parameter of io_setup.  Or, for the worst case scenario, a
> process can only create 16 io contexts, each with a single event (since
> the minimum ring size is a page).
> 
> I created a simple program that sets the process' memlock rlimit and
> then drops the CAP_IPC_LOCK capability (the memlock rlimit is a command
> line parameter, so you can see how the nr_events allowable changes with
> different limits in place).  Then, it calls io_setup/io_destroy in a
> loop, increasing nr_events until it fails.  Finally, it calls io_setup
> in a loop with a single event to see how many iterations it can perform
> before receiving -EAGAIN.  I ran the test on a patched kernel and the
> results are in line with expectations.
> 
> I also ran the aio-dio-regress regression tests, and all passed but
> one.  The one that failed only failed due to an expected return code of
> -ENOMEM, and instead got -EAGAIN.  Part of the patch below returns a
> proper error code from aio_setup_ring.  So, I'm calling this a test
> issue, but we can debate this nuance if it is deemed important.
> 
> Further, as a result of this exercise, I noticed that there are numerous
> places in the kernel that test to see if locking memory will exceed the
> maximum amount of allowed locked memory.  I've factored out that bit of
> code in a follow-on patch that I will post.
> 
> Finally, I updated the aio man pages to reflect this change (and fix
> references to aio_context_t that should be io_context_t).
> 
> You can find a copy of the test program I used at:
>   http://people.redhat.com/jmoyer/aio/ioctx_alloc.c
> I'll apologize in advance for the crudity of the test code.
> 
> For those reviewing the below patch, it may be worth looking at:
> 
> commit d55b5fdaf40846221d543937b786956e27837fda
> Author: Zach Brown <zach.brown@oracle.com>
> Date:   Mon Nov 7 00:59:31 2005 -0800
> 
>     [PATCH] aio: remove aio_max_nr accounting race
> 
> The below patch basically reverts the above commit.  There is no
> accounting race now, because we are charging per-process rlimits instead
> of a system-wide maximum number of aio requests.  This has the added
> benefit of reducing some code complexity.

It's risky to simply remove an existing tunable.  What if someone's
mission-critical startup script which is provided by a super-slow or
even no-longer-in-business vendor does 

	if (write(aio-max-nr, something) == error)
		crash_and_burn()

?

It would be prudent to have a more cautious update scheme.  Leave the
existing tunable in place.  Keep it working if possible.  If someone
uses it, blurt out a loud printk telling them that they're using a
deprecated interface and informing them how to update.

Then at some later time we can remove the old interface.

>  /*------ sysctl variables----*/
> -static DEFINE_SPINLOCK(aio_nr_lock);
> -unsigned long aio_nr;		/* current system wide number of aio requests */
> -unsigned long aio_max_nr = 0x10000; /* system wide maximum number of aio requests */
> +/* current system wide number of aio requests */
> +atomic_t aio_nr = ATOMIC_INIT(0);

Was it a conscious decision to downgrade this from a `long' type to an
`int' type?  It _could_ have used atomic_long_t.




* Re: [patch] aio: remove aio-max-nr and instead use the memlock rlimit to limit the number of pages pinned for the aio completion ring
  2009-03-09 20:36       ` Jamie Lokier
@ 2009-03-10  8:36         ` Avi Kivity
  0 siblings, 0 replies; 49+ messages in thread
From: Avi Kivity @ 2009-03-10  8:36 UTC (permalink / raw)
  To: Jamie Lokier
  Cc: Jeff Moyer, linux-aio, zach.brown, bcrl, Andrew Morton, linux-kernel

Jamie Lokier wrote:
> Avi Kivity wrote:
>   
>> Or you could queue the completions somewhere, and only copy them to
>> user memory when io_getevents() is called.  I think the plan was
>> once to allow events to be consumed opportunistically even without
>> io_getevents(), though.
>>     
>
> Isn't that integrated (or to be integrated) with the vring buffer
> thingy used by virtio?
>
> If not, should it be?
>
>   

vringfd has been abandoned, AFAICT.  I think it's too similar to aio anyway.


-- 
error compiling committee.c: too many arguments to function



* Re: [patch] aio: remove aio-max-nr and instead use the memlock rlimit to limit the number of pages pinned for the aio completion ring
  2009-03-09 22:36 ` [patch] aio: remove aio-max-nr and instead use the memlock rlimit to limit the number of pages pinned for the aio completion ring Andrew Morton
@ 2009-03-10 13:43   ` Jeff Moyer
  0 siblings, 0 replies; 49+ messages in thread
From: Jeff Moyer @ 2009-03-10 13:43 UTC (permalink / raw)
  To: Andrew Morton; +Cc: linux-aio, zach.brown, bcrl, linux-kernel

Andrew Morton <akpm@linux-foundation.org> writes:

> It's risky to simply remove an existing tunable.  What if someone's
> mission-critical startup script which is provided by a super-slow or
> even no-longer-in-business vendor does 
>
> 	if (write(aoi-max-nr, something) == error)
> 		crash_and_burn()
>
> ?
>
> It would be prudent to have a more cautious update scheme.  Leave the
> existing tunable in place.  Keep it working if possible.  If someone
> uses it, blurt out a loud printk telling them that they're using a
> deprecated interface and informing them how to update.
>
> Then at some later time we can remove the old interface.

You are absolutely right.  The more I think about this, the less
enthused I am about it.  I still believe the change is the right way to
go, but there are years' worth of deployments using the current scheme,
and there are existing documents detailing how to work within that
scheme.  I hereby rescind this patch.

I will follow-up with the memlock cleanup, though;  let me know if you
have comments on that patch.  Zach mentioned that he hates my new
function name (can_mlock_pages), so I guess at least that will change.
;)

Cheers,
Jeff


* Re: [patch] aio: remove aio-max-nr and instead use the memlock rlimit to limit the number of pages pinned for the aio completion ring
  2009-03-09 20:31     ` Eric Dumazet
@ 2009-03-12  2:39       ` Eric Dumazet
  2009-03-12  2:44         ` Benjamin LaHaise
  2009-03-12  3:09         ` Eric Dumazet
  0 siblings, 2 replies; 49+ messages in thread
From: Eric Dumazet @ 2009-03-12  2:39 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Jeff Moyer, Avi Kivity, linux-aio, zach.brown, bcrl,
	linux-kernel, Davide Libenzi

Eric Dumazet wrote:
> Jeff Moyer wrote:
>> Avi Kivity <avi@redhat.com> writes:
>>
>>> Jeff Moyer wrote:
>>>> Hi,
>>>>
>>>> Believe it or not, I get numerous questions from customers about the
>>>> suggested tuning value of aio-max-nr.  aio-max-nr limits the total
>>>> number of io events that can be reserved, system wide, for aio
>>>> completions.  Each time io_setup is called, a ring buffer is allocated
>>>> that can hold nr_events I/O completions.  That ring buffer is then
>>>> mapped into the process' address space, and the pages are pinned in
>>>> memory.  So, the reason for this upper limit (I believe) is to keep a
>>>> malicious user from pinning all of kernel memory.  Now, this sounds like
>>>> a much better job for the memlock rlimit to me, hence the following
>>>> patch.
>>>>   
>>> Is it not possible to get rid of the pinning entirely?  Pinning
>>> interferes with page migration which is important for NUMA, among
>>> other issues.
>> aio_complete is called from interrupt handlers, so can't block faulting
>> in a page.  Zach mentions there is a possibility of handing completions
>> off to a kernel thread, with all of the performance worries and extra
>> bookkeeping that go along with such a scheme (to help frame my concerns,
>> I often get lambasted over .5% performance regressions).
> 
> This aio_completion from interrupt handlers keep us from using SLAB_DESTROY_BY_RCU
> instead of call_rcu() for "struct file" freeing.
> 
> http://lkml.org/lkml/2008/12/17/364
> 
> I would love if we could get rid of this mess...

Speaking of that, I tried to take a look at this aio stuff and have one question.

Assuming that __fput() cannot be called from interrupt context,
-> fput() should not be called from interrupt context either.

How come we call fput(req->ki_eventfd) from really_put_req()
in interrupt context?

If the user program closes the eventfd, then in-flight AIO requests can
trigger a bug.




* Re: [patch] aio: remove aio-max-nr and instead use the memlock rlimit to limit the number of pages pinned for the aio completion ring
  2009-03-12  2:39       ` Eric Dumazet
@ 2009-03-12  2:44         ` Benjamin LaHaise
  2009-03-12  3:24           ` Eric Dumazet
  2009-03-12  3:09         ` Eric Dumazet
  1 sibling, 1 reply; 49+ messages in thread
From: Benjamin LaHaise @ 2009-03-12  2:44 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Andrew Morton, Jeff Moyer, Avi Kivity, linux-aio, zach.brown,
	linux-kernel, Davide Libenzi

On Thu, Mar 12, 2009 at 03:39:51AM +0100, Eric Dumazet wrote:
> Assuming that __fput() cannot be called from interrupt context.
> -> fput() should not be called from interrupt context as well.

It doesn't get called in irq context.  The atomic dec is done in irq 
context, and if that is the last user of the file descriptor, it's 
placed on a work queue and then released in process context.

		-ben


* Re: [patch] aio: remove aio-max-nr and instead use the memlock rlimit to limit the number of pages pinned for the aio completion ring
  2009-03-12  2:39       ` Eric Dumazet
  2009-03-12  2:44         ` Benjamin LaHaise
@ 2009-03-12  3:09         ` Eric Dumazet
  2009-03-12  5:18           ` [PATCH] fs: fput() can be called from interrupt context Eric Dumazet
  1 sibling, 1 reply; 49+ messages in thread
From: Eric Dumazet @ 2009-03-12  3:09 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Jeff Moyer, Avi Kivity, linux-aio, zach.brown, bcrl,
	linux-kernel, Davide Libenzi, Christoph Lameter

Eric Dumazet wrote:
> Eric Dumazet wrote:
>> Jeff Moyer wrote:
>>> Avi Kivity <avi@redhat.com> writes:
>>>
>>>> Jeff Moyer wrote:
>>>>> Hi,
>>>>>
>>>>> Believe it or not, I get numerous questions from customers about the
>>>>> suggested tuning value of aio-max-nr.  aio-max-nr limits the total
>>>>> number of io events that can be reserved, system wide, for aio
>>>>> completions.  Each time io_setup is called, a ring buffer is allocated
>>>>> that can hold nr_events I/O completions.  That ring buffer is then
>>>>> mapped into the process' address space, and the pages are pinned in
>>>>> memory.  So, the reason for this upper limit (I believe) is to keep a
>>>>> malicious user from pinning all of kernel memory.  Now, this sounds like
>>>>> a much better job for the memlock rlimit to me, hence the following
>>>>> patch.
>>>>>   
>>>> Is it not possible to get rid of the pinning entirely?  Pinning
>>>> interferes with page migration which is important for NUMA, among
>>>> other issues.
>>> aio_complete is called from interrupt handlers, so can't block faulting
>>> in a page.  Zach mentions there is a possibility of handing completions
>>> off to a kernel thread, with all of the performance worries and extra
>>> bookkeeping that go along with such a scheme (to help frame my concerns,
>>> I often get lambasted over .5% performance regressions).
>> This aio_completion from interrupt handlers keep us from using SLAB_DESTROY_BY_RCU
>> instead of call_rcu() for "struct file" freeing.
>>
>> http://lkml.org/lkml/2008/12/17/364
>>
>> I would love if we could get rid of this mess...
> 
> Speaking of that, I tried to take a look at this aio stuff and have one question.
> 
> Assuming that __fput() cannot be called from interrupt context.
> -> fput() should not be called from interrupt context as well.
> 
> How comes we call fput(req->ki_eventfd) from really_put_req()
> from interrupt context ?
> 
> If user program closes eventfd, then inflight AIO requests can trigger
> a bug.
> 

Path could be :

1) fput() changes so that calling it from interrupt context is possible
   (Using a working queue to make sure __fput() is called from process context)

2) Changes aio to use fput() as is (and zap its internal work_queue and aio_fput_routine() stuff)

3) Once atomic_long_dec_and_test(&filp->f_count) only performed in fput(),
   SLAB_DESTROY_BY_RCU for "struct file" get back :)



^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [patch] aio: remove aio-max-nr and instead use the memlock rlimit to limit the number of pages pinned for the aio completion ring
  2009-03-12  2:44         ` Benjamin LaHaise
@ 2009-03-12  3:24           ` Eric Dumazet
  2009-03-12  3:29             ` Benjamin LaHaise
  0 siblings, 1 reply; 49+ messages in thread
From: Eric Dumazet @ 2009-03-12  3:24 UTC (permalink / raw)
  To: Benjamin LaHaise
  Cc: Andrew Morton, Jeff Moyer, Avi Kivity, linux-aio, zach.brown,
	linux-kernel, Davide Libenzi

Benjamin LaHaise a écrit :
> On Thu, Mar 12, 2009 at 03:39:51AM +0100, Eric Dumazet wrote:
>> Assuming that __fput() cannot be called from interrupt context.
>> -> fput() should not be called from interrupt context as well.
> 
> It doesn't get called in irq context.  The atomic dec is done in irq 
> context, and if that is the last user of the file descriptor, it's 
> placed on a work queue and then released in process context.
> 

really_put_req() itself can be called from interrupt context, unless I am wrong.
(it's 4:24 am here ;) )

        if (unlikely(atomic_long_dec_and_test(&req->ki_filp->f_count))) {
                get_ioctx(ctx);
                spin_lock(&fput_lock);
                list_add(&req->ki_list, &fput_head);
                spin_unlock(&fput_lock);
                queue_work(aio_wq, &fput_work);
        } else
                really_put_req(ctx, req); /* from interrupt context */



static inline void really_put_req(struct kioctx *ctx, struct kiocb *req)
{
        assert_spin_locked(&ctx->ctx_lock);

        if (!IS_ERR(req->ki_eventfd))
                fput(req->ki_eventfd);   /* BANG : can be called from interrupt context */
        if (req->ki_dtor)
                req->ki_dtor(req);
        if (req->ki_iovec != &req->ki_inline_vec)
                kfree(req->ki_iovec);
        kmem_cache_free(kiocb_cachep, req);
        ctx->reqs_active--;

        if (unlikely(!ctx->reqs_active && ctx->dead))
                wake_up(&ctx->wait);
}

Thank you


^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [patch] aio: remove aio-max-nr and instead use the memlock rlimit to limit the number of pages pinned for the aio completion ring
  2009-03-12  3:24           ` Eric Dumazet
@ 2009-03-12  3:29             ` Benjamin LaHaise
  2009-03-12  3:33               ` Eric Dumazet
  0 siblings, 1 reply; 49+ messages in thread
From: Benjamin LaHaise @ 2009-03-12  3:29 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Andrew Morton, Jeff Moyer, Avi Kivity, linux-aio, zach.brown,
	linux-kernel, Davide Libenzi

On Thu, Mar 12, 2009 at 04:24:42AM +0100, Eric Dumazet wrote:
>         if (!IS_ERR(req->ki_eventfd))
>                 fput(req->ki_eventfd);   /* BANG : can be called from interrupt context */
...
> Thank you

That's a bug in the eventfd code, not aio.  Davide: please fix.

		-ben

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [patch] aio: remove aio-max-nr and instead use the memlock rlimit to limit the number of pages pinned for the aio completion ring
  2009-03-12  3:29             ` Benjamin LaHaise
@ 2009-03-12  3:33               ` Eric Dumazet
  2009-03-12  3:36                 ` Benjamin LaHaise
  0 siblings, 1 reply; 49+ messages in thread
From: Eric Dumazet @ 2009-03-12  3:33 UTC (permalink / raw)
  To: Benjamin LaHaise
  Cc: Andrew Morton, Jeff Moyer, Avi Kivity, linux-aio, zach.brown,
	linux-kernel, Davide Libenzi

Benjamin LaHaise a écrit :
> On Thu, Mar 12, 2009 at 04:24:42AM +0100, Eric Dumazet wrote:
>>         if (!IS_ERR(req->ki_eventfd))
>>                 fput(req->ki_eventfd);   /* BANG : can be called from interrupt context */
> ...
>> Thank you
> 
> That's a bug in the eventfd code, not aio.  Davide: please fix.
> 

Hmm... what about fget_light() ... is it Davide's fault too ?

aio breaks the fget_light() concept too, if process is mono threaded.

/*
 * Lightweight file lookup - no refcnt increment if fd table isn't shared.
 * You can use this only if it is guranteed that the current task already
 * holds a refcnt to that file. That check has to be done at fget() only
 * and a flag is returned to be passed to the corresponding fput_light().
 * There must not be a cloning between an fget_light/fput_light pair.
 */
struct file *fget_light(unsigned int fd, int *fput_needed)
{
        struct file *file;
        struct files_struct *files = current->files;

        *fput_needed = 0;
        if (likely((atomic_read(&files->count) == 1))) {
                file = fcheck_files(files, fd);
        } else {
                rcu_read_lock();
                file = fcheck_files(files, fd);
                if (file) {
                        if (atomic_long_inc_not_zero(&file->f_count))
                                *fput_needed = 1;
                        else
                                /* Didn't get the reference, someone's freed */
                                file = NULL;
                }
                rcu_read_unlock();
        }

        return file;
}
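
For contrast, a sketch of the usage pattern fget_light()/fput_light() are
meant for: a purely synchronous path where the reference never outlives the
syscall. Illustrative only, not code from any tree:

	static long some_sync_syscall(unsigned int fd)	/* hypothetical */
	{
		struct file *file;
		int fput_needed;
		long ret = -EBADF;

		file = fget_light(fd, &fput_needed);
		if (!file)
			return ret;

		/* all work on 'file' happens here, inside the calling task */
		ret = 0;

		fput_light(file, fput_needed);	/* no-op if no ref was taken */
		return ret;
	}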




^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [patch] aio: remove aio-max-nr and instead use the memlock rlimit to limit the number of pages pinned for the aio completion ring
  2009-03-12  3:33               ` Eric Dumazet
@ 2009-03-12  3:36                 ` Benjamin LaHaise
  2009-03-12  3:40                   ` Eric Dumazet
  0 siblings, 1 reply; 49+ messages in thread
From: Benjamin LaHaise @ 2009-03-12  3:36 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Andrew Morton, Jeff Moyer, Avi Kivity, linux-aio, zach.brown,
	linux-kernel, Davide Libenzi

On Thu, Mar 12, 2009 at 04:33:20AM +0100, Eric Dumazet wrote:
> aio breaks the fget_light() concept too, if process is mono threaded.

AIO requests cannot use fget_light().  The user space program could very 
well use the close() syscall to destroy the file descriptor while an io 
request is in flight, so you cannot avoid the reference counting overhead.
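
The submission side shows why: io_submit_one() takes a full reference that
has to survive a concurrent close().  Roughly, simplified from 2.6.29-era
fs/aio.c:

	file = fget(iocb->aio_fildes);	/* full reference, may outlive close() */
	if (unlikely(!file))
		return -EBADF;
	/* ... allocate the kiocb ... */
	req->ki_filp = file;		/* dropped only when the request is put */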

		-ben

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [patch] aio: remove aio-max-nr and instead use the memlock rlimit to limit the number of pages pinned for the aio completion ring
  2009-03-12  3:36                 ` Benjamin LaHaise
@ 2009-03-12  3:40                   ` Eric Dumazet
  0 siblings, 0 replies; 49+ messages in thread
From: Eric Dumazet @ 2009-03-12  3:40 UTC (permalink / raw)
  To: Benjamin LaHaise
  Cc: Andrew Morton, Jeff Moyer, Avi Kivity, linux-aio, zach.brown,
	linux-kernel, Davide Libenzi

Benjamin LaHaise a écrit :
> On Thu, Mar 12, 2009 at 04:33:20AM +0100, Eric Dumazet wrote:
>> aio breaks the fget_light() concept too, if process is mono threaded.
> 
> AIO requests cannot use fget_light().  The user space program could very 
> well use the close() syscall to destroy the file descriptor while an io 
> request is in flight, so you cannot avoid the reference counting overhead.

Yes, you are right, AIO doesn't break fget_light()



^ permalink raw reply	[flat|nested] 49+ messages in thread

* [PATCH] fs: fput() can be called from interrupt context
  2009-03-12  3:09         ` Eric Dumazet
@ 2009-03-12  5:18           ` Eric Dumazet
  2009-03-12  5:42             ` [PATCH] aio: " Eric Dumazet
  2009-03-12  5:47             ` [PATCH] fs: " Andrew Morton
  0 siblings, 2 replies; 49+ messages in thread
From: Eric Dumazet @ 2009-03-12  5:18 UTC (permalink / raw)
  Cc: Andrew Morton, Jeff Moyer, Avi Kivity, linux-aio, zach.brown,
	bcrl, linux-kernel, Davide Libenzi, Christoph Lameter

Eric Dumazet a écrit :
> Eric Dumazet a écrit :
>> Eric Dumazet a écrit :
>>> Jeff Moyer a écrit :
>>>> Avi Kivity <avi@redhat.com> writes:
>>>>
>>>>> Jeff Moyer wrote:
>>>>>> Hi,
>>>>>>
>>>>>> Believe it or not, I get numerous questions from customers about the
>>>>>> suggested tuning value of aio-max-nr.  aio-max-nr limits the total
>>>>>> number of io events that can be reserved, system wide, for aio
>>>>>> completions.  Each time io_setup is called, a ring buffer is allocated
>>>>>> that can hold nr_events I/O completions.  That ring buffer is then
>>>>>> mapped into the process' address space, and the pages are pinned in
>>>>>> memory.  So, the reason for this upper limit (I believe) is to keep a
>>>>>> malicious user from pinning all of kernel memory.  Now, this sounds like
>>>>>> a much better job for the memlock rlimit to me, hence the following
>>>>>> patch.
>>>>>>   
>>>>> Is it not possible to get rid of the pinning entirely?  Pinning
>>>>> interferes with page migration which is important for NUMA, among
>>>>> other issues.
>>>> aio_complete is called from interrupt handlers, so can't block faulting
>>>> in a page.  Zach mentions there is a possibility of handing completions
>>>> off to a kernel thread, with all of the performance worries and extra
>>>> bookkeeping that go along with such a scheme (to help frame my concerns,
>>>> I often get lambasted over .5% performance regressions).
>>> This aio_completion from interrupt handlers keep us from using SLAB_DESTROY_BY_RCU
>>> instead of call_rcu() for "struct file" freeing.
>>>
>>> http://lkml.org/lkml/2008/12/17/364
>>>
>>> I would love if we could get rid of this mess...
>> Speaking of that, I tried to take a look at this aio stuff and have one question.
>>
>> Assuming that __fput() cannot be called from interrupt context.
>> -> fput() should not be called from interrupt context as well.
>>
>> How comes we call fput(req->ki_eventfd) from really_put_req()
>> from interrupt context ?
>>
>> If user program closes eventfd, then inflight AIO requests can trigger
>> a bug.
>>
> 
> Path could be :
> 
> 1) fput() changes so that calling it from interrupt context is possible
>    (Using a working queue to make sure __fput() is called from process context)
> 
> 2) Changes aio to use fput() as is (and zap its internal work_queue and aio_fput_routine() stuff)
> 
> 3) Once atomic_long_dec_and_test(&filp->f_count) only performed in fput(),
>    SLAB_DESTROY_BY_RCU for "struct file" get back :)
> 

Please find first patch against linux-2.6

Next patch (2) can cleanup aio code, but it probably can wait linux-2.6.30

Thank you

[PATCH] fs: fput() can be called from interrupt context

Current aio/eventfd code can call fput() from interrupt context, which is
not allowed.

In order to fix the problem and prepare SLAB_DESTROY_BY_RCU use for "struct file"
allocation/freeing in 2.6.30, we might extend existing workqueue infrastructure and
allow fput() to be called from interrupt context.

This unfortunately adds a pointer to 'struct file'.

Signed-off-by: Eric Dumazet <dada1@cosmosbay.com>
---
 fs/file.c               |   55 ++++++++++++++++++++++++++------------
 fs/file_table.c         |   10 +++++-
 include/linux/fdtable.h |    1
 include/linux/fs.h      |    1
 4 files changed, 49 insertions(+), 18 deletions(-)

diff --git a/fs/file.c b/fs/file.c
index f313314..8c819a8 100644
--- a/fs/file.c
+++ b/fs/file.c
@@ -24,6 +24,7 @@ struct fdtable_defer {
 	spinlock_t lock;
 	struct work_struct wq;
 	struct fdtable *next;
+	struct file *fhead;
 };
 
 int sysctl_nr_open __read_mostly = 1024*1024;
@@ -67,24 +68,53 @@ static void free_fdtable_work(struct work_struct *work)
 	struct fdtable_defer *f =
 		container_of(work, struct fdtable_defer, wq);
 	struct fdtable *fdt;
+	struct file *fhead;
 
-	spin_lock_bh(&f->lock);
-	fdt = f->next;
-	f->next = NULL;
-	spin_unlock_bh(&f->lock);
-	while(fdt) {
+	spin_lock_irq(&f->lock);
+	fdt      = f->next;
+	fhead    = f->fhead;
+	f->next  = NULL;
+	f->fhead = NULL;
+	spin_unlock_irq(&f->lock);
+
+	while (fdt) {
 		struct fdtable *next = fdt->next;
+
 		vfree(fdt->fd);
 		free_fdset(fdt);
 		kfree(fdt);
 		fdt = next;
 	}
+
+	while (fhead) {
+		struct file *next = fhead->f_next;
+
+		__fput(fhead);
+		fhead = next;
+	}
+}
+
+void fd_defer_queue(struct fdtable *fdt, struct file *file)
+{
+	struct fdtable_defer *fddef = &get_cpu_var(fdtable_defer_list);
+
+	spin_lock_irq(&fddef->lock);
+	if (fdt) {
+		fdt->next = fddef->next;
+		fddef->next = fdt;
+	}
+	if (file) {
+		file->f_next = fddef->fhead;
+		fddef->fhead = file;
+	}
+	spin_unlock_irq(&fddef->lock);
+	schedule_work(&fddef->wq);
+	put_cpu_var(fdtable_defer_list);
 }
 
 void free_fdtable_rcu(struct rcu_head *rcu)
 {
 	struct fdtable *fdt = container_of(rcu, struct fdtable, rcu);
-	struct fdtable_defer *fddef;
 
 	BUG_ON(!fdt);
 
@@ -101,16 +131,8 @@ void free_fdtable_rcu(struct rcu_head *rcu)
 		kfree(fdt->fd);
 		kfree(fdt->open_fds);
 		kfree(fdt);
-	} else {
-		fddef = &get_cpu_var(fdtable_defer_list);
-		spin_lock(&fddef->lock);
-		fdt->next = fddef->next;
-		fddef->next = fdt;
-		/* vmallocs are handled from the workqueue context */
-		schedule_work(&fddef->wq);
-		spin_unlock(&fddef->lock);
-		put_cpu_var(fdtable_defer_list);
-	}
+	} else
+		fd_defer_queue(fdt, NULL);
 }
 
 /*
@@ -410,6 +432,7 @@ static void __devinit fdtable_defer_list_init(int cpu)
 	spin_lock_init(&fddef->lock);
 	INIT_WORK(&fddef->wq, free_fdtable_work);
 	fddef->next = NULL;
+	fddef->fhead = NULL;
 }
 
 void __init files_defer_init(void)
diff --git a/fs/file_table.c b/fs/file_table.c
index bbeeac6..77f42d8 100644
--- a/fs/file_table.c
+++ b/fs/file_table.c
@@ -21,6 +21,7 @@
 #include <linux/fsnotify.h>
 #include <linux/sysctl.h>
 #include <linux/percpu_counter.h>
+#include <linux/interrupt.h>
 
 #include <asm/atomic.h>
 
@@ -220,10 +221,15 @@ int init_file(struct file *file, struct vfsmount *mnt, struct dentry *dentry,
 }
 EXPORT_SYMBOL(init_file);
 
+
 void fput(struct file *file)
 {
-	if (atomic_long_dec_and_test(&file->f_count))
-		__fput(file);
+	if (atomic_long_dec_and_test(&file->f_count)) {
+		if (unlikely(!in_interrupt()))
+			fd_defer_queue(NULL, file);
+		else
+			__fput(file);
+	}
 }
 
 EXPORT_SYMBOL(fput);
diff --git a/include/linux/fdtable.h b/include/linux/fdtable.h
index 09d6c5b..42e5e8e 100644
--- a/include/linux/fdtable.h
+++ b/include/linux/fdtable.h
@@ -35,6 +35,7 @@ struct fdtable {
 	struct fdtable *next;
 };
 
+extern void fd_defer_queue(struct fdtable *fdt, struct file *file);
 /*
  * Open file table structure
  */
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 92734c0..94beb0e 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -872,6 +872,7 @@ struct file {
 #ifdef CONFIG_DEBUG_WRITECOUNT
 	unsigned long f_mnt_write_state;
 #endif
+	struct file		*f_next;
 };
 extern spinlock_t files_lock;
 #define file_list_lock() spin_lock(&files_lock);


^ permalink raw reply related	[flat|nested] 49+ messages in thread

* [PATCH] aio: fput() can be called from interrupt context
  2009-03-12  5:18           ` [PATCH] fs: fput() can be called from interrupt context Eric Dumazet
@ 2009-03-12  5:42             ` Eric Dumazet
  2009-03-12  5:47             ` [PATCH] fs: " Andrew Morton
  1 sibling, 0 replies; 49+ messages in thread
From: Eric Dumazet @ 2009-03-12  5:42 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Jeff Moyer, Avi Kivity, linux-aio, zach.brown, bcrl,
	linux-kernel, Davide Libenzi, Christoph Lameter

Eric Dumazet a écrit :
>> Path could be :
>>
>> 1) fput() changes so that calling it from interrupt context is possible
>>    (Using a working queue to make sure __fput() is called from process context)
>>
>> 2) Changes aio to use fput() as is (and zap its internal work_queue and aio_fput_routine() stuff)
>>
>> 3) Once atomic_long_dec_and_test(&filp->f_count) only performed in fput(),
>>    SLAB_DESTROY_BY_RCU for "struct file" get back :)
>>

Here is the second patch

Thank you

[PATCH] aio: cleanup, since fput() is IRQ safe

Once fput() is IRQ safe, we can clean up the aio code and delete its internal fput work-queue machinery.

Signed-off-by: Eric Dumazet <dada1@cosmosbay.com>
---
 fs/aio.c |   52 ++--------------------------------------------------
 1 files changed, 2 insertions(+), 50 deletions(-)


diff --git a/fs/aio.c b/fs/aio.c
index 8fa77e2..b0351a1 100644
--- a/fs/aio.c
+++ b/fs/aio.c
@@ -53,13 +53,6 @@ static struct kmem_cache	*kioctx_cachep;
 
 static struct workqueue_struct *aio_wq;
 
-/* Used for rare fput completion. */
-static void aio_fput_routine(struct work_struct *);
-static DECLARE_WORK(fput_work, aio_fput_routine);
-
-static DEFINE_SPINLOCK(fput_lock);
-static LIST_HEAD(fput_head);
-
 static void aio_kick_handler(struct work_struct *);
 static void aio_queue_work(struct kioctx *);
 
@@ -469,15 +462,7 @@ static struct kiocb *__aio_get_req(struct kioctx *ctx)
 static inline struct kiocb *aio_get_req(struct kioctx *ctx)
 {
 	struct kiocb *req;
-	/* Handle a potential starvation case -- should be exceedingly rare as 
-	 * requests will be stuck on fput_head only if the aio_fput_routine is 
-	 * delayed and the requests were the last user of the struct file.
-	 */
 	req = __aio_get_req(ctx);
-	if (unlikely(NULL == req)) {
-		aio_fput_routine(NULL);
-		req = __aio_get_req(ctx);
-	}
 	return req;
 }
 
@@ -498,30 +483,6 @@ static inline void really_put_req(struct kioctx *ctx, struct kiocb *req)
 		wake_up(&ctx->wait);
 }
 
-static void aio_fput_routine(struct work_struct *data)
-{
-	spin_lock_irq(&fput_lock);
-	while (likely(!list_empty(&fput_head))) {
-		struct kiocb *req = list_kiocb(fput_head.next);
-		struct kioctx *ctx = req->ki_ctx;
-
-		list_del(&req->ki_list);
-		spin_unlock_irq(&fput_lock);
-
-		/* Complete the fput */
-		__fput(req->ki_filp);
-
-		/* Link the iocb into the context's free list */
-		spin_lock_irq(&ctx->ctx_lock);
-		really_put_req(ctx, req);
-		spin_unlock_irq(&ctx->ctx_lock);
-
-		put_ioctx(ctx);
-		spin_lock_irq(&fput_lock);
-	}
-	spin_unlock_irq(&fput_lock);
-}
-
 /* __aio_put_req
  *	Returns true if this put was the last user of the request.
  */
@@ -540,17 +501,8 @@ static int __aio_put_req(struct kioctx *ctx, struct kiocb *req)
 	req->ki_cancel = NULL;
 	req->ki_retry = NULL;
 
-	/* Must be done under the lock to serialise against cancellation.
-	 * Call this aio_fput as it duplicates fput via the fput_work.
-	 */
-	if (unlikely(atomic_long_dec_and_test(&req->ki_filp->f_count))) {
-		get_ioctx(ctx);
-		spin_lock(&fput_lock);
-		list_add(&req->ki_list, &fput_head);
-		spin_unlock(&fput_lock);
-		queue_work(aio_wq, &fput_work);
-	} else
-		really_put_req(ctx, req);
+	fput(req->ki_filp);
+	really_put_req(ctx, req);
 	return 1;
 }
 


^ permalink raw reply related	[flat|nested] 49+ messages in thread

* Re: [PATCH] fs: fput() can be called from interrupt context
  2009-03-12  5:18           ` [PATCH] fs: fput() can be called from interrupt context Eric Dumazet
  2009-03-12  5:42             ` [PATCH] aio: " Eric Dumazet
@ 2009-03-12  5:47             ` Andrew Morton
  2009-03-12  6:10               ` Eric Dumazet
  1 sibling, 1 reply; 49+ messages in thread
From: Andrew Morton @ 2009-03-12  5:47 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Jeff Moyer, Avi Kivity, linux-aio, zach.brown, bcrl,
	linux-kernel, Davide Libenzi, Christoph Lameter

On Thu, 12 Mar 2009 06:18:26 +0100 Eric Dumazet <dada1@cosmosbay.com> wrote:

> Eric Dumazet a écrit :
> > Eric Dumazet a écrit :
> >> Eric Dumazet a écrit :
> >>> Jeff Moyer a écrit :
> >>>> Avi Kivity <avi@redhat.com> writes:
> >>>>
> >>>>> Jeff Moyer wrote:
> >>>>>> Hi,
> >>>>>>
> >>>>>> Believe it or not, I get numerous questions from customers about the
> >>>>>> suggested tuning value of aio-max-nr.  aio-max-nr limits the total
> >>>>>> number of io events that can be reserved, system wide, for aio
> >>>>>> completions.  Each time io_setup is called, a ring buffer is allocated
> >>>>>> that can hold nr_events I/O completions.  That ring buffer is then
> >>>>>> mapped into the process' address space, and the pages are pinned in
> >>>>>> memory.  So, the reason for this upper limit (I believe) is to keep a
> >>>>>> malicious user from pinning all of kernel memory.  Now, this sounds like
> >>>>>> a much better job for the memlock rlimit to me, hence the following
> >>>>>> patch.
> >>>>>>   
> >>>>> Is it not possible to get rid of the pinning entirely?  Pinning
> >>>>> interferes with page migration which is important for NUMA, among
> >>>>> other issues.
> >>>> aio_complete is called from interrupt handlers, so can't block faulting
> >>>> in a page.  Zach mentions there is a possibility of handing completions
> >>>> off to a kernel thread, with all of the performance worries and extra
> >>>> bookkeeping that go along with such a scheme (to help frame my concerns,
> >>>> I often get lambasted over .5% performance regressions).
> >>> This aio_completion from interrupt handlers keep us from using SLAB_DESTROY_BY_RCU
> >>> instead of call_rcu() for "struct file" freeing.
> >>>
> >>> http://lkml.org/lkml/2008/12/17/364
> >>>
> >>> I would love if we could get rid of this mess...
> >> Speaking of that, I tried to take a look at this aio stuff and have one question.
> >>
> >> Assuming that __fput() cannot be called from interrupt context.
> >> -> fput() should not be called from interrupt context as well.
> >>
> >> How comes we call fput(req->ki_eventfd) from really_put_req()
> >> from interrupt context ?
> >>
> >> If user program closes eventfd, then inflight AIO requests can trigger
> >> a bug.
> >>
> > 
> > Path could be :
> > 
> > 1) fput() changes so that calling it from interrupt context is possible
> >    (Using a working queue to make sure __fput() is called from process context)
> > 
> > 2) Changes aio to use fput() as is (and zap its internal work_queue and aio_fput_routine() stuff)
> > 
> > 3) Once atomic_long_dec_and_test(&filp->f_count) only performed in fput(),
> >    SLAB_DESTROY_BY_RCU for "struct file" get back :)
> > 
> 
> Please find first patch against linux-2.6
> 
> Next patch (2) can cleanup aio code, but it probably can wait linux-2.6.30
> 
> Thank you
> 
> [PATCH] fs: fput() can be called from interrupt context
> 
> Current aio/eventfd code can call fput() from interrupt context, which is
> not allowed.

The changelog forgot to tell us where this happens, and under what
circumstances.

See, there might be other ways of fixing the bug,

> In order to fix the problem and prepare SLAB_DESTROY_BY_RCU use for "struct file"
> allocation/freeing in 2.6.30, we might extend existing workqueue infrastructure and
> allow fput() to be called from interrupt context.
> 
> This unfortunately adds a pointer to 'struct file'.
> 
> Signed-off-by: Eric Dumazet <dada1@cosmosbay.com>
> ---
>  fs/file.c               |   55 ++++++++++++++++++++++++++------------
>  fs/file_table.c         |   10 +++++-
>  include/linux/fdtable.h |    1
>  include/linux/fs.h      |    1
>  4 files changed, 49 insertions(+), 18 deletions(-)

which might not have some or all of the above problems.


I assume you're referring to really_put_req(), and commit
9c3060bedd84144653a2ad7bea32389f65598d40.

From the above email straggle I extract "If user program closes
eventfd, then inflight AIO requests can trigger a bug" and I don't
immediately see anything in there which would prevent this.

Did you reproduce the bug, and confirm that the patch fixes it?

Are there simpler ways of fixing it?  Maybe sneak a call to
wait_for_all_aios() into the right place?  I doubt if it's performance
critical, as nobody seems to have ever hit the bug.

Bear in mind that if the bug _is_ real then it's now out there, and
we would like a fix which is usable by 2.6.<two-years-worth>.

etcetera..

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [PATCH] fs: fput() can be called from interrupt context
  2009-03-12  5:47             ` [PATCH] fs: " Andrew Morton
@ 2009-03-12  6:10               ` Eric Dumazet
  2009-03-12  6:39                 ` Andrew Morton
  0 siblings, 1 reply; 49+ messages in thread
From: Eric Dumazet @ 2009-03-12  6:10 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Jeff Moyer, Avi Kivity, linux-aio, zach.brown, bcrl,
	linux-kernel, Davide Libenzi, Christoph Lameter

Andrew Morton a écrit :
> On Thu, 12 Mar 2009 06:18:26 +0100 Eric Dumazet <dada1@cosmosbay.com> wrote:
> 
>> Eric Dumazet wrote :
>>> Path could be :
>>>
>>> 1) fput() changes so that calling it from interrupt context is possible
>>>    (Using a working queue to make sure __fput() is called from process context)
>>>
>>> 2) Changes aio to use fput() as is (and zap its internal work_queue and aio_fput_routine() stuff)
>>>
>>> 3) Once atomic_long_dec_and_test(&filp->f_count) only performed in fput(),
>>>    SLAB_DESTROY_BY_RCU for "struct file" get back :)
>>>
>> Please find first patch against linux-2.6
>>
>> Next patch (2) can cleanup aio code, but it probably can wait linux-2.6.30
>>
>> Thank you
>>
>> [PATCH] fs: fput() can be called from interrupt context
>>
>> Current aio/eventfd code can call fput() from interrupt context, which is
>> not allowed.
> 
> The changelog forgot to tell us where this happens, and under what
> circumstances.
> 
> See, there might be other ways of fixing the bug,

Sure

> 
>> In order to fix the problem and prepare SLAB_DESTROY_BY_RCU use for "struct file"
>> allocation/freeing in 2.6.30, we might extend existing workqueue infrastructure and
>> allow fput() to be called from interrupt context.
>>
>> This unfortunately adds a pointer to 'struct file'.
>>
>> Signed-off-by: Eric Dumazet <dada1@cosmosbay.com>
>> ---
>>  fs/file.c               |   55 ++++++++++++++++++++++++++------------
>>  fs/file_table.c         |   10 +++++-
>>  include/linux/fdtable.h |    1
>>  include/linux/fs.h      |    1
>>  4 files changed, 49 insertions(+), 18 deletions(-)
> 
> which might not have some or all of the above problems.
> 
> 
> I assume you're referring to really_put_req(), and commit
> 9c3060bedd84144653a2ad7bea32389f65598d40.
> 
> From the above email straggle I extract "If user program closes
> eventfd, then inflight AIO requests can trigger a bug" and I don't
> immediately see anything in there which would prevent this.
> 
> Did you reproduce the bug, and confirm that the patch fixes it?

take Davide program : http://www.xmailserver.org/eventfd-aio-test.c

and add at line 318 :
close(afd);

It should produce the kernel bug...
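
For anyone without that program at hand, a stand-alone sketch of the same
idea (hypothetical and untested; the file path and sizes are placeholders,
whether it actually reaches the interrupt-time fput() depends on the device
completing the O_DIRECT read in irq context, and as the rest of the thread
notes nobody has turned this into a visible crash):

#define _GNU_SOURCE
#include <fcntl.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>
#include <sys/eventfd.h>
#include <sys/syscall.h>
#include <linux/aio_abi.h>

static long sys_io_setup(unsigned nr, aio_context_t *ctx)
{
	return syscall(__NR_io_setup, nr, ctx);
}

static long sys_io_submit(aio_context_t ctx, long nr, struct iocb **iocbs)
{
	return syscall(__NR_io_submit, ctx, nr, iocbs);
}

int main(void)
{
	aio_context_t ctx = 0;
	struct iocb cb, *cbs[1] = { &cb };
	void *buf;
	int fd, efd;

	fd  = open("/tmp/testfile", O_RDONLY | O_DIRECT);	/* assumed to exist */
	efd = eventfd(0, 0);
	if (fd < 0 || efd < 0 || posix_memalign(&buf, 512, 4096))
		return 1;
	if (sys_io_setup(16, &ctx))
		return 1;

	memset(&cb, 0, sizeof(cb));
	cb.aio_lio_opcode = IOCB_CMD_PREAD;
	cb.aio_fildes     = fd;
	cb.aio_buf        = (unsigned long)buf;
	cb.aio_nbytes     = 4096;
	cb.aio_flags      = IOCB_FLAG_RESFD;	/* completion signals the eventfd */
	cb.aio_resfd      = efd;

	if (sys_io_submit(ctx, 1, cbs) != 1)
		return 1;

	close(efd);	/* drop our eventfd reference while the read is in flight */
	sleep(1);	/* the completion path may now hold the last reference */
	return 0;
}
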
> 
> Are there simpler ways of fixing it?  Maybe sneak a call to
> wait_for_all_aios() into the right place?  I doubt if it's performance
> critical, as nobody seems to have ever hit the bug.

Take the time to check how fs/aio.c handle the fput(req->ki_filp) case
(or read my 2nd patch, it should spot the thing)

If you want to add another kludge to properly fput(req->ki_eventfd),
be my guest :-(

> 
> Bear in mind that if the bug _is_ real then it's now out there, and
> we would like a fix which is usable by 2.6.<two-years-worth>.
> 
> etcetera..





^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [PATCH] fs: fput() can be called from interrupt context
  2009-03-12  6:10               ` Eric Dumazet
@ 2009-03-12  6:39                 ` Andrew Morton
  2009-03-12 13:39                   ` Davide Libenzi
  2009-03-12 19:22                   ` [PATCH] fs: fput() can be called from interrupt context Eric Dumazet
  0 siblings, 2 replies; 49+ messages in thread
From: Andrew Morton @ 2009-03-12  6:39 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Jeff Moyer, Avi Kivity, linux-aio, zach.brown, bcrl,
	linux-kernel, Davide Libenzi, Christoph Lameter

On Thu, 12 Mar 2009 07:10:38 +0100 Eric Dumazet <dada1@cosmosbay.com> wrote:

> > 
> > Did you reproduce the bug, and confirm that the patch fixes it?
> 
> take Davide program : http://www.xmailserver.org/eventfd-aio-test.c
> 
> and add at line 318 :
> close(afd);
> 
> It should produce the kernel bug...

"should"?

> > 
> > Are there simpler ways of fixing it?  Maybe sneak a call to
> > wait_for_all_aios() into the right place?  I doubt if it's performance
> > critical, as nobody seems to have ever hit the bug.
> 
> Take the time to check how fs/aio.c handle the fput(req->ki_filp) case
> (or read my 2nd patch, it should spot the thing)

Well yes, a kludge like that seems a bit safer.

It's somewhat encouraging that we're apparently already doing fput()
from within keventd (although how frequently?).  There might be
problems with file locking, security code, etc from doing fput() from
an unexpected thread.  And then there are all the usual weird problem
with using the keventd queues which take a long time to get discovered.


> If you want to add another kludge to properly fput(req->ki_eventfd),
> be my guest :-(
> 
> > 
> > Bear in mind that if the bug _is_ real then it's now out there, and
> > we would like a fix which is usable by 2.6.<two-years-worth>.

The patches are large and scary and it would be a real problem to merge
them into 2.6.29 at this stage, let alone 2.6.25, etc.

Especially as the code which you sent out appears to be untested:

>  void fput(struct file *file)
>  {
> -	if (atomic_long_dec_and_test(&file->f_count))
> -		__fput(file);
> +	if (atomic_long_dec_and_test(&file->f_count)) {
> +		if (unlikely(!in_interrupt()))

                             ^

> +			fd_defer_queue(NULL, file);
> +		else
> +			__fput(file);
> +	}
>  }



^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [PATCH] fs: fput() can be called from interrupt context
  2009-03-12  6:39                 ` Andrew Morton
@ 2009-03-12 13:39                   ` Davide Libenzi
  2009-03-13 22:34                     ` Davide Libenzi
  2009-03-13 23:28                     ` Trond Myklebust
  2009-03-12 19:22                   ` [PATCH] fs: fput() can be called from interrupt context Eric Dumazet
  1 sibling, 2 replies; 49+ messages in thread
From: Davide Libenzi @ 2009-03-12 13:39 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Eric Dumazet, Jeff Moyer, Avi Kivity, linux-aio, zach.brown,
	Benjamin LaHaise, Linux Kernel Mailing List, Christoph Lameter

On Wed, 11 Mar 2009, Andrew Morton wrote:

> > Take the time to check how fs/aio.c handle the fput(req->ki_filp) case
> > (or read my 2nd patch, it should spot the thing)
> 
> Well yes, a kludge like that seems a bit safer.
> 
> It's somewhat encouraging that we're apparently already doing fput()
> from within keventd (although how frequently?).  There might be
> problems with file locking, security code, etc from doing fput() from
> an unexpected thread.  And then there are all the usual weird problem
> with using the keventd queues which take a long time to get discovered.

Would it be a huge problem, performance-wise, to stop making ->f_count 
tricks in __aio_put_req, and always offload to fput_work the task of 
releasing the requests?
If that's a huge problem, IMO the lower impact fix would be to use 
aio_fput_routine to loop into a second list, releasing the eventual 
eventfd file*. There's no need, IMO, to turn the whole fput() into 
IRQ-callable just for this case, when we can contain it into the 
particular KAIO+eventfd usage.


- Davide



^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [PATCH] fs: fput() can be called from interrupt context
  2009-03-12  6:39                 ` Andrew Morton
  2009-03-12 13:39                   ` Davide Libenzi
@ 2009-03-12 19:22                   ` Eric Dumazet
  2009-03-12 20:21                     ` Andrew Morton
  1 sibling, 1 reply; 49+ messages in thread
From: Eric Dumazet @ 2009-03-12 19:22 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Jeff Moyer, Avi Kivity, linux-aio, zach.brown, bcrl,
	linux-kernel, Davide Libenzi, Christoph Lameter

Andrew Morton a écrit :
> On Thu, 12 Mar 2009 07:10:38 +0100 Eric Dumazet <dada1@cosmosbay.com> wrote:
> 
>>> Did you reproduce the bug, and confirm that the patch fixes it?
>> take Davide program : http://www.xmailserver.org/eventfd-aio-test.c
>>
>> and add at line 318 :
>> close(afd);
>>
>> It should produce the kernel bug...
> 
> "should"?

Sorry I had to run this morning, so I was a little bit lazy...

Maybe some kind of O_DIRECT / NFS / blockdev setup or something calling 
aio_complete() from interrupt context ? I am not familiar with this part,
but considering the spin_lock_irq()/spin_unlock_irq() calls in fs/aio.c,
there must be a reason for this ?
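
One plausible chain, sketched from 2.6.29-era sources; treat the exact
function names as a guess rather than a verified trace:

	/*
	 * O_DIRECT read submitted via io_submit(), completing in irq context:
	 *
	 *   block layer bio completion (hard/soft irq)
	 *     -> dio_bio_end_aio()              [fs/direct-io.c]
	 *       -> aio_complete(dio->iocb, ...) [fs/aio.c]
	 *         -> __aio_put_req()            (ctx->ctx_lock held, still in irq)
	 *           -> really_put_req()
	 *             -> fput(req->ki_eventfd)  <- the put in question
	 */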

> 
>>> Are there simpler ways of fixing it?  Maybe sneak a call to
>>> wait_for_all_aios() into the right place?  I doubt if it's performance
>>> critical, as nobody seems to have ever hit the bug.
>> Take the time to check how fs/aio.c handle the fput(req->ki_filp) case
>> (or read my 2nd patch, it should spot the thing)
> 
> Well yes, a kludge like that seems a bit safer.
> 
> It's somewhat encouraging that we're apparently already doing fput()
> from within keventd (although how frequently?).  There might be
> problems with file locking, security code, etc from doing fput() from
> an unexpected thread.  And then there are all the usual weird problem
> with using the keventd queues which take a long time to get discovered.
> 

Hum... Do you mean "if (in_interrupt())" is not the right test to
perform ? 

> 
>> If you want to add another kludge to properly fput(req->ki_eventfd),
>> be my guest :-(
>>
>>> Bear in mind that if the bug _is_ real then it's now out there, and
>>> we would like a fix which is usable by 2.6.<two-years-worth>.
> 
> The patches are large and scary and it would be a real problem to merge
> them into 2.6.29 at this stage, let alone 2.6.25, etc.
> 
> Especially as the code which you sent out appears to be untested:
> 

I actually tested it and got a working kernel before sending the patch,
please trust me...

Oh yes, my dev machine always took the slow path, which is why I didn't
notice.

Thank you for spotting this inverted test.

>>  void fput(struct file *file)
>>  {
>> -	if (atomic_long_dec_and_test(&file->f_count))
>> -		__fput(file);
>> +	if (atomic_long_dec_and_test(&file->f_count)) {
>> +		if (unlikely(!in_interrupt()))
> 
>                              ^
> 
>> +			fd_defer_queue(NULL, file);
>> +		else
>> +			__fput(file);
>> +	}
>>  }
> 


^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [PATCH] fs: fput() can be called from interrupt context
  2009-03-12 19:22                   ` [PATCH] fs: fput() can be called from interrupt context Eric Dumazet
@ 2009-03-12 20:21                     ` Andrew Morton
  0 siblings, 0 replies; 49+ messages in thread
From: Andrew Morton @ 2009-03-12 20:21 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: jmoyer, avi, linux-aio, zach.brown, bcrl, linux-kernel, davidel, cl

On Thu, 12 Mar 2009 20:22:06 +0100
Eric Dumazet <dada1@cosmosbay.com> wrote:

> > It's somewhat encouraging that we're apparently already doing fput()
> > from within keventd (although how frequently?).  There might be
> > problems with file locking, security code, etc from doing fput() from
> > an unexpected thread.  And then there are all the usual weird problem
> > with using the keventd queues which take a long time to get discovered.
> > 
> 
> Hum... Do you mean "if (in_interrupt())" is not the right test to
> perform ? 

No, in_interrupt() seems appropriate.

It's a bit weak, because it means that the proposed code can't be used
for calling fput() while holding a spinlock.  A better interface would be to
add a new fput_atomic() and let the caller decide which functions
should be called, depending upon what context the caller is running in.
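
A purely hypothetical sketch of such an interface, reusing the
fd_defer_queue() helper from Eric's patch; not code from any tree, and an
eventual in-kernel fput_atomic() may well end up looking different:

	/*
	 * Caller states explicitly that it may be running in atomic context,
	 * so the final put is always deferred instead of guessed at with
	 * in_interrupt().
	 */
	void fput_atomic(struct file *file)
	{
		if (atomic_long_dec_and_test(&file->f_count))
			fd_defer_queue(NULL, file);
	}

	/* fput() itself stays as-is and remains process-context only. */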




^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [PATCH] fs: fput() can be called from interrupt context
  2009-03-12 13:39                   ` Davide Libenzi
@ 2009-03-13 22:34                     ` Davide Libenzi
  2009-03-13 22:43                       ` Eric Dumazet
  2009-03-13 23:28                     ` Trond Myklebust
  1 sibling, 1 reply; 49+ messages in thread
From: Davide Libenzi @ 2009-03-13 22:34 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Andrew Morton, Jeff Moyer, Avi Kivity, linux-aio, zach.brown,
	Benjamin LaHaise, Linux Kernel Mailing List, Christoph Lameter

On Thu, 12 Mar 2009, Davide Libenzi wrote:

> On Wed, 11 Mar 2009, Andrew Morton wrote:
> 
> > > Take the time to check how fs/aio.c handle the fput(req->ki_filp) case
> > > (or read my 2nd patch, it should spot the thing)
> > 
> > Well yes, a kludge like that seems a bit safer.
> > 
> > It's somewhat encouraging that we're apparently already doing fput()
> > from within keventd (although how frequently?).  There might be
> > problems with file locking, security code, etc from doing fput() from
> > an unexpected thread.  And then there are all the usual weird problem
> > with using the keventd queues which take a long time to get discovered.
> 
> Would it be a huge problem, performance-wise, to stop making ->f_count 
> tricks in __aio_put_req, and always offload to fput_work the task of 
> releasing the requests?
> If that's a huge problem, IMO the lower impact fix would be to use 
> aio_fput_routine to loop into a second list, releasing the eventual 
> eventfd file*. There's no need, IMO, to turn the whole fput() into 
> IRQ-callable just for this case, when we can contain it into the 
> particular KAIO+eventfd usage.

Eric, are you working on this or should I? I missed where the conversation 
between you and Andrew is, at this point, WRT this issue.


- Davide



^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [PATCH] fs: fput() can be called from interrupt context
  2009-03-13 22:34                     ` Davide Libenzi
@ 2009-03-13 22:43                       ` Eric Dumazet
  0 siblings, 0 replies; 49+ messages in thread
From: Eric Dumazet @ 2009-03-13 22:43 UTC (permalink / raw)
  To: Davide Libenzi
  Cc: Andrew Morton, Jeff Moyer, Avi Kivity, linux-aio, zach.brown,
	Benjamin LaHaise, Linux Kernel Mailing List, Christoph Lameter

Davide Libenzi a écrit :
> On Thu, 12 Mar 2009, Davide Libenzi wrote:
> 
>> On Wed, 11 Mar 2009, Andrew Morton wrote:
>>
>>>> Take the time to check how fs/aio.c handle the fput(req->ki_filp) case
>>>> (or read my 2nd patch, it should spot the thing)
>>> Well yes, a kludge like that seems a bit safer.
>>>
>>> It's somewhat encouraging that we're apparently already doing fput()
>>> from within keventd (although how frequently?).  There might be
>>> problems with file locking, security code, etc from doing fput() from
>>> an unexpected thread.  And then there are all the usual weird problem
>>> with using the keventd queues which take a long time to get discovered.
>> Would it be a huge problem, performance-wise, to stop making ->f_count 
>> tricks in __aio_put_req, and always offload to fput_work the task of 
>> releasing the requests?
>> If that's a huge problem, IMO the lower impact fix would be to use 
>> aio_fput_routine to loop into a second list, releasing the eventual 
>> eventfd file*. There's no need, IMO, to turn the whole fput() into 
>> IRQ-callable just for this case, when we can contain it into the 
>> particular KAIO+eventfd usage.
> 
> Eric, are you working on this or should I? I missed where the conversation 
> between you and Andrew is, at this point, WRT this issue.

I wish I could, but I currently have too much stress from my day job.

BTW, I could not produce a crash, so this problem is hypothetical for the moment :)



^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [PATCH] fs: fput() can be called from interrupt context
  2009-03-12 13:39                   ` Davide Libenzi
  2009-03-13 22:34                     ` Davide Libenzi
@ 2009-03-13 23:28                     ` Trond Myklebust
  2009-03-14  1:40                       ` Davide Libenzi
  1 sibling, 1 reply; 49+ messages in thread
From: Trond Myklebust @ 2009-03-13 23:28 UTC (permalink / raw)
  To: Davide Libenzi
  Cc: Andrew Morton, Eric Dumazet, Jeff Moyer, Avi Kivity, linux-aio,
	zach.brown, Benjamin LaHaise, Linux Kernel Mailing List,
	Christoph Lameter

On Thu, 2009-03-12 at 06:39 -0700, Davide Libenzi wrote:
> On Wed, 11 Mar 2009, Andrew Morton wrote:
> 
> > > Take the time to check how fs/aio.c handle the fput(req->ki_filp) case
> > > (or read my 2nd patch, it should spot the thing)
> > 
> > Well yes, a kludge like that seems a bit safer.
> > 
> > It's somewhat encouraging that we're apparently already doing fput()
> > from within keventd (although how frequently?).  There might be
> > problems with file locking, security code, etc from doing fput() from
> > an unexpected thread.  And then there are all the usual weird problem
> > with using the keventd queues which take a long time to get discovered.
> 
> Would it be a huge problem, performance-wise, to stop making ->f_count 
> tricks in __aio_put_req, and always offload to fput_work the task of 
> releasing the requests?
> If that's a huge problem, IMO the lower impact fix would be to use 
> aio_fput_routine to loop into a second list, releasing the eventual 
> eventfd file*. There's no need, IMO, to turn the whole fput() into 
> IRQ-callable just for this case, when we can contain it into the 
> particular KAIO+eventfd usage.
> 

Do you really want to see eventd doing umounts and remote flock() calls?
This really needs to be run in a thread that can cope with __long__
waits, unavailable servers, etc...
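
Roughly what __fput() ends up doing, paraphrased from 2.6.29-era
fs/file_table.c (abbreviated, and the details may differ):

	/*
	 *   __fput(file)
	 *     eventpoll_release(file);
	 *     locks_remove_flock(file);           may talk to a remote lock server
	 *     file->f_op->release(inode, file);   filesystem specific, may block
	 *     security_file_free(file);
	 *     dput(dentry);
	 *     mntput(mnt);                        can end up dropping the last mount ref
	 */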

Trond


^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [PATCH] fs: fput() can be called from interrupt context
  2009-03-13 23:28                     ` Trond Myklebust
@ 2009-03-14  1:40                       ` Davide Libenzi
  2009-03-14  4:02                         ` Trond Myklebust
  0 siblings, 1 reply; 49+ messages in thread
From: Davide Libenzi @ 2009-03-14  1:40 UTC (permalink / raw)
  To: Trond Myklebust
  Cc: Andrew Morton, Eric Dumazet, Jeff Moyer, Avi Kivity, linux-aio,
	zach.brown, Benjamin LaHaise, Linux Kernel Mailing List,
	Christoph Lameter

On Fri, 13 Mar 2009, Trond Myklebust wrote:

> On Thu, 2009-03-12 at 06:39 -0700, Davide Libenzi wrote:
> > On Wed, 11 Mar 2009, Andrew Morton wrote:
> > 
> > > > Take the time to check how fs/aio.c handle the fput(req->ki_filp) case
> > > > (or read my 2nd patch, it should spot the thing)
> > > 
> > > Well yes, a kludge like that seems a bit safer.
> > > 
> > > It's somewhat encouraging that we're apparently already doing fput()
> > > from within keventd (although how frequently?).  There might be
> > > problems with file locking, security code, etc from doing fput() from
> > > an unexpected thread.  And then there are all the usual weird problem
> > > with using the keventd queues which take a long time to get discovered.
> > 
> > Would it be a huge problem, performance-wise, to stop making ->f_count 
> > tricks in __aio_put_req, and always offload to fput_work the task of 
> > releasing the requests?
> > If that's a huge problem, IMO the lower impact fix would be to use 
> > aio_fput_routine to loop into a second list, releasing the eventual 
> > eventfd file*. There's no need, IMO, to turn the whole fput() into 
> > IRQ-callable just for this case, when we can contain it into the 
> > particular KAIO+eventfd usage.
> > 
> 
> Do you really want to see eventd doing umounts and remote flock() calls?
> This really needs to be run in a thread that can cope with __long__
> waits, unavailable servers, etc...

Did I miss something? This wouldn't be (eventually) on keventd shoulders, 
but on aio work queue (aio_wq).
I guess we could do the same optimization we're already doing for ki_filp, 
for ki_eventfd ...



- Davide


---
 fs/aio.c |   32 ++++++++++++++++++++++++--------
 1 file changed, 24 insertions(+), 8 deletions(-)

Index: linux-2.6.mod/fs/aio.c
===================================================================
--- linux-2.6.mod.orig/fs/aio.c	2009-03-13 18:19:34.000000000 -0700
+++ linux-2.6.mod/fs/aio.c	2009-03-13 18:32:25.000000000 -0700
@@ -485,8 +485,6 @@ static inline void really_put_req(struct
 {
 	assert_spin_locked(&ctx->ctx_lock);
 
-	if (!IS_ERR(req->ki_eventfd))
-		fput(req->ki_eventfd);
 	if (req->ki_dtor)
 		req->ki_dtor(req);
 	if (req->ki_iovec != &req->ki_inline_vec)
@@ -508,8 +506,11 @@ static void aio_fput_routine(struct work
 		list_del(&req->ki_list);
 		spin_unlock_irq(&fput_lock);
 
-		/* Complete the fput */
-		__fput(req->ki_filp);
+		/* Complete the fput(s) */
+		if (req->ki_filp != NULL)
+			__fput(req->ki_filp);
+		if (!IS_ERR(req->ki_eventfd))
+			__fput(req->ki_eventfd);
 
 		/* Link the iocb into the context's free list */
 		spin_lock_irq(&ctx->ctx_lock);
@@ -527,12 +528,14 @@ static void aio_fput_routine(struct work
  */
 static int __aio_put_req(struct kioctx *ctx, struct kiocb *req)
 {
+	int schedule_putreq = 0;
+
 	dprintk(KERN_DEBUG "aio_put(%p): f_count=%ld\n",
 		req, atomic_long_read(&req->ki_filp->f_count));
 
 	assert_spin_locked(&ctx->ctx_lock);
 
-	req->ki_users --;
+	req->ki_users--;
 	BUG_ON(req->ki_users < 0);
 	if (likely(req->ki_users))
 		return 0;
@@ -540,10 +543,23 @@ static int __aio_put_req(struct kioctx *
 	req->ki_cancel = NULL;
 	req->ki_retry = NULL;
 
-	/* Must be done under the lock to serialise against cancellation.
-	 * Call this aio_fput as it duplicates fput via the fput_work.
+	/*
+	 * Try to optimize the aio and eventfd file* puts, by avoiding to
+	 * schedule work in case it is not __fput() time. In normal cases,
+	 * we wouldn't be holding the last reference to the file*, so
+	 * this function will be executed w/out any aio kthread wakeup.
 	 */
-	if (unlikely(atomic_long_dec_and_test(&req->ki_filp->f_count))) {
+	if (unlikely(atomic_long_dec_and_test(&req->ki_filp->f_count)))
+		schedule_putreq++;
+	else
+		req->ki_filp = NULL;
+	if (unlikely(!IS_ERR(req->ki_eventfd))) {
+		if (unlikely(atomic_long_dec_and_test(&req->ki_eventfd->f_count)))
+			schedule_putreq++;
+		else
+			req->ki_eventfd = ERR_PTR(-EINVAL);
+	}
+	if (unlikely(schedule_putreq)) {
 		get_ioctx(ctx);
 		spin_lock(&fput_lock);
 		list_add(&req->ki_list, &fput_head);


^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [PATCH] fs: fput() can be called from interrupt context
  2009-03-14  1:40                       ` Davide Libenzi
@ 2009-03-14  4:02                         ` Trond Myklebust
  2009-03-14 14:32                           ` Davide Libenzi
  0 siblings, 1 reply; 49+ messages in thread
From: Trond Myklebust @ 2009-03-14  4:02 UTC (permalink / raw)
  To: Davide Libenzi
  Cc: Andrew Morton, Eric Dumazet, Jeff Moyer, Avi Kivity, linux-aio,
	zach.brown, Benjamin LaHaise, Linux Kernel Mailing List,
	Christoph Lameter

On Fri, 2009-03-13 at 18:40 -0700, Davide Libenzi wrote:
> On Fri, 13 Mar 2009, Trond Myklebust wrote:
> 
> > On Thu, 2009-03-12 at 06:39 -0700, Davide Libenzi wrote:
> > > On Wed, 11 Mar 2009, Andrew Morton wrote:
> > > 
> > > > > Take the time to check how fs/aio.c handle the fput(req->ki_filp) case
> > > > > (or read my 2nd patch, it should spot the thing)
> > > > 
> > > > Well yes, a kludge like that seems a bit safer.
> > > > 
> > > > It's somewhat encouraging that we're apparently already doing fput()
> > > > from within keventd (although how frequently?).  There might be
> > > > problems with file locking, security code, etc from doing fput() from
> > > > an unexpected thread.  And then there are all the usual weird problem
> > > > with using the keventd queues which take a long time to get discovered.
> > > 
> > > Would it be a huge problem, performance-wise, to stop making ->f_count 
> > > tricks in __aio_put_req, and always offload to fput_work the task of 
> > > releasing the requests?
> > > If that's a huge problem, IMO the lower impact fix would be to use 
> > > aio_fput_routine to loop into a second list, releasing the eventual 
> > > eventfd file*. There's no need, IMO, to turn the whole fput() into 
> > > IRQ-callable just for this case, when we can contain it into the 
> > > particular KAIO+eventfd usage.
> > > 
> > 
> > Do you really want to see eventd doing umounts and remote flock() calls?
> > This really needs to be run in a thread that can cope with __long__
> > waits, unavailable servers, etc...
> 
> Did I miss something? This wouldn't be (eventually) on keventd shoulders, 
> but on aio work queue (aio_wq).
> I guess we could do the same optimization we're already doing for ki_filp, 
> for ki_eventfd ...

Last I checked, a call to schedule_work(&fddef->wq) would still run
fd_defer_queue() under keventd.

That said, even if you were to run it under aio_wq, the argument remains
the same: do you really want to add potentially long lasting sleeps onto
a work queue that was designed to service fast i/o requests?

Trond


^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [PATCH] fs: fput() can be called from interrupt context
  2009-03-14  4:02                         ` Trond Myklebust
@ 2009-03-14 14:32                           ` Davide Libenzi
  2009-03-15  1:36                             ` [patch] eventfd - remove fput() call from possible IRQ context Davide Libenzi
  0 siblings, 1 reply; 49+ messages in thread
From: Davide Libenzi @ 2009-03-14 14:32 UTC (permalink / raw)
  To: Trond Myklebust
  Cc: Andrew Morton, Eric Dumazet, Jeff Moyer, Avi Kivity, linux-aio,
	zach.brown, Benjamin LaHaise, Linux Kernel Mailing List,
	Christoph Lameter

On Sat, 14 Mar 2009, Trond Myklebust wrote:

> > Did I miss something? This wouldn't be (eventually) on keventd shoulders, 
> > but on aio work queue (aio_wq).
> > I guess we could do the same optimization we're already doing for ki_filp, 
> > for ki_eventfd ...
> 
> Last I checked, a call to schedule_work(&fddef->wq) would still run
> fd_defer_queue() under keventd.

Oh, you were commenting about Eric's patch-set. I thought you were talking 
about the current status (which my reply to Andrew was hinting at), and was 
wondering for a while where fd_defer_queue() came from.



> That said, even if you were to run it under aio_wq, the argument remains
> the same: do you really want to add potentially long lasting sleeps onto
> a work queue that was designed to service fast i/o requests?

No, and if you scroll back, you'll notice that mine was a question about 
the direct ->f_count manipulation/optimization. We can do the same 
optimization for eventfd, although both will still fall back to the aio 
kthread if we're holding the last instance of the file* (which I guess we 
can consider an unlikely case).
Dunno how worthwhile it would be to add NCPU new thread(s) just to handle 
delayed fputs.


- Davide



^ permalink raw reply	[flat|nested] 49+ messages in thread

* [patch] eventfd - remove fput() call from possible IRQ context
  2009-03-14 14:32                           ` Davide Libenzi
@ 2009-03-15  1:36                             ` Davide Libenzi
  2009-03-15 17:44                               ` Benjamin LaHaise
  0 siblings, 1 reply; 49+ messages in thread
From: Davide Libenzi @ 2009-03-15  1:36 UTC (permalink / raw)
  To: Linux Kernel Mailing List
  Cc: Trond Myklebust, Andrew Morton, Eric Dumazet, linux-aio,
	zach.brown, Benjamin LaHaise

The following patch removes a possible source of an fput() call from inside 
IRQ context. Like Eric, I wasn't able to reproduce an fput() call from IRQ 
context, but conceptually the bug is there.
This patch applies to ->ki_eventfd an optimization similar to the one we 
already do on ->ki_filp. Playing with ->f_count directly is not pretty in 
general, but the alternative here would be to add a brand new delayed fput() 
infrastructure, which I'm not sure is worth it.


Signed-off-by: Davide Libenzi <davidel@xmailserver.org>


- Davide


---
 fs/aio.c |   32 ++++++++++++++++++++++++--------
 1 file changed, 24 insertions(+), 8 deletions(-)

Index: linux-2.6.mod/fs/aio.c
===================================================================
--- linux-2.6.mod.orig/fs/aio.c	2009-03-13 18:19:34.000000000 -0700
+++ linux-2.6.mod/fs/aio.c	2009-03-13 18:32:25.000000000 -0700
@@ -485,8 +485,6 @@ static inline void really_put_req(struct
 {
 	assert_spin_locked(&ctx->ctx_lock);
 
-	if (!IS_ERR(req->ki_eventfd))
-		fput(req->ki_eventfd);
 	if (req->ki_dtor)
 		req->ki_dtor(req);
 	if (req->ki_iovec != &req->ki_inline_vec)
@@ -508,8 +506,11 @@ static void aio_fput_routine(struct work
 		list_del(&req->ki_list);
 		spin_unlock_irq(&fput_lock);
 
-		/* Complete the fput */
-		__fput(req->ki_filp);
+		/* Complete the fput(s) */
+		if (req->ki_filp != NULL)
+			__fput(req->ki_filp);
+		if (!IS_ERR(req->ki_eventfd))
+			__fput(req->ki_eventfd);
 
 		/* Link the iocb into the context's free list */
 		spin_lock_irq(&ctx->ctx_lock);
@@ -527,12 +528,14 @@ static void aio_fput_routine(struct work
  */
 static int __aio_put_req(struct kioctx *ctx, struct kiocb *req)
 {
+	int schedule_putreq = 0;
+
 	dprintk(KERN_DEBUG "aio_put(%p): f_count=%ld\n",
 		req, atomic_long_read(&req->ki_filp->f_count));
 
 	assert_spin_locked(&ctx->ctx_lock);
 
-	req->ki_users --;
+	req->ki_users--;
 	BUG_ON(req->ki_users < 0);
 	if (likely(req->ki_users))
 		return 0;
@@ -540,10 +543,23 @@ static int __aio_put_req(struct kioctx *
 	req->ki_cancel = NULL;
 	req->ki_retry = NULL;
 
-	/* Must be done under the lock to serialise against cancellation.
-	 * Call this aio_fput as it duplicates fput via the fput_work.
+	/*
+	 * Try to optimize the aio and eventfd file* puts, by avoiding to
+	 * schedule work in case it is not __fput() time. In normal cases,
+	 * we wouldn't be holding the last reference to the file*, so
+	 * this function will be executed w/out any aio kthread wakeup.
 	 */
-	if (unlikely(atomic_long_dec_and_test(&req->ki_filp->f_count))) {
+	if (unlikely(atomic_long_dec_and_test(&req->ki_filp->f_count)))
+		schedule_putreq++;
+	else
+		req->ki_filp = NULL;
+	if (unlikely(!IS_ERR(req->ki_eventfd))) {
+		if (unlikely(atomic_long_dec_and_test(&req->ki_eventfd->f_count)))
+			schedule_putreq++;
+		else
+			req->ki_eventfd = ERR_PTR(-EINVAL);
+	}
+	if (unlikely(schedule_putreq)) {
 		get_ioctx(ctx);
 		spin_lock(&fput_lock);
 		list_add(&req->ki_list, &fput_head);

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [patch] eventfd - remove fput() call from possible IRQ context
  2009-03-15  1:36                             ` [patch] eventfd - remove fput() call from possible IRQ context Davide Libenzi
@ 2009-03-15 17:44                               ` Benjamin LaHaise
  2009-03-15 20:08                                 ` [patch] eventfd - remove fput() call from possible IRQ context (2nd rev) Davide Libenzi
  0 siblings, 1 reply; 49+ messages in thread
From: Benjamin LaHaise @ 2009-03-15 17:44 UTC (permalink / raw)
  To: Davide Libenzi
  Cc: Linux Kernel Mailing List, Trond Myklebust, Andrew Morton,
	Eric Dumazet, linux-aio, zach.brown

On Sat, Mar 14, 2009 at 06:36:34PM -0700, Davide Libenzi wrote:
> The following patch removes a possible source of an fput() call from inside 
> IRQ context. Like Eric, I wasn't able to reproduce an fput() call from IRQ 
> context, but conceptually the bug is there.
> This patch applies to ->ki_eventfd an optimization similar to the one we 
> already do on ->ki_filp. Playing with ->f_count directly is not pretty in 
> general, but the alternative here would be to add a brand new delayed fput() 
> infrastructure, which I'm not sure is worth it.

This looks reasonably sane; the only concern I have with it is that I think 
it logically makes more sense to use the same convention for ki_filp and 
ki_eventfd, as the difference between IS_ERR and checking for NULL is a bit 
confusing.  Aside from that, it looks like it should fix the problem 
correctly.

		-ben

^ permalink raw reply	[flat|nested] 49+ messages in thread

* [patch] eventfd - remove fput() call from possible IRQ context (2nd rev)
  2009-03-15 17:44                               ` Benjamin LaHaise
@ 2009-03-15 20:08                                 ` Davide Libenzi
  2009-03-16 17:25                                   ` Jamie Lokier
  2009-03-18 14:22                                   ` Jeff Moyer
  0 siblings, 2 replies; 49+ messages in thread
From: Davide Libenzi @ 2009-03-15 20:08 UTC (permalink / raw)
  To: Linux Kernel Mailing List
  Cc: Benjamin LaHaise, Trond Myklebust, Andrew Morton, Eric Dumazet,
	linux-aio, zach.brown

The following patch removes a possible source of an fput() call from inside 
IRQ context. Like Eric, I wasn't able to reproduce an fput() call from IRQ 
context, but conceptually the bug is there.
This patch applies to ->ki_eventfd an optimization similar to the one we 
already do on ->ki_filp. Playing with ->f_count directly is not pretty in 
general, but the alternative here would be to add a brand new delayed fput() 
infrastructure, which I'm not sure is worth it.



On Sun, 15 Mar 2009, Benjamin LaHaise wrote:

> This looks reasonably sane, the only concern I have with it is that I think 
> it logically makes more sense to use the same convention for fi_filp and 
> ki_eventfd, as the different in IS_ERR vs checking for NULL is a bit 
> confusing.  Aside from that, it looks like it should fix the problem 
> correctly.

Makes sense.



Signed-off-by: Davide Libenzi <davidel@xmailserver.org>


- Davide


---
 fs/aio.c |   37 +++++++++++++++++++++++++++----------
 1 file changed, 27 insertions(+), 10 deletions(-)

Index: linux-2.6.mod/fs/aio.c
===================================================================
--- linux-2.6.mod.orig/fs/aio.c	2009-03-14 09:24:12.000000000 -0700
+++ linux-2.6.mod/fs/aio.c	2009-03-15 12:54:10.000000000 -0700
@@ -443,7 +443,7 @@ static struct kiocb *__aio_get_req(struc
 	req->private = NULL;
 	req->ki_iovec = NULL;
 	INIT_LIST_HEAD(&req->ki_run_list);
-	req->ki_eventfd = ERR_PTR(-EINVAL);
+	req->ki_eventfd = NULL;
 
 	/* Check if the completion queue has enough free space to
 	 * accept an event from this io.
@@ -485,8 +485,6 @@ static inline void really_put_req(struct
 {
 	assert_spin_locked(&ctx->ctx_lock);
 
-	if (!IS_ERR(req->ki_eventfd))
-		fput(req->ki_eventfd);
 	if (req->ki_dtor)
 		req->ki_dtor(req);
 	if (req->ki_iovec != &req->ki_inline_vec)
@@ -508,8 +506,11 @@ static void aio_fput_routine(struct work
 		list_del(&req->ki_list);
 		spin_unlock_irq(&fput_lock);
 
-		/* Complete the fput */
-		__fput(req->ki_filp);
+		/* Complete the fput(s) */
+		if (req->ki_filp != NULL)
+			__fput(req->ki_filp);
+		if (req->ki_eventfd != NULL)
+			__fput(req->ki_eventfd);
 
 		/* Link the iocb into the context's free list */
 		spin_lock_irq(&ctx->ctx_lock);
@@ -527,12 +528,14 @@ static void aio_fput_routine(struct work
  */
 static int __aio_put_req(struct kioctx *ctx, struct kiocb *req)
 {
+	int schedule_putreq = 0;
+
 	dprintk(KERN_DEBUG "aio_put(%p): f_count=%ld\n",
 		req, atomic_long_read(&req->ki_filp->f_count));
 
 	assert_spin_locked(&ctx->ctx_lock);
 
-	req->ki_users --;
+	req->ki_users--;
 	BUG_ON(req->ki_users < 0);
 	if (likely(req->ki_users))
 		return 0;
@@ -540,10 +543,23 @@ static int __aio_put_req(struct kioctx *
 	req->ki_cancel = NULL;
 	req->ki_retry = NULL;
 
-	/* Must be done under the lock to serialise against cancellation.
-	 * Call this aio_fput as it duplicates fput via the fput_work.
+	/*
+	 * Try to optimize the aio and eventfd file* puts, by avoiding to
+	 * schedule work in case it is not __fput() time. In normal cases,
+	 * we wouldn not be holding the last reference to the file*, so
+	 * this function will be executed w/out any aio kthread wakeup.
 	 */
-	if (unlikely(atomic_long_dec_and_test(&req->ki_filp->f_count))) {
+	if (unlikely(atomic_long_dec_and_test(&req->ki_filp->f_count)))
+		schedule_putreq++;
+	else
+		req->ki_filp = NULL;
+	if (unlikely(req->ki_eventfd != NULL)) {
+		if (unlikely(atomic_long_dec_and_test(&req->ki_eventfd->f_count)))
+			schedule_putreq++;
+		else
+			req->ki_eventfd = NULL;
+	}
+	if (unlikely(schedule_putreq)) {
 		get_ioctx(ctx);
 		spin_lock(&fput_lock);
 		list_add(&req->ki_list, &fput_head);
@@ -1009,7 +1025,7 @@ int aio_complete(struct kiocb *iocb, lon
 	 * eventfd. The eventfd_signal() function is safe to be called
 	 * from IRQ context.
 	 */
-	if (!IS_ERR(iocb->ki_eventfd))
+	if (iocb->ki_eventfd != NULL)
 		eventfd_signal(iocb->ki_eventfd, 1);
 
 put_rq:
@@ -1608,6 +1624,7 @@ static int io_submit_one(struct kioctx *
 		req->ki_eventfd = eventfd_fget((int) iocb->aio_resfd);
 		if (IS_ERR(req->ki_eventfd)) {
 			ret = PTR_ERR(req->ki_eventfd);
+			req->ki_eventfd = NULL;
 			goto out_put_req;
 		}
 	}


* Re: [patch] eventfd - remove fput() call from possible IRQ context (2nd rev)
  2009-03-15 20:08                                 ` [patch] eventfd - remove fput() call from possible IRQ context (2nd rev) Davide Libenzi
@ 2009-03-16 17:25                                   ` Jamie Lokier
  2009-03-16 18:36                                     ` Davide Libenzi
  2009-03-18 14:22                                   ` Jeff Moyer
  1 sibling, 1 reply; 49+ messages in thread
From: Jamie Lokier @ 2009-03-16 17:25 UTC (permalink / raw)
  To: Davide Libenzi
  Cc: Linux Kernel Mailing List, Benjamin LaHaise, Trond Myklebust,
	Andrew Morton, Eric Dumazet, linux-aio, zach.brown

Davide Libenzi wrote:
> +	if (unlikely(req->ki_eventfd != NULL)) {

Is this really unlikely?  Having an eventfd seems to be the modern,
cool, recommended way to do AIO.

The other unlikelies look ok.
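
(For context, the usual AIO+eventfd pattern, as a rough sketch that assumes
efd and io_ctx are set up as in Jeff's test program and that the eventfd is
what an event loop actually waits on, is to read the counter and then reap
that many completions:

	uint64_t nready;
	struct io_event events[16];

	/* aio_complete() does eventfd_signal(ki_eventfd, 1) per completion,
	 * so the eventfd counter is the number of events ready to reap. */
	if (read(efd, &nready, sizeof(nready)) == sizeof(nready)) {
		if (nready > 16)
			nready = 16;
		io_getevents(io_ctx, nready, nready, events, NULL);
	}

which is what makes it attractive for epoll-driven programs.)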

-- Jamie


* Re: [patch] eventfd - remove fput() call from possible IRQ context (2nd rev)
  2009-03-16 17:25                                   ` Jamie Lokier
@ 2009-03-16 18:36                                     ` Davide Libenzi
  0 siblings, 0 replies; 49+ messages in thread
From: Davide Libenzi @ 2009-03-16 18:36 UTC (permalink / raw)
  To: Jamie Lokier
  Cc: Linux Kernel Mailing List, Benjamin LaHaise, Trond Myklebust,
	Andrew Morton, Eric Dumazet, linux-aio, zach.brown

On Mon, 16 Mar 2009, Jamie Lokier wrote:

> Davide Libenzi wrote:
> > +	if (unlikely(req->ki_eventfd != NULL)) {
> 
> Is this really unlikely?  Having an eventfd seems to be the modern,
> cool, recommended way to do AIO.

I don't have statistics at hand, but I guess that nowadays AIO+eventfd is 
still not very popular. It doesn't make much of a difference either way, I 
guess, so we could drop the unlikely().
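
(unlikely() is only a branch-prediction hint to the compiler; in the kernel
it is essentially

	#define unlikely(x)	__builtin_expect(!!(x), 0)

so keeping or dropping it only changes the generated branch layout, never
the behaviour.)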



- Davide




* Re: [patch] eventfd - remove fput() call from possible IRQ context (2nd rev)
  2009-03-15 20:08                                 ` [patch] eventfd - remove fput() call from possible IRQ context (2nd rev) Davide Libenzi
  2009-03-16 17:25                                   ` Jamie Lokier
@ 2009-03-18 14:22                                   ` Jeff Moyer
  2009-03-18 14:46                                     ` Davide Libenzi
                                                       ` (2 more replies)
  1 sibling, 3 replies; 49+ messages in thread
From: Jeff Moyer @ 2009-03-18 14:22 UTC (permalink / raw)
  To: Davide Libenzi
  Cc: Linux Kernel Mailing List, Benjamin LaHaise, Trond Myklebust,
	Andrew Morton, Eric Dumazet, linux-aio, zach.brown

Davide Libenzi <davidel@xmailserver.org> writes:

> The following patch remove a possible source of fput() call from inside 
> IRQ context. Myself, like Eric, wasn't able to reproduce an fput() call 
> from IRQ context, but conceptually the bug is there.

I've attached a test program which can reproduce the fput call in
interrupt context.  It's a modified version of the eventfd test that
Rusty wrote for the libaio test harness.  I verified that fput was in
fact being called in interrupt context by using systemtap to print out
the "thead_indent" of fput calls, and observing a "swapper(0)" in the
output.  After applying your patch, I confirmed that __fput is no longer
called from interrupt context.  Strangely enough, I never did get any
output from the might_sleep in __fput.  I can't explain that.

I have some minor comments inlined below.

> This patch adds an optimization similar to the one we already do on 
> ->ki_filp, on ->ki_eventfd. Playing with ->f_count directly is not pretty 
> in general, but the alternative here would be to add a brand new delayed 
> fput() infrastructure, that I'm not sure is worth it.
>
> On Sun, 15 Mar 2009, Benjamin LaHaise wrote:
>
>> This looks reasonably sane, the only concern I have with it is that I think 
>> it logically makes more sense to use the same convention for fi_filp and 
>> ki_eventfd, as the different in IS_ERR vs checking for NULL is a bit 
>> confusing.  Aside from that, it looks like it should fix the problem 
>> correctly.
>
> Makes sense.
>
> Signed-off-by: Davide Libenzi <davidel@xmailserver.org>
>
>
> - Davide
>
>
> ---
>  fs/aio.c |   37 +++++++++++++++++++++++++++----------
>  1 file changed, 27 insertions(+), 10 deletions(-)
>
> Index: linux-2.6.mod/fs/aio.c
> ===================================================================
> --- linux-2.6.mod.orig/fs/aio.c	2009-03-14 09:24:12.000000000 -0700
> +++ linux-2.6.mod/fs/aio.c	2009-03-15 12:54:10.000000000 -0700
...
> @@ -527,12 +528,14 @@ static void aio_fput_routine(struct work
>   */
>  static int __aio_put_req(struct kioctx *ctx, struct kiocb *req)
>  {
> +	int schedule_putreq = 0;
> +
>  	dprintk(KERN_DEBUG "aio_put(%p): f_count=%ld\n",
>  		req, atomic_long_read(&req->ki_filp->f_count));
>  
>  	assert_spin_locked(&ctx->ctx_lock);
>  
> -	req->ki_users --;
> +	req->ki_users--;
>  	BUG_ON(req->ki_users < 0);
>  	if (likely(req->ki_users))
>  		return 0;
> @@ -540,10 +543,23 @@ static int __aio_put_req(struct kioctx *
>  	req->ki_cancel = NULL;
>  	req->ki_retry = NULL;
>  
> -	/* Must be done under the lock to serialise against cancellation.
> -	 * Call this aio_fput as it duplicates fput via the fput_work.
> +	/*
> +	 * Try to optimize the aio and eventfd file* puts, by avoiding to
> +	 * schedule work in case it is not __fput() time. In normal cases,
> +	 * we wouldn not be holding the last reference to the file*, so
              ^^^^^^^^^^
tyop

> +	 * this function will be executed w/out any aio kthread wakeup.
>  	 */
> -	if (unlikely(atomic_long_dec_and_test(&req->ki_filp->f_count))) {
> +	if (unlikely(atomic_long_dec_and_test(&req->ki_filp->f_count)))
> +		schedule_putreq++;
> +	else
> +		req->ki_filp = NULL;
> +	if (unlikely(req->ki_eventfd != NULL)) {
> +		if (unlikely(atomic_long_dec_and_test(&req->ki_eventfd->f_count)))
> +			schedule_putreq++;
> +		else
> +			req->ki_eventfd = NULL;
> +	}

I agree with Jamie that you should get rid of the unlikely.

Thanks for taking care of this, Davide.

Signed-off-by: Jeff Moyer <jmoyer@redhat.com>

Cheers,
Jeff

#define _GNU_SOURCE
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <fcntl.h>
#include <string.h>
#include <sys/types.h>
#include <errno.h>
#include <assert.h>
#include <sys/eventfd.h>
#include <libaio.h>

int
main(int argc, char **argv)
{
#define SIZE	(256*1024*1024)
	char *buf;
	struct io_event io_event;
	struct iocb iocb;
	struct iocb *iocbs[] = { &iocb };
	int rwfd, efd;
	int res;
	io_context_t	io_ctx;

	efd = eventfd(0, 0);
	if (efd < 0) {
		perror("eventfd");
		exit(1);
	}

	rwfd = open("rwfile", O_RDWR|O_DIRECT);		assert(rwfd != -1);
	if (posix_memalign((void **)&buf, getpagesize(), SIZE) < 0) {
		perror("posix_memalign");
		exit(1);
	}
	memset(buf, 0x42, SIZE);

	/* Write test. */
	res = io_queue_init(1024, &io_ctx);		assert(res == 0);
	io_prep_pwrite(&iocb, rwfd, buf, SIZE, 0);
	io_set_eventfd(&iocb, efd);
	res = io_submit(io_ctx, 1, iocbs);		assert(res == 1);

	/* Now close the eventfd so that AIO has the last reference */
	close(efd);

	/* Keep this process around so that the aio subsystem does not hold
	 * the last reference on the rwfd, otherwise the really_put_req will
	 * be called from process context */
	res = io_getevents(io_ctx, 1, 1, &io_event, NULL);
	if (res != 1) {
		if (res < 0) {
			errno = -res;
			perror("io_getevents");
		} else
			printf("io_getevents did not return 1 event after "
			       "closing eventfd\n");
		exit(1);
	}
	assert(io_event.res == SIZE);
	printf("eventfd write test [SUCCESS]\n");

	return 0;
}
/*
 * Local variables:
 *   c-basic-offset: 8
 *   compile-command: "gcc -o eventfd-in-irq eventfd-in-irq.c -laio -g3"
 * End:
 */


* Re: [patch] eventfd - remove fput() call from possible IRQ context (2nd rev)
  2009-03-18 14:22                                   ` Jeff Moyer
@ 2009-03-18 14:46                                     ` Davide Libenzi
  2009-03-18 14:55                                     ` Eric Dumazet
  2009-03-18 17:25                                     ` [patch] eventfd - remove fput() call from possible IRQ context (3rd rev) Davide Libenzi
  2 siblings, 0 replies; 49+ messages in thread
From: Davide Libenzi @ 2009-03-18 14:46 UTC (permalink / raw)
  To: Jeff Moyer
  Cc: Linux Kernel Mailing List, Benjamin LaHaise, Trond Myklebust,
	Andrew Morton, Eric Dumazet, linux-aio, zach.brown

On Wed, 18 Mar 2009, Jeff Moyer wrote:

> I agree with Jamie that you should get rid of the unlikely.

Ok, will repost today with typo and unlikely fixed.


- Davide




* Re: [patch] eventfd - remove fput() call from possible IRQ context (2nd rev)
  2009-03-18 14:22                                   ` Jeff Moyer
  2009-03-18 14:46                                     ` Davide Libenzi
@ 2009-03-18 14:55                                     ` Eric Dumazet
  2009-03-18 15:25                                       ` Jeff Moyer
  2009-03-18 17:25                                     ` [patch] eventfd - remove fput() call from possible IRQ context (3rd rev) Davide Libenzi
  2 siblings, 1 reply; 49+ messages in thread
From: Eric Dumazet @ 2009-03-18 14:55 UTC (permalink / raw)
  To: Jeff Moyer
  Cc: Davide Libenzi, Linux Kernel Mailing List, Benjamin LaHaise,
	Trond Myklebust, Andrew Morton, linux-aio, zach.brown

Jeff Moyer a écrit :
> Davide Libenzi <davidel@xmailserver.org> writes:
> 
>> The following patch remove a possible source of fput() call from inside 
>> IRQ context. Myself, like Eric, wasn't able to reproduce an fput() call 
>> from IRQ context, but conceptually the bug is there.
> 
> I've attached a test program which can reproduce the fput call in
> interrupt context.  It's a modified version of the eventfd test that
> Rusty wrote for the libaio test harness.  I verified that fput was in
> fact being called in interrupt context by using systemtap to print out
> the "thead_indent" of fput calls, and observing a "swapper(0)" in the
> output.  After applying your patch, I confirmed that __fput is no longer
> called from interrupt context.  Strangely enough, I never did get any
> output from the might_sleep in __fput.  I can't explain that.
> 
> I have some minor comments inlined below.
> 
>> This patch adds an optimization similar to the one we already do on 
>> ->ki_filp, on ->ki_eventfd. Playing with ->f_count directly is not pretty 
>> in general, but the alternative here would be to add a brand new delayed 
>> fput() infrastructure, that I'm not sure is worth it.
>>
>> On Sun, 15 Mar 2009, Benjamin LaHaise wrote:
>>
>>> This looks reasonably sane, the only concern I have with it is that I think 
>>> it logically makes more sense to use the same convention for fi_filp and 
>>> ki_eventfd, as the different in IS_ERR vs checking for NULL is a bit 
>>> confusing.  Aside from that, it looks like it should fix the problem 
>>> correctly.
>> Makes sense.
>>
>> Signed-off-by: Davide Libenzi <davidel@xmailserver.org>
>>
>>
>> - Davide
>>
>>
>> ---
>>  fs/aio.c |   37 +++++++++++++++++++++++++++----------
>>  1 file changed, 27 insertions(+), 10 deletions(-)
>>
>> Index: linux-2.6.mod/fs/aio.c
>> ===================================================================
>> --- linux-2.6.mod.orig/fs/aio.c	2009-03-14 09:24:12.000000000 -0700
>> +++ linux-2.6.mod/fs/aio.c	2009-03-15 12:54:10.000000000 -0700
> ...
>> @@ -527,12 +528,14 @@ static void aio_fput_routine(struct work
>>   */
>>  static int __aio_put_req(struct kioctx *ctx, struct kiocb *req)
>>  {
>> +	int schedule_putreq = 0;
>> +
>>  	dprintk(KERN_DEBUG "aio_put(%p): f_count=%ld\n",
>>  		req, atomic_long_read(&req->ki_filp->f_count));
>>  
>>  	assert_spin_locked(&ctx->ctx_lock);
>>  
>> -	req->ki_users --;
>> +	req->ki_users--;
>>  	BUG_ON(req->ki_users < 0);
>>  	if (likely(req->ki_users))
>>  		return 0;
>> @@ -540,10 +543,23 @@ static int __aio_put_req(struct kioctx *
>>  	req->ki_cancel = NULL;
>>  	req->ki_retry = NULL;
>>  
>> -	/* Must be done under the lock to serialise against cancellation.
>> -	 * Call this aio_fput as it duplicates fput via the fput_work.
>> +	/*
>> +	 * Try to optimize the aio and eventfd file* puts, by avoiding to
>> +	 * schedule work in case it is not __fput() time. In normal cases,
>> +	 * we wouldn not be holding the last reference to the file*, so
>               ^^^^^^^^^^
> tyop
> 
>> +	 * this function will be executed w/out any aio kthread wakeup.
>>  	 */
>> -	if (unlikely(atomic_long_dec_and_test(&req->ki_filp->f_count))) {
>> +	if (unlikely(atomic_long_dec_and_test(&req->ki_filp->f_count)))
>> +		schedule_putreq++;
>> +	else
>> +		req->ki_filp = NULL;
>> +	if (unlikely(req->ki_eventfd != NULL)) {
>> +		if (unlikely(atomic_long_dec_and_test(&req->ki_eventfd->f_count)))
>> +			schedule_putreq++;
>> +		else
>> +			req->ki_eventfd = NULL;
>> +	}
> 
> I agree with Jamie that you should get rid of the unlikely.
> 
> Thanks for taking care of this, Davide.
> 
> Signed-off-by: Jeff Moyer <jmoyer@redhat.com>
> 
> Cheers,
> Jeff
> 
> #define _GNU_SOURCE
> #include <stdio.h>
> #include <stdlib.h>
> #include <unistd.h>
> #include <fcntl.h>
> #include <string.h>
> #include <sys/types.h>
> #include <errno.h>
> #include <assert.h>
> #include <sys/eventfd.h>
> #include <libaio.h>
> 
> int
> main(int argc, char **argv)
> {
> #define SIZE	(256*1024*1024)
> 	char *buf;
> 	struct io_event io_event;
> 	struct iocb iocb;
> 	struct iocb *iocbs[] = { &iocb };
> 	int rwfd, efd;
> 	int res;
> 	io_context_t	io_ctx;
> 
> 	efd = eventfd(0, 0);
> 	if (efd < 0) {
> 		perror("eventfd");
> 		exit(1);
> 	}
> 
> 	rwfd = open("rwfile", O_RDWR|O_DIRECT);		assert(rwfd != -1);
> 	if (posix_memalign((void **)&buf, getpagesize(), SIZE) < 0) {
> 		perror("posix_memalign");
> 		exit(1);
> 	}
> 	memset(buf, 0x42, SIZE);
> 
> 	/* Write test. */
> 	res = io_queue_init(1024, &io_ctx);		assert(res == 0);
> 	io_prep_pwrite(&iocb, rwfd, buf, SIZE, 0);
> 	io_set_eventfd(&iocb, efd);
> 	res = io_submit(io_ctx, 1, iocbs);		assert(res == 1);

Yes, but io_submit() is blocking, so your close(efd) will come after the release in fs/aio.c.

I suggest you start a thread just before io_submit() and give it this work:

void *thread_work(void *arg)
{
	usleep(10000);
	close(efd);
	return (void *)0;
}
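
Wired into the test program, that would look something like the sketch below
(the helper name is made up, efd is handed to the thread through its
argument, and the build needs -lpthread):

#include <pthread.h>

static void *close_efd_early(void *arg)
{
	int fd = *(int *)arg;

	usleep(10000);	/* give io_submit() a head start */
	close(fd);	/* drop the user's eventfd reference while the I/O is in flight */
	return NULL;
}

	/* ... in main(), around the submit: */
	pthread_t tid;

	pthread_create(&tid, NULL, close_efd_early, &efd);
	res = io_submit(io_ctx, 1, iocbs);		assert(res == 1);
	pthread_join(tid, NULL);
	/* the close(efd) that currently follows the submit can then go away */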

> 
> 	/* Now close the eventfd so that AIO has the last reference */
> 	close(efd);
> 
> 	/* Keep this process around so that the aio subsystem does not hold
> 	 * the last reference on the rwfd, otherwise the really_put_req will
> 	 * be called from process context */
> 	res = io_getevents(io_ctx, 1, 1, &io_event, NULL);
> 	if (res != 1) {
> 		if (res < 0) {
> 			errno = -res;
> 			perror("io_getevents");
> 		} else
> 			printf("io_getevents did not return 1 event after "
> 			       "closing eventfd\n");
> 		exit(1);
> 	}
> 	assert(io_event.res == SIZE);
> 	printf("eventfd write test [SUCCESS]\n");
> 
> 	return 0;
> }
> /*
>  * Local variables:
>  *   c-basic-offset: 8
>  *   compile-command: "gcc -o eventfd-in-irq eventfd-in-irq.c -laio -g3"
>  * End:
>  */
> 
> 




* Re: [patch] eventfd - remove fput() call from possible IRQ context (2nd rev)
  2009-03-18 14:55                                     ` Eric Dumazet
@ 2009-03-18 15:25                                       ` Jeff Moyer
  2009-03-18 15:43                                         ` Eric Dumazet
  0 siblings, 1 reply; 49+ messages in thread
From: Jeff Moyer @ 2009-03-18 15:25 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Davide Libenzi, Linux Kernel Mailing List, Benjamin LaHaise,
	Trond Myklebust, Andrew Morton, linux-aio, zach.brown

Eric Dumazet <dada1@cosmosbay.com> writes:


>> 	rwfd = open("rwfile", O_RDWR|O_DIRECT);		assert(rwfd != -1);
>> 	if (posix_memalign((void **)&buf, getpagesize(), SIZE) < 0) {
>> 		perror("posix_memalign");
>> 		exit(1);
>> 	}
>> 	memset(buf, 0x42, SIZE);
>> 
>> 	/* Write test. */
>> 	res = io_queue_init(1024, &io_ctx);		assert(res == 0);
>> 	io_prep_pwrite(&iocb, rwfd, buf, SIZE, 0);
>> 	io_set_eventfd(&iocb, efd);
>> 	res = io_submit(io_ctx, 1, iocbs);		assert(res == 1);
>
> yes but io_submit() is blocking. so your close(efd) will come after the release in fs/aio.c

I'm not sure why you think io_submit is blocking.  In my setup, I
preallocated the file, and the test code opens it with O_DIRECT.  So,
io_submit should return after the dio is issued, and the I/O size is
large enough that it should still be outstanding when io_submit returns.

Cheers,
Jeff


* Re: [patch] eventfd - remove fput() call from possible IRQ context (2nd rev)
  2009-03-18 15:25                                       ` Jeff Moyer
@ 2009-03-18 15:43                                         ` Eric Dumazet
  2009-03-18 16:13                                           ` Jeff Moyer
  0 siblings, 1 reply; 49+ messages in thread
From: Eric Dumazet @ 2009-03-18 15:43 UTC (permalink / raw)
  To: Jeff Moyer
  Cc: Davide Libenzi, Linux Kernel Mailing List, Benjamin LaHaise,
	Trond Myklebust, Andrew Morton, linux-aio, zach.brown

Jeff Moyer a écrit :
> Eric Dumazet <dada1@cosmosbay.com> writes:
> 
> 
>>> 	rwfd = open("rwfile", O_RDWR|O_DIRECT);		assert(rwfd != -1);
>>> 	if (posix_memalign((void **)&buf, getpagesize(), SIZE) < 0) {
>>> 		perror("posix_memalign");
>>> 		exit(1);
>>> 	}
>>> 	memset(buf, 0x42, SIZE);
>>>
>>> 	/* Write test. */
>>> 	res = io_queue_init(1024, &io_ctx);		assert(res == 0);
>>> 	io_prep_pwrite(&iocb, rwfd, buf, SIZE, 0);
>>> 	io_set_eventfd(&iocb, efd);
>>> 	res = io_submit(io_ctx, 1, iocbs);		assert(res == 1);
>> yes but io_submit() is blocking. so your close(efd) will come after the release in fs/aio.c
> 
> I'm not sure why you think io_submit is blocking.  In my setup, I
> preallocated the file, and the test code opens it with O_DIRECT.  So,
> io_submit should return after the dio is issued, and the I/O size is
> large enough that it should still be outstanding when io_submit returns.

Hmm... io_submit() is a blocking syscall; at least, that is how I understood fs/aio.c.

Then, using strace -tt -T on your program, I can confirm it is quite a long syscall (3.5 seconds,
about the time needed to write a 256 MB file on my disk ;) )

16:33:15.861301 eventfd(0)              = 3 <0.000010>
16:33:15.861344 open("rwfile", O_RDWR|O_DIRECT) = 4 <0.000010>
16:33:15.861444 mmap(NULL, 268443648, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7f08eec55000 <0.000008>
16:33:16.100040 io_setup(1024, {139676618563584}) = 0 <0.000047>
16:33:16.100149 io_submit(139676618563584, 1, {...}) = 1 <3.558085>
16:33:19.658292 io_getevents(139676618563584, 1, 1, {...}NULL) = 1 <0.258104>
16:33:19.916486 fstat(1, {st_mode=S_IFCHR|0600, st_rdev=makedev(136, 0), ...}) = 0 <0.000007>
16:33:19.916559 mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7f08ff3b9000 <0.000009>
16:33:19.916607 write(1, "eventfd write test [SUCCESS]\n"..., 29eventfd write test [SUCCESS]
) = 29 <0.000008>
16:33:19.916653 exit_group(0)           = ?




* Re: [patch] eventfd - remove fput() call from possible IRQ context (2nd rev)
  2009-03-18 15:43                                         ` Eric Dumazet
@ 2009-03-18 16:13                                           ` Jeff Moyer
  0 siblings, 0 replies; 49+ messages in thread
From: Jeff Moyer @ 2009-03-18 16:13 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Davide Libenzi, Linux Kernel Mailing List, Benjamin LaHaise,
	Trond Myklebust, Andrew Morton, linux-aio, zach.brown

Eric Dumazet <dada1@cosmosbay.com> writes:

> Jeff Moyer a écrit :
>> Eric Dumazet <dada1@cosmosbay.com> writes:
>> 
>> 
>>>> 	rwfd = open("rwfile", O_RDWR|O_DIRECT);		assert(rwfd != -1);
>>>> 	if (posix_memalign((void **)&buf, getpagesize(), SIZE) < 0) {
>>>> 		perror("posix_memalign");
>>>> 		exit(1);
>>>> 	}
>>>> 	memset(buf, 0x42, SIZE);
>>>>
>>>> 	/* Write test. */
>>>> 	res = io_queue_init(1024, &io_ctx);		assert(res == 0);
>>>> 	io_prep_pwrite(&iocb, rwfd, buf, SIZE, 0);
>>>> 	io_set_eventfd(&iocb, efd);
>>>> 	res = io_submit(io_ctx, 1, iocbs);		assert(res == 1);
>>> yes but io_submit() is blocking. so your close(efd) will come after the release in fs/aio.c
>> 
>> I'm not sure why you think io_submit is blocking.  In my setup, I
>> preallocated the file, and the test code opens it with O_DIRECT.  So,
>> io_submit should return after the dio is issued, and the I/O size is
>> large enough that it should still be outstanding when io_submit returns.
>
> Hmm.. io_submit() is a blocking syscall, this is how I understood fs/aio.c

Hi, Eric,

The whole point of io_submit is to allow you to submit I/O without
waiting for it.  There are known cases where io_submit will block, of
course, such as when we run out of request descriptors.  See the
io_submit.stp script for some examples.[1]

Now, I admit I was testing using an SSD, so I didn't actually notice the
time it took for the 256MB write (!!!).  I tried the reproducer I posted
on my F9 box, and here is the output I get:

BUG: sleeping function called from invalid context at fs/file_table.c:262
in_atomic():1, irqs_disabled():1
Pid: 0, comm: swapper Not tainted 2.6.27.15-78.2.23.fc9.x86_64 #1

Call Trace:
 <IRQ>  [<ffffffff8103892e>] __might_sleep+0xe7/0xec
 [<ffffffff810bfa86>] __fput+0x35/0x16d
 [<ffffffff810bfbd3>] fput+0x15/0x17
 [<ffffffff810d71bb>] really_put_req+0x34/0x9c
 [<ffffffff810d72f0>] __aio_put_req+0xcd/0xda
 [<ffffffff810d7f77>] aio_complete+0x15d/0x19f
 [<ffffffff810e7016>] dio_bio_end_aio+0x8e/0xa0
 [<ffffffff810e32ab>] bio_endio+0x2a/0x2c
 [<ffffffff8113beae>] req_bio_endio+0x9d/0xba
 [<ffffffff8113c073>] __end_that_request_first+0x1a8/0x2b5
 [<ffffffff8113cb89>] blk_end_io+0x2f/0xa9
 [<ffffffff8113cc2f>] blk_end_request+0xe/0x10
 [<ffffffffa005d30b>] scsi_end_request+0x30/0x90 [scsi_mod]
 [<ffffffffa005d9e9>] scsi_io_completion+0x1aa/0x3b3 [scsi_mod]
 [<ffffffffa0057658>] scsi_finish_command+0xde/0xe7 [scsi_mod]
 [<ffffffffa005de68>] scsi_softirq_done+0xe4/0xed [scsi_mod]
 [<ffffffff8113baa8>] blk_done_softirq+0x7e/0x8e
 [<ffffffff81045146>] __do_softirq+0x7e/0x10c
 [<ffffffff81011bfc>] call_softirq+0x1c/0x28
 [<ffffffff81012e06>] do_softirq+0x4d/0xb0
 [<ffffffff81044d1b>] irq_exit+0x4e/0x9d
 [<ffffffff81013122>] do_IRQ+0x147/0x169
 [<ffffffff81010963>] ret_from_intr+0x0/0x2e
 <EOI>  [<ffffffff810173a9>] ? mwait_idle+0x3e/0x4f
 [<ffffffff810173a0>] ? mwait_idle+0x35/0x4f
 [<ffffffff8100f2a7>] ? cpu_idle+0xb2/0x10b
 [<ffffffff812af35d>] ? rest_init+0x61/0x63

So, I think it is a valid reproducer as it stands.

> Then, using strace -tt -T on your program, I can confirm it is quite a long syscall (3.5 seconds,
> about time needed to write a 256 MB file on my disk ;) )

Did you preallocate the file?

Cheers,
Jeff

[1] http://sourceware.org/systemtap/wiki/ScriptsTools?action=AttachFile&do=view&target=io_submit.stp
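
(Jeff doesn't say how he preallocated the file; one way to set up rwfile
before running the reproducer, sketched under the assumption that simply
writing the whole file out once is enough, is:

	#include <fcntl.h>
	#include <stdlib.h>
	#include <unistd.h>

	int main(void)
	{
		const size_t chunk = 1024 * 1024;
		size_t left = 256 * chunk;	/* matches SIZE in the test */
		char *buf = calloc(1, chunk);
		int fd = open("rwfile", O_CREAT | O_WRONLY | O_TRUNC, 0644);

		if (fd < 0 || buf == NULL)
			return 1;
		while (left > 0) {
			if (write(fd, buf, chunk) != (ssize_t)chunk)
				return 1;
			left -= chunk;
		}
		fsync(fd);
		close(fd);
		return 0;
	}

so the O_DIRECT write in the test then overwrites already-allocated blocks.)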


* [patch] eventfd - remove fput() call from possible IRQ context (3rd rev)
  2009-03-18 14:22                                   ` Jeff Moyer
  2009-03-18 14:46                                     ` Davide Libenzi
  2009-03-18 14:55                                     ` Eric Dumazet
@ 2009-03-18 17:25                                     ` Davide Libenzi
  2009-03-18 17:34                                       ` Jeff Moyer
  2 siblings, 1 reply; 49+ messages in thread
From: Davide Libenzi @ 2009-03-18 17:25 UTC (permalink / raw)
  To: Linux Kernel Mailing List
  Cc: Benjamin LaHaise, Trond Myklebust, Andrew Morton, Eric Dumazet,
	linux-aio, Jeff Moyer, zach.brown

[-- Attachment #1: Type: TEXT/PLAIN, Size: 4283 bytes --]

The following patch removes a possible source of fput() calls from inside 
IRQ context. Like Eric, I wasn't able to reproduce an fput() call from IRQ 
context, but Jeff said he was able to, with the attached test program. 
Independently of this, the bug is conceptually there, so we might be better 
off fixing it.
This patch adds, on ->ki_eventfd, an optimization similar to the one we 
already do on ->ki_filp. Playing with ->f_count directly is not pretty in 
general, but the alternative here would be to add a brand new delayed 
fput() infrastructure, which I'm not sure is worth it.


On Sun, 15 Mar 2009, Benjamin LaHaise wrote:

> This looks reasonably sane, the only concern I have with it is that I think 
> it logically makes more sense to use the same convention for fi_filp and 
> ki_eventfd, as the different in IS_ERR vs checking for NULL is a bit 
> confusing.  Aside from that, it looks like it should fix the problem 
> correctly.

Makes sense.



Signed-off-by: Davide Libenzi <davidel@xmailserver.org>


- Davide


---
 fs/aio.c |   37 +++++++++++++++++++++++++++----------
 1 file changed, 27 insertions(+), 10 deletions(-)

Index: linux-2.6.mod/fs/aio.c
===================================================================
--- linux-2.6.mod.orig/fs/aio.c	2009-03-15 13:11:56.000000000 -0700
+++ linux-2.6.mod/fs/aio.c	2009-03-18 10:18:21.000000000 -0700
@@ -443,7 +443,7 @@ static struct kiocb *__aio_get_req(struc
 	req->private = NULL;
 	req->ki_iovec = NULL;
 	INIT_LIST_HEAD(&req->ki_run_list);
-	req->ki_eventfd = ERR_PTR(-EINVAL);
+	req->ki_eventfd = NULL;
 
 	/* Check if the completion queue has enough free space to
 	 * accept an event from this io.
@@ -485,8 +485,6 @@ static inline void really_put_req(struct
 {
 	assert_spin_locked(&ctx->ctx_lock);
 
-	if (!IS_ERR(req->ki_eventfd))
-		fput(req->ki_eventfd);
 	if (req->ki_dtor)
 		req->ki_dtor(req);
 	if (req->ki_iovec != &req->ki_inline_vec)
@@ -508,8 +506,11 @@ static void aio_fput_routine(struct work
 		list_del(&req->ki_list);
 		spin_unlock_irq(&fput_lock);
 
-		/* Complete the fput */
-		__fput(req->ki_filp);
+		/* Complete the fput(s) */
+		if (req->ki_filp != NULL)
+			__fput(req->ki_filp);
+		if (req->ki_eventfd != NULL)
+			__fput(req->ki_eventfd);
 
 		/* Link the iocb into the context's free list */
 		spin_lock_irq(&ctx->ctx_lock);
@@ -527,12 +528,14 @@ static void aio_fput_routine(struct work
  */
 static int __aio_put_req(struct kioctx *ctx, struct kiocb *req)
 {
+	int schedule_putreq = 0;
+
 	dprintk(KERN_DEBUG "aio_put(%p): f_count=%ld\n",
 		req, atomic_long_read(&req->ki_filp->f_count));
 
 	assert_spin_locked(&ctx->ctx_lock);
 
-	req->ki_users --;
+	req->ki_users--;
 	BUG_ON(req->ki_users < 0);
 	if (likely(req->ki_users))
 		return 0;
@@ -540,10 +543,23 @@ static int __aio_put_req(struct kioctx *
 	req->ki_cancel = NULL;
 	req->ki_retry = NULL;
 
-	/* Must be done under the lock to serialise against cancellation.
-	 * Call this aio_fput as it duplicates fput via the fput_work.
+	/*
+	 * Try to optimize the aio and eventfd file* puts, by avoiding to
+	 * schedule work in case it is not __fput() time. In normal cases,
+	 * we would not be holding the last reference to the file*, so
+	 * this function will be executed w/out any aio kthread wakeup.
 	 */
-	if (unlikely(atomic_long_dec_and_test(&req->ki_filp->f_count))) {
+	if (unlikely(atomic_long_dec_and_test(&req->ki_filp->f_count)))
+		schedule_putreq++;
+	else
+		req->ki_filp = NULL;
+	if (req->ki_eventfd != NULL) {
+		if (unlikely(atomic_long_dec_and_test(&req->ki_eventfd->f_count)))
+			schedule_putreq++;
+		else
+			req->ki_eventfd = NULL;
+	}
+	if (unlikely(schedule_putreq)) {
 		get_ioctx(ctx);
 		spin_lock(&fput_lock);
 		list_add(&req->ki_list, &fput_head);
@@ -1009,7 +1025,7 @@ int aio_complete(struct kiocb *iocb, lon
 	 * eventfd. The eventfd_signal() function is safe to be called
 	 * from IRQ context.
 	 */
-	if (!IS_ERR(iocb->ki_eventfd))
+	if (iocb->ki_eventfd != NULL)
 		eventfd_signal(iocb->ki_eventfd, 1);
 
 put_rq:
@@ -1608,6 +1624,7 @@ static int io_submit_one(struct kioctx *
 		req->ki_eventfd = eventfd_fget((int) iocb->aio_resfd);
 		if (IS_ERR(req->ki_eventfd)) {
 			ret = PTR_ERR(req->ki_eventfd);
+			req->ki_eventfd = NULL;
 			goto out_put_req;
 		}
 	}

[-- Attachment #2: Type: TEXT/x-csrc; name=eventfd-irqctx-trigger.c, Size: 1554 bytes --]


#define _GNU_SOURCE
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <fcntl.h>
#include <string.h>
#include <sys/types.h>
#include <errno.h>
#include <assert.h>
#include <sys/eventfd.h>
#include <libaio.h>

int
main(int argc, char **argv)
{
#define SIZE	(256*1024*1024)
	char *buf;
	struct io_event io_event;
	struct iocb iocb;
	struct iocb *iocbs[] = { &iocb };
	int rwfd, efd;
	int res;
	io_context_t	io_ctx;

	efd = eventfd(0, 0);
	if (efd < 0) {
		perror("eventfd");
		exit(1);
	}

	rwfd = open("rwfile", O_RDWR|O_DIRECT);		assert(rwfd != -1);
	if (posix_memalign((void **)&buf, getpagesize(), SIZE) < 0) {
		perror("posix_memalign");
		exit(1);
	}
	memset(buf, 0x42, SIZE);

	/* Write test. */
	res = io_queue_init(1024, &io_ctx);		assert(res == 0);
	io_prep_pwrite(&iocb, rwfd, buf, SIZE, 0);
	io_set_eventfd(&iocb, efd);
	res = io_submit(io_ctx, 1, iocbs);		assert(res == 1);

	/* Now close the eventfd so that AIO has the last reference */
	close(efd);

	/* Keep this process around so that the aio subsystem does not hold
	 * the last reference on the rwfd, otherwise the really_put_req will
	 * be called from process context */
	res = io_getevents(io_ctx, 1, 1, &io_event, NULL);
	if (res != 1) {
		if (res < 0) {
			errno = -res;
			perror("io_getevents");
		} else
			printf("io_getevents did not return 1 event after "
			       "closing eventfd\n");
		exit(1);
	}
	assert(io_event.res == SIZE);
	printf("eventfd write test [SUCCESS]\n");

	return 0;
}



* Re: [patch] eventfd - remove fput() call from possible IRQ context (3rd rev)
  2009-03-18 17:25                                     ` [patch] eventfd - remove fput() call from possible IRQ context (3rd rev) Davide Libenzi
@ 2009-03-18 17:34                                       ` Jeff Moyer
  0 siblings, 0 replies; 49+ messages in thread
From: Jeff Moyer @ 2009-03-18 17:34 UTC (permalink / raw)
  To: Davide Libenzi
  Cc: Linux Kernel Mailing List, Benjamin LaHaise, Trond Myklebust,
	Andrew Morton, Eric Dumazet, linux-aio, zach.brown

Davide Libenzi <davidel@xmailserver.org> writes:

> The following patch remove a possible source of fput() call from inside 
> IRQ context. Myself, like Eric, wasn't able to reproduce an fput() call 
> from IRQ context, but Jeff said he was able to, with the attached test 
> program. Independently from this, the bug is conceptually there, so we 
> might be better off fixing it.
> This patch adds an optimization similar to the one we already do on 
> ->ki_filp, on ->ki_eventfd. Playing with ->f_count directly is not pretty 
> in general, but the alternative here would be to add a brand new delayed 
> fput() infrastructure, that I'm not sure is worth it.
>
>
> On Sun, 15 Mar 2009, Benjamin LaHaise wrote:
>
>> This looks reasonably sane, the only concern I have with it is that I think 
>> it logically makes more sense to use the same convention for fi_filp and 
>> ki_eventfd, as the different in IS_ERR vs checking for NULL is a bit 
>> confusing.  Aside from that, it looks like it should fix the problem 
>> correctly.
>
> Makes sense.
>
>
>
> Signed-off-by: Davide Libenzi <davidel@xmailserver.org>

Signed-off-by: Jeff Moyer <jmoyer@redhat.com>


Thread overview: 49+ messages
2009-03-09 15:49 [patch] aio: remove aio-max-nr and instead use the memlock rlimit to limit the number of pages pinned for the aio completion ring Jeff Moyer
2009-03-09 15:54 ` [patch] factor out checks against the memlock rlimit Jeff Moyer
2009-03-09 15:59 ` [patch] man-pages: add documentation about the memlock implications of io_setup Jeff Moyer
2009-03-09 16:45   ` Michael Kerrisk
2009-03-09 16:48   ` Michael Kerrisk
2009-03-09 20:44     ` Jeff Moyer
2009-03-09 16:18 ` [patch] aio: remove aio-max-nr and instead use the memlock rlimit to limit the number of pages pinned for the aio completion ring Avi Kivity
2009-03-09 17:57   ` Jeff Moyer
2009-03-09 19:45     ` Avi Kivity
2009-03-09 20:36       ` Jamie Lokier
2009-03-10  8:36         ` Avi Kivity
2009-03-09 20:31     ` Eric Dumazet
2009-03-12  2:39       ` Eric Dumazet
2009-03-12  2:44         ` Benjamin LaHaise
2009-03-12  3:24           ` Eric Dumazet
2009-03-12  3:29             ` Benjamin LaHaise
2009-03-12  3:33               ` Eric Dumazet
2009-03-12  3:36                 ` Benjamin LaHaise
2009-03-12  3:40                   ` Eric Dumazet
2009-03-12  3:09         ` Eric Dumazet
2009-03-12  5:18           ` [PATCH] fs: fput() can be called from interrupt context Eric Dumazet
2009-03-12  5:42             ` [PATCH] aio: " Eric Dumazet
2009-03-12  5:47             ` [PATCH] fs: " Andrew Morton
2009-03-12  6:10               ` Eric Dumazet
2009-03-12  6:39                 ` Andrew Morton
2009-03-12 13:39                   ` Davide Libenzi
2009-03-13 22:34                     ` Davide Libenzi
2009-03-13 22:43                       ` Eric Dumazet
2009-03-13 23:28                     ` Trond Myklebust
2009-03-14  1:40                       ` Davide Libenzi
2009-03-14  4:02                         ` Trond Myklebust
2009-03-14 14:32                           ` Davide Libenzi
2009-03-15  1:36                             ` [patch] eventfd - remove fput() call from possible IRQ context Davide Libenzi
2009-03-15 17:44                               ` Benjamin LaHaise
2009-03-15 20:08                                 ` [patch] eventfd - remove fput() call from possible IRQ context (2nd rev) Davide Libenzi
2009-03-16 17:25                                   ` Jamie Lokier
2009-03-16 18:36                                     ` Davide Libenzi
2009-03-18 14:22                                   ` Jeff Moyer
2009-03-18 14:46                                     ` Davide Libenzi
2009-03-18 14:55                                     ` Eric Dumazet
2009-03-18 15:25                                       ` Jeff Moyer
2009-03-18 15:43                                         ` Eric Dumazet
2009-03-18 16:13                                           ` Jeff Moyer
2009-03-18 17:25                                     ` [patch] eventfd - remove fput() call from possible IRQ context (3rd rev) Davide Libenzi
2009-03-18 17:34                                       ` Jeff Moyer
2009-03-12 19:22                   ` [PATCH] fs: fput() can be called from interrupt context Eric Dumazet
2009-03-12 20:21                     ` Andrew Morton
2009-03-09 22:36 ` [patch] aio: remove aio-max-nr and instead use the memlock rlimit to limit the number of pages pinned for the aio completion ring Andrew Morton
2009-03-10 13:43   ` Jeff Moyer
