* [PATCH 00/13] aio: thread (work queue) based aio and new aio functionality
@ 2016-01-11 22:06 ` Benjamin LaHaise
  0 siblings, 0 replies; 133+ messages in thread
From: Benjamin LaHaise @ 2016-01-11 22:06 UTC (permalink / raw)
  To: linux-aio, linux-fsdevel, linux-kernel, linux-api, linux-mm
  Cc: Alexander Viro, Andrew Morton, Linus Torvalds

Hello all,

First off, sorry for the wide-reaching To and Cc, but this patch series
touches the core kernel and also reaches across subsystems a bit.  If
some of the people who read this can provide review feedback, I would
very much appreciate it.

This series introduces new AIO functionality to make use of kernel
threads (by way of queue_work()) to implement additional asynchronous
operations.  The work came about as the result of various tuning done to
the kernel for my employer (Solace Systems) that we ship in our
products.

First, the benefits: using kernel threads to implement AIO
functionality has a significant benefit for our application.  Compared
to a user space thread pool based AIO implementation, we see roughly a
25% performance improvement from this new kernel based AIO
functionality.  This comes about as a consequence of fewer context
switches, fewer transitions to/from userspace, and the ability to make
certain optimizations in the kernel that are otherwise impossible in
userspace (i.e., the new readahead functionality).

Now the downsides: when using queue_work(), code executes in the context
of a different task than the submitter of the operation.  This means
that there are significant security concerns if there are any bugs in
the code that sets up the appropriate security credentials and related
context in struct task.  There may well be DoS bugs in this
implementation which have yet to be discovered.

Given the benefits, I am of the opinion that this patch series is a
useful addition to the kernel.  Since this code will be experimental for
some period of time as the interactions with other subsystems are
reviewed and tested, I have implemented a config option to allow for
this code to be compiled out and a sysctl (fs.aio-auto-threads) that
must be explicitly set to 1 before this new functionality is available
to userspace.  Hopefully this is enough to address the security concerns
during the growing pains and allow other developers to safely explore
the new functionality.
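As a quick sketch of how a tester would turn the feature on at runtime
(assuming a kernel built with the new config option enabled):

```shell
# The new code paths stay dormant until this sysctl is set to 1.
# fs.aio-auto-threads comes from this series and defaults to 0.
sysctl -w fs.aio-auto-threads=1

# Confirm the setting took effect.
sysctl fs.aio-auto-threads
```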

Caveats: the existing O_DIRECT AIO code path is currently bypassed when
the new thread helpers are enabled.  I plan to do additional work in
this area, but the fact that the dio code can block under certain
conditions is not acceptable to the applications I am working on, as it
leads to starvation of other requests the system is processing.  That
said, this is what's ready today, and I hope that people can provide
feedback to help drive further improvements.

I will be posting further documentation and test cases later this week
for people to experiment with, but for those looking for a few test
programs to exercise the new functionality, there is a collection of
code at git://git.kvack.org/aio-testprogs.git/ .  Getting the code
cleaned up from the internal implementation to something that is in
reasonable condition for submission ended up taking longer than
expected.  Thankfully, this kernel cycle lines up with some internal QA
work, so there should be additional testing taking place over the next
couple of months.

Also, the libaio test harness has some bugs that the new functionality
revealed.  A version with fixes for those tests can be fetched from
git://git.kvack.org/~bcrl/libaio.git/ .  Wrappers for the new IOCB_CMD
types should be posted there by the end of the day.

Some notes on the new functionality: all operations are cancellable,
provided the kernel subsystem involved aborts operations when delivered
a SIGKILL.  This ensures that async operations on pipes and sockets are
cancelled when the process that issued the operations exits.  A couple
of the test programs exercise this functionality on pipes.

Signal handling is slightly impacted by this AIO functionality.
Specifically, the first patch in the series introduces a new helper,
io_send_sig(), that delivers a signal intended for the performer of an
I/O operation.  This is used to deliver signals like SIGXFSZ and
SIGPIPE.  It is a straightforward replacement of
send_sig(SIGXXX, current, 0) with io_send_sig(SIGXXX).

As always, comments, bug reports and feedback are appreciated.
Developers looking for a git pull can find one at
git://git.kvack.org/aio-next.git/ .  Cheers!

		-ben

Benjamin LaHaise (13):
  signals: distinguish signals sent due to i/o via io_send_sig()
  aio: add aio_get_mm() helper
  aio: for async operations, make the iter argument persistent
  signals: add and use aio_get_task() to direct signals sent via
    io_send_sig()
  fs: make do_loop_readv_writev() non-static
  aio: add queue_work() based threaded aio support
  aio: enabled thread based async fsync
  aio: add support for aio poll via aio thread helper
  aio: add support for async openat()
  aio: add async unlinkat functionality
  mm: enable __do_page_cache_readahead() to include present pages
  aio: add support for aio readahead
  aio: add support for aio renameat operation

 drivers/gpu/drm/drm_lock.c     |   2 +-
 drivers/gpu/drm/ttm/ttm_lock.c |   6 +-
 fs/aio.c                       | 727 ++++++++++++++++++++++++++++++++++++++---
 fs/attr.c                      |   2 +-
 fs/binfmt_flat.c               |   2 +-
 fs/fuse/dev.c                  |   2 +-
 fs/internal.h                  |   6 +
 fs/namei.c                     |   2 +-
 fs/pipe.c                      |   4 +-
 fs/read_write.c                |   5 +-
 fs/splice.c                    |   8 +-
 include/linux/aio.h            |   9 +
 include/linux/fs.h             |   3 +
 include/linux/sched.h          |   6 +
 include/uapi/linux/aio_abi.h   |  15 +-
 init/Kconfig                   |  13 +
 kernel/auditsc.c               |   6 +-
 kernel/signal.c                |  20 ++
 kernel/sysctl.c                |   9 +
 mm/filemap.c                   |   6 +-
 mm/internal.h                  |   4 +-
 mm/readahead.c                 |  13 +-
 net/atm/common.c               |   4 +-
 net/ax25/af_ax25.c             |   2 +-
 net/caif/caif_socket.c         |   2 +-
 net/core/stream.c              |   2 +-
 net/decnet/af_decnet.c         |   2 +-
 net/irda/af_irda.c             |   4 +-
 net/netrom/af_netrom.c         |   2 +-
 net/rose/af_rose.c             |   2 +-
 net/sctp/socket.c              |   2 +-
 net/unix/af_unix.c             |   4 +-
 net/x25/af_x25.c               |   2 +-
 33 files changed, 817 insertions(+), 81 deletions(-)

-- 
2.5.0


-- 
"Thought is the essence of where you are now."


* [PATCH 01/13] signals: distinguish signals sent due to i/o via io_send_sig()
  2016-01-11 22:06 ` Benjamin LaHaise
@ 2016-01-11 22:06   ` Benjamin LaHaise
  -1 siblings, 0 replies; 133+ messages in thread
From: Benjamin LaHaise @ 2016-01-11 22:06 UTC (permalink / raw)
  To: linux-aio, linux-fsdevel, linux-kernel, linux-api, linux-mm
  Cc: Alexander Viro, Andrew Morton, Linus Torvalds

In preparation for thread based aio support, make the callers of
send_sig() that are sending a signal as a direct consequence of a
read or write operation (typically for SIGPIPE or SIGXFSZ) use a
separate helper, io_send_sig().  This will make it possible for
the thread based aio operations to direct these signals to the
process that actually submitted the aio request.

Signed-off-by: Benjamin LaHaise <ben.lahaise@solacesystems.com>
Signed-off-by: Benjamin LaHaise <bcrl@kvack.org>
---
 drivers/gpu/drm/drm_lock.c     |  2 +-
 drivers/gpu/drm/ttm/ttm_lock.c |  6 +++---
 fs/attr.c                      |  2 +-
 fs/binfmt_flat.c               |  2 +-
 fs/fuse/dev.c                  |  2 +-
 fs/pipe.c                      |  4 ++--
 fs/splice.c                    |  8 ++++----
 include/linux/sched.h          |  1 +
 kernel/auditsc.c               |  6 +++---
 kernel/signal.c                | 14 ++++++++++++++
 mm/filemap.c                   |  6 ++++--
 net/atm/common.c               |  4 ++--
 net/ax25/af_ax25.c             |  2 +-
 net/caif/caif_socket.c         |  2 +-
 net/core/stream.c              |  2 +-
 net/decnet/af_decnet.c         |  2 +-
 net/irda/af_irda.c             |  4 ++--
 net/netrom/af_netrom.c         |  2 +-
 net/rose/af_rose.c             |  2 +-
 net/sctp/socket.c              |  2 +-
 net/unix/af_unix.c             |  4 ++--
 net/x25/af_x25.c               |  2 +-
 22 files changed, 49 insertions(+), 32 deletions(-)

diff --git a/drivers/gpu/drm/drm_lock.c b/drivers/gpu/drm/drm_lock.c
index daa2ff1..3565563 100644
--- a/drivers/gpu/drm/drm_lock.c
+++ b/drivers/gpu/drm/drm_lock.c
@@ -83,7 +83,7 @@ int drm_legacy_lock(struct drm_device *dev, void *data,
 		__set_current_state(TASK_INTERRUPTIBLE);
 		if (!master->lock.hw_lock) {
 			/* Device has been unregistered */
-			send_sig(SIGTERM, current, 0);
+			io_send_sig(SIGTERM);
 			ret = -EINTR;
 			break;
 		}
diff --git a/drivers/gpu/drm/ttm/ttm_lock.c b/drivers/gpu/drm/ttm/ttm_lock.c
index f154fb1..816be91 100644
--- a/drivers/gpu/drm/ttm/ttm_lock.c
+++ b/drivers/gpu/drm/ttm/ttm_lock.c
@@ -68,7 +68,7 @@ static bool __ttm_read_lock(struct ttm_lock *lock)
 
 	spin_lock(&lock->lock);
 	if (unlikely(lock->kill_takers)) {
-		send_sig(lock->signal, current, 0);
+		io_send_sig(lock->signal);
 		spin_unlock(&lock->lock);
 		return false;
 	}
@@ -101,7 +101,7 @@ static bool __ttm_read_trylock(struct ttm_lock *lock, bool *locked)
 
 	spin_lock(&lock->lock);
 	if (unlikely(lock->kill_takers)) {
-		send_sig(lock->signal, current, 0);
+		io_send_sig(lock->signal);
 		spin_unlock(&lock->lock);
 		return false;
 	}
@@ -151,7 +151,7 @@ static bool __ttm_write_lock(struct ttm_lock *lock)
 
 	spin_lock(&lock->lock);
 	if (unlikely(lock->kill_takers)) {
-		send_sig(lock->signal, current, 0);
+		io_send_sig(lock->signal);
 		spin_unlock(&lock->lock);
 		return false;
 	}
diff --git a/fs/attr.c b/fs/attr.c
index 6530ced..0c63049 100644
--- a/fs/attr.c
+++ b/fs/attr.c
@@ -118,7 +118,7 @@ int inode_newsize_ok(const struct inode *inode, loff_t offset)
 
 	return 0;
 out_sig:
-	send_sig(SIGXFSZ, current, 0);
+	io_send_sig(SIGXFSZ);
 out_big:
 	return -EFBIG;
 }
diff --git a/fs/binfmt_flat.c b/fs/binfmt_flat.c
index f723cd3..51cf839 100644
--- a/fs/binfmt_flat.c
+++ b/fs/binfmt_flat.c
@@ -373,7 +373,7 @@ calc_reloc(unsigned long r, struct lib_info *p, int curid, int internalp)
 
 failed:
 	printk(", killing %s!\n", current->comm);
-	send_sig(SIGSEGV, current, 0);
+	io_send_sig(SIGSEGV);
 
 	return RELOC_FAILED;
 }
diff --git a/fs/fuse/dev.c b/fs/fuse/dev.c
index ebb5e37..20ffc52 100644
--- a/fs/fuse/dev.c
+++ b/fs/fuse/dev.c
@@ -1391,7 +1391,7 @@ static ssize_t fuse_dev_splice_read(struct file *in, loff_t *ppos,
 	pipe_lock(pipe);
 
 	if (!pipe->readers) {
-		send_sig(SIGPIPE, current, 0);
+		io_send_sig(SIGPIPE);
 		if (!ret)
 			ret = -EPIPE;
 		goto out_unlock;
diff --git a/fs/pipe.c b/fs/pipe.c
index 42cf8dd..e55ed9a 100644
--- a/fs/pipe.c
+++ b/fs/pipe.c
@@ -351,7 +351,7 @@ pipe_write(struct kiocb *iocb, struct iov_iter *from)
 	__pipe_lock(pipe);
 
 	if (!pipe->readers) {
-		send_sig(SIGPIPE, current, 0);
+		io_send_sig(SIGPIPE);
 		ret = -EPIPE;
 		goto out;
 	}
@@ -386,7 +386,7 @@ pipe_write(struct kiocb *iocb, struct iov_iter *from)
 		int bufs;
 
 		if (!pipe->readers) {
-			send_sig(SIGPIPE, current, 0);
+			io_send_sig(SIGPIPE);
 			if (!ret)
 				ret = -EPIPE;
 			break;
diff --git a/fs/splice.c b/fs/splice.c
index 4cf700d..336db78 100644
--- a/fs/splice.c
+++ b/fs/splice.c
@@ -193,7 +193,7 @@ ssize_t splice_to_pipe(struct pipe_inode_info *pipe,
 
 	for (;;) {
 		if (!pipe->readers) {
-			send_sig(SIGPIPE, current, 0);
+			io_send_sig(SIGPIPE);
 			if (!ret)
 				ret = -EPIPE;
 			break;
@@ -1769,7 +1769,7 @@ static int opipe_prep(struct pipe_inode_info *pipe, unsigned int flags)
 
 	while (pipe->nrbufs >= pipe->buffers) {
 		if (!pipe->readers) {
-			send_sig(SIGPIPE, current, 0);
+			io_send_sig(SIGPIPE);
 			ret = -EPIPE;
 			break;
 		}
@@ -1820,7 +1820,7 @@ retry:
 
 	do {
 		if (!opipe->readers) {
-			send_sig(SIGPIPE, current, 0);
+			io_send_sig(SIGPIPE);
 			if (!ret)
 				ret = -EPIPE;
 			break;
@@ -1924,7 +1924,7 @@ static int link_pipe(struct pipe_inode_info *ipipe,
 
 	do {
 		if (!opipe->readers) {
-			send_sig(SIGPIPE, current, 0);
+			io_send_sig(SIGPIPE);
 			if (!ret)
 				ret = -EPIPE;
 			break;
diff --git a/include/linux/sched.h b/include/linux/sched.h
index edad7a4..6376d58 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -2502,6 +2502,7 @@ extern __must_check bool do_notify_parent(struct task_struct *, int);
 extern void __wake_up_parent(struct task_struct *p, struct task_struct *parent);
 extern void force_sig(int, struct task_struct *);
 extern int send_sig(int, struct task_struct *, int);
+extern int io_send_sig(int signal);
 extern int zap_other_threads(struct task_struct *p);
 extern struct sigqueue *sigqueue_alloc(void);
 extern void sigqueue_free(struct sigqueue *);
diff --git a/kernel/auditsc.c b/kernel/auditsc.c
index b86cc04..8b4a3ea 100644
--- a/kernel/auditsc.c
+++ b/kernel/auditsc.c
@@ -1025,7 +1025,7 @@ static int audit_log_single_execve_arg(struct audit_context *context,
 	 * any.
 	 */
 	if (WARN_ON_ONCE(len < 0 || len > MAX_ARG_STRLEN - 1)) {
-		send_sig(SIGKILL, current, 0);
+		io_send_sig(SIGKILL);
 		return -1;
 	}
 
@@ -1043,7 +1043,7 @@ static int audit_log_single_execve_arg(struct audit_context *context,
 		 */
 		if (ret) {
 			WARN_ON(1);
-			send_sig(SIGKILL, current, 0);
+			io_send_sig(SIGKILL);
 			return -1;
 		}
 		buf[to_send] = '\0';
@@ -1107,7 +1107,7 @@ static int audit_log_single_execve_arg(struct audit_context *context,
 			ret = 0;
 		if (ret) {
 			WARN_ON(1);
-			send_sig(SIGKILL, current, 0);
+			io_send_sig(SIGKILL);
 			return -1;
 		}
 		buf[to_send] = '\0';
diff --git a/kernel/signal.c b/kernel/signal.c
index f3f1f7a..7c14cb4 100644
--- a/kernel/signal.c
+++ b/kernel/signal.c
@@ -1422,6 +1422,20 @@ int send_sig_info(int sig, struct siginfo *info, struct task_struct *p)
 	return do_send_sig_info(sig, info, p, false);
 }
 
+/* io_send_sig: send a signal caused by an i/o operation
+ *
+ * Use this helper when a signal is being sent to the task that is responsible
 + * for the initiated operation.  Most commonly this is used to send signals
 + * like SIGPIPE or SIGXFSZ that are the result of attempting a read or write
+ * operation.  This is used by aio to direct a signal to the correct task in
+ * the case of async operations.
+ */
+int io_send_sig(int sig)
+{
+	return send_sig(sig, current, 0);
+}
+EXPORT_SYMBOL(io_send_sig);
+
 #define __si_special(priv) \
 	((priv) ? SEND_SIG_PRIV : SEND_SIG_NOINFO)
 
diff --git a/mm/filemap.c b/mm/filemap.c
index 1bb0076..089ccd85 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -2343,7 +2343,7 @@ inline ssize_t generic_write_checks(struct kiocb *iocb, struct iov_iter *from)
 
 	if (limit != RLIM_INFINITY) {
 		if (iocb->ki_pos >= limit) {
-			send_sig(SIGXFSZ, current, 0);
+			io_send_sig(SIGXFSZ);
 			return -EFBIG;
 		}
 		iov_iter_truncate(from, limit - (unsigned long)pos);
@@ -2354,8 +2354,10 @@ inline ssize_t generic_write_checks(struct kiocb *iocb, struct iov_iter *from)
 	 */
 	if (unlikely(pos + iov_iter_count(from) > MAX_NON_LFS &&
 				!(file->f_flags & O_LARGEFILE))) {
-		if (pos >= MAX_NON_LFS)
+		if (pos >= MAX_NON_LFS) {
+			io_send_sig(SIGXFSZ);
 			return -EFBIG;
+		}
 		iov_iter_truncate(from, MAX_NON_LFS - (unsigned long)pos);
 	}
 
diff --git a/net/atm/common.c b/net/atm/common.c
index 49a872d..3eef736 100644
--- a/net/atm/common.c
+++ b/net/atm/common.c
@@ -591,7 +591,7 @@ int vcc_sendmsg(struct socket *sock, struct msghdr *m, size_t size)
 	    test_bit(ATM_VF_CLOSE, &vcc->flags) ||
 	    !test_bit(ATM_VF_READY, &vcc->flags)) {
 		error = -EPIPE;
-		send_sig(SIGPIPE, current, 0);
+		io_send_sig(SIGPIPE);
 		goto out;
 	}
 	if (!size) {
@@ -620,7 +620,7 @@ int vcc_sendmsg(struct socket *sock, struct msghdr *m, size_t size)
 		    test_bit(ATM_VF_CLOSE, &vcc->flags) ||
 		    !test_bit(ATM_VF_READY, &vcc->flags)) {
 			error = -EPIPE;
-			send_sig(SIGPIPE, current, 0);
+			io_send_sig(SIGPIPE);
 			break;
 		}
 		prepare_to_wait(sk_sleep(sk), &wait, TASK_INTERRUPTIBLE);
diff --git a/net/ax25/af_ax25.c b/net/ax25/af_ax25.c
index fbd0acf..8dfd84c 100644
--- a/net/ax25/af_ax25.c
+++ b/net/ax25/af_ax25.c
@@ -1457,7 +1457,7 @@ static int ax25_sendmsg(struct socket *sock, struct msghdr *msg, size_t len)
 	}
 
 	if (sk->sk_shutdown & SEND_SHUTDOWN) {
-		send_sig(SIGPIPE, current, 0);
+		io_send_sig(SIGPIPE);
 		err = -EPIPE;
 		goto out;
 	}
diff --git a/net/caif/caif_socket.c b/net/caif/caif_socket.c
index aa209b1..ba8d8e2 100644
--- a/net/caif/caif_socket.c
+++ b/net/caif/caif_socket.c
@@ -663,7 +663,7 @@ static int caif_stream_sendmsg(struct socket *sock, struct msghdr *msg,
 
 pipe_err:
 	if (sent == 0 && !(msg->msg_flags&MSG_NOSIGNAL))
-		send_sig(SIGPIPE, current, 0);
+		io_send_sig(SIGPIPE);
 	err = -EPIPE;
 out_err:
 	return sent ? : err;
diff --git a/net/core/stream.c b/net/core/stream.c
index b96f7a7..6b24f6d 100644
--- a/net/core/stream.c
+++ b/net/core/stream.c
@@ -182,7 +182,7 @@ int sk_stream_error(struct sock *sk, int flags, int err)
 	if (err == -EPIPE)
 		err = sock_error(sk) ? : -EPIPE;
 	if (err == -EPIPE && !(flags & MSG_NOSIGNAL))
-		send_sig(SIGPIPE, current, 0);
+		io_send_sig(SIGPIPE);
 	return err;
 }
 EXPORT_SYMBOL(sk_stream_error);
diff --git a/net/decnet/af_decnet.c b/net/decnet/af_decnet.c
index 13d6b1a..47ca404 100644
--- a/net/decnet/af_decnet.c
+++ b/net/decnet/af_decnet.c
@@ -1954,7 +1954,7 @@ static int dn_sendmsg(struct socket *sock, struct msghdr *msg, size_t size)
 	if (sk->sk_shutdown & SEND_SHUTDOWN) {
 		err = -EPIPE;
 		if (!(flags & MSG_NOSIGNAL))
-			send_sig(SIGPIPE, current, 0);
+			io_send_sig(SIGPIPE);
 		goto out_err;
 	}
 
diff --git a/net/irda/af_irda.c b/net/irda/af_irda.c
index 923abd6..f9c6b55 100644
--- a/net/irda/af_irda.c
+++ b/net/irda/af_irda.c
@@ -1539,7 +1539,7 @@ static int irda_sendmsg_dgram(struct socket *sock, struct msghdr *msg,
 	lock_sock(sk);
 
 	if (sk->sk_shutdown & SEND_SHUTDOWN) {
-		send_sig(SIGPIPE, current, 0);
+		io_send_sig(SIGPIPE);
 		err = -EPIPE;
 		goto out;
 	}
@@ -1622,7 +1622,7 @@ static int irda_sendmsg_ultra(struct socket *sock, struct msghdr *msg,
 
 	err = -EPIPE;
 	if (sk->sk_shutdown & SEND_SHUTDOWN) {
-		send_sig(SIGPIPE, current, 0);
+		io_send_sig(SIGPIPE);
 		goto out;
 	}
 
diff --git a/net/netrom/af_netrom.c b/net/netrom/af_netrom.c
index ed212ff..b5eaecc 100644
--- a/net/netrom/af_netrom.c
+++ b/net/netrom/af_netrom.c
@@ -1044,7 +1044,7 @@ static int nr_sendmsg(struct socket *sock, struct msghdr *msg, size_t len)
 	}
 
 	if (sk->sk_shutdown & SEND_SHUTDOWN) {
-		send_sig(SIGPIPE, current, 0);
+		io_send_sig(SIGPIPE);
 		err = -EPIPE;
 		goto out;
 	}
diff --git a/net/rose/af_rose.c b/net/rose/af_rose.c
index 129d357..954725c 100644
--- a/net/rose/af_rose.c
+++ b/net/rose/af_rose.c
@@ -1065,7 +1065,7 @@ static int rose_sendmsg(struct socket *sock, struct msghdr *msg, size_t len)
 		return -EADDRNOTAVAIL;
 
 	if (sk->sk_shutdown & SEND_SHUTDOWN) {
-		send_sig(SIGPIPE, current, 0);
+		io_send_sig(SIGPIPE);
 		return -EPIPE;
 	}
 
diff --git a/net/sctp/socket.c b/net/sctp/socket.c
index ef1d90f..75bb437 100644
--- a/net/sctp/socket.c
+++ b/net/sctp/socket.c
@@ -1556,7 +1556,7 @@ static int sctp_error(struct sock *sk, int flags, int err)
 	if (err == -EPIPE)
 		err = sock_error(sk) ? : -EPIPE;
 	if (err == -EPIPE && !(flags & MSG_NOSIGNAL))
-		send_sig(SIGPIPE, current, 0);
+		io_send_sig(SIGPIPE);
 	return err;
 }
 
diff --git a/net/unix/af_unix.c b/net/unix/af_unix.c
index a4631477..a1d5cf8 100644
--- a/net/unix/af_unix.c
+++ b/net/unix/af_unix.c
@@ -1909,7 +1909,7 @@ pipe_err_free:
 	kfree_skb(skb);
 pipe_err:
 	if (sent == 0 && !(msg->msg_flags&MSG_NOSIGNAL))
-		send_sig(SIGPIPE, current, 0);
+		io_send_sig(SIGPIPE);
 	err = -EPIPE;
 out_err:
 	scm_destroy(&scm);
@@ -2026,7 +2026,7 @@ err_unlock:
 err:
 	kfree_skb(newskb);
 	if (send_sigpipe && !(flags & MSG_NOSIGNAL))
-		send_sig(SIGPIPE, current, 0);
+		io_send_sig(SIGPIPE);
 	if (!init_scm)
 		scm_destroy(&scm);
 	return err;
diff --git a/net/x25/af_x25.c b/net/x25/af_x25.c
index a750f33..102dd03 100644
--- a/net/x25/af_x25.c
+++ b/net/x25/af_x25.c
@@ -1103,7 +1103,7 @@ static int x25_sendmsg(struct socket *sock, struct msghdr *msg, size_t len)
 
 	rc = -EPIPE;
 	if (sk->sk_shutdown & SEND_SHUTDOWN) {
-		send_sig(SIGPIPE, current, 0);
+		io_send_sig(SIGPIPE);
 		goto out;
 	}
 
-- 
2.5.0


-- 
"Thought is the essence of where you are now."

 extern void force_sig(int, struct task_struct *);
 extern int send_sig(int, struct task_struct *, int);
+extern int io_send_sig(int signal);
 extern int zap_other_threads(struct task_struct *p);
 extern struct sigqueue *sigqueue_alloc(void);
 extern void sigqueue_free(struct sigqueue *);
diff --git a/kernel/auditsc.c b/kernel/auditsc.c
index b86cc04..8b4a3ea 100644
--- a/kernel/auditsc.c
+++ b/kernel/auditsc.c
@@ -1025,7 +1025,7 @@ static int audit_log_single_execve_arg(struct audit_context *context,
 	 * any.
 	 */
 	if (WARN_ON_ONCE(len < 0 || len > MAX_ARG_STRLEN - 1)) {
-		send_sig(SIGKILL, current, 0);
+		io_send_sig(SIGKILL);
 		return -1;
 	}
 
@@ -1043,7 +1043,7 @@ static int audit_log_single_execve_arg(struct audit_context *context,
 		 */
 		if (ret) {
 			WARN_ON(1);
-			send_sig(SIGKILL, current, 0);
+			io_send_sig(SIGKILL);
 			return -1;
 		}
 		buf[to_send] = '\0';
@@ -1107,7 +1107,7 @@ static int audit_log_single_execve_arg(struct audit_context *context,
 			ret = 0;
 		if (ret) {
 			WARN_ON(1);
-			send_sig(SIGKILL, current, 0);
+			io_send_sig(SIGKILL);
 			return -1;
 		}
 		buf[to_send] = '\0';
diff --git a/kernel/signal.c b/kernel/signal.c
index f3f1f7a..7c14cb4 100644
--- a/kernel/signal.c
+++ b/kernel/signal.c
@@ -1422,6 +1422,20 @@ int send_sig_info(int sig, struct siginfo *info, struct task_struct *p)
 	return do_send_sig_info(sig, info, p, false);
 }
 
+/* io_send_sig: send a signal caused by an i/o operation
+ *
+ * Use this helper when a signal is being sent to the task that is responsible
+ * for aer initiated operation.  Most commonly this is used to send signals
+ * like SIGPIPE or SIGXFS that are the result of attempting a read or write
+ * operation.  This is used by aio to direct a signal to the correct task in
+ * the case of async operations.
+ */
+int io_send_sig(int sig)
+{
+	return send_sig(sig, current, 0);
+}
+EXPORT_SYMBOL(io_send_sig);
+
 #define __si_special(priv) \
 	((priv) ? SEND_SIG_PRIV : SEND_SIG_NOINFO)
 
diff --git a/mm/filemap.c b/mm/filemap.c
index 1bb0076..089ccd85 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -2343,7 +2343,7 @@ inline ssize_t generic_write_checks(struct kiocb *iocb, struct iov_iter *from)
 
 	if (limit != RLIM_INFINITY) {
 		if (iocb->ki_pos >= limit) {
-			send_sig(SIGXFSZ, current, 0);
+			io_send_sig(SIGXFSZ);
 			return -EFBIG;
 		}
 		iov_iter_truncate(from, limit - (unsigned long)pos);
@@ -2354,8 +2354,10 @@ inline ssize_t generic_write_checks(struct kiocb *iocb, struct iov_iter *from)
 	 */
 	if (unlikely(pos + iov_iter_count(from) > MAX_NON_LFS &&
 				!(file->f_flags & O_LARGEFILE))) {
-		if (pos >= MAX_NON_LFS)
+		if (pos >= MAX_NON_LFS) {
+			io_send_sig(SIGXFSZ);
 			return -EFBIG;
+		}
 		iov_iter_truncate(from, MAX_NON_LFS - (unsigned long)pos);
 	}
 
diff --git a/net/atm/common.c b/net/atm/common.c
index 49a872d..3eef736 100644
--- a/net/atm/common.c
+++ b/net/atm/common.c
@@ -591,7 +591,7 @@ int vcc_sendmsg(struct socket *sock, struct msghdr *m, size_t size)
 	    test_bit(ATM_VF_CLOSE, &vcc->flags) ||
 	    !test_bit(ATM_VF_READY, &vcc->flags)) {
 		error = -EPIPE;
-		send_sig(SIGPIPE, current, 0);
+		io_send_sig(SIGPIPE);
 		goto out;
 	}
 	if (!size) {
@@ -620,7 +620,7 @@ int vcc_sendmsg(struct socket *sock, struct msghdr *m, size_t size)
 		    test_bit(ATM_VF_CLOSE, &vcc->flags) ||
 		    !test_bit(ATM_VF_READY, &vcc->flags)) {
 			error = -EPIPE;
-			send_sig(SIGPIPE, current, 0);
+			io_send_sig(SIGPIPE);
 			break;
 		}
 		prepare_to_wait(sk_sleep(sk), &wait, TASK_INTERRUPTIBLE);
diff --git a/net/ax25/af_ax25.c b/net/ax25/af_ax25.c
index fbd0acf..8dfd84c 100644
--- a/net/ax25/af_ax25.c
+++ b/net/ax25/af_ax25.c
@@ -1457,7 +1457,7 @@ static int ax25_sendmsg(struct socket *sock, struct msghdr *msg, size_t len)
 	}
 
 	if (sk->sk_shutdown & SEND_SHUTDOWN) {
-		send_sig(SIGPIPE, current, 0);
+		io_send_sig(SIGPIPE);
 		err = -EPIPE;
 		goto out;
 	}
diff --git a/net/caif/caif_socket.c b/net/caif/caif_socket.c
index aa209b1..ba8d8e2 100644
--- a/net/caif/caif_socket.c
+++ b/net/caif/caif_socket.c
@@ -663,7 +663,7 @@ static int caif_stream_sendmsg(struct socket *sock, struct msghdr *msg,
 
 pipe_err:
 	if (sent == 0 && !(msg->msg_flags&MSG_NOSIGNAL))
-		send_sig(SIGPIPE, current, 0);
+		io_send_sig(SIGPIPE);
 	err = -EPIPE;
 out_err:
 	return sent ? : err;
diff --git a/net/core/stream.c b/net/core/stream.c
index b96f7a7..6b24f6d 100644
--- a/net/core/stream.c
+++ b/net/core/stream.c
@@ -182,7 +182,7 @@ int sk_stream_error(struct sock *sk, int flags, int err)
 	if (err == -EPIPE)
 		err = sock_error(sk) ? : -EPIPE;
 	if (err == -EPIPE && !(flags & MSG_NOSIGNAL))
-		send_sig(SIGPIPE, current, 0);
+		io_send_sig(SIGPIPE);
 	return err;
 }
 EXPORT_SYMBOL(sk_stream_error);
diff --git a/net/decnet/af_decnet.c b/net/decnet/af_decnet.c
index 13d6b1a..47ca404 100644
--- a/net/decnet/af_decnet.c
+++ b/net/decnet/af_decnet.c
@@ -1954,7 +1954,7 @@ static int dn_sendmsg(struct socket *sock, struct msghdr *msg, size_t size)
 	if (sk->sk_shutdown & SEND_SHUTDOWN) {
 		err = -EPIPE;
 		if (!(flags & MSG_NOSIGNAL))
-			send_sig(SIGPIPE, current, 0);
+			io_send_sig(SIGPIPE);
 		goto out_err;
 	}
 
diff --git a/net/irda/af_irda.c b/net/irda/af_irda.c
index 923abd6..f9c6b55 100644
--- a/net/irda/af_irda.c
+++ b/net/irda/af_irda.c
@@ -1539,7 +1539,7 @@ static int irda_sendmsg_dgram(struct socket *sock, struct msghdr *msg,
 	lock_sock(sk);
 
 	if (sk->sk_shutdown & SEND_SHUTDOWN) {
-		send_sig(SIGPIPE, current, 0);
+		io_send_sig(SIGPIPE);
 		err = -EPIPE;
 		goto out;
 	}
@@ -1622,7 +1622,7 @@ static int irda_sendmsg_ultra(struct socket *sock, struct msghdr *msg,
 
 	err = -EPIPE;
 	if (sk->sk_shutdown & SEND_SHUTDOWN) {
-		send_sig(SIGPIPE, current, 0);
+		io_send_sig(SIGPIPE);
 		goto out;
 	}
 
diff --git a/net/netrom/af_netrom.c b/net/netrom/af_netrom.c
index ed212ff..b5eaecc 100644
--- a/net/netrom/af_netrom.c
+++ b/net/netrom/af_netrom.c
@@ -1044,7 +1044,7 @@ static int nr_sendmsg(struct socket *sock, struct msghdr *msg, size_t len)
 	}
 
 	if (sk->sk_shutdown & SEND_SHUTDOWN) {
-		send_sig(SIGPIPE, current, 0);
+		io_send_sig(SIGPIPE);
 		err = -EPIPE;
 		goto out;
 	}
diff --git a/net/rose/af_rose.c b/net/rose/af_rose.c
index 129d357..954725c 100644
--- a/net/rose/af_rose.c
+++ b/net/rose/af_rose.c
@@ -1065,7 +1065,7 @@ static int rose_sendmsg(struct socket *sock, struct msghdr *msg, size_t len)
 		return -EADDRNOTAVAIL;
 
 	if (sk->sk_shutdown & SEND_SHUTDOWN) {
-		send_sig(SIGPIPE, current, 0);
+		io_send_sig(SIGPIPE);
 		return -EPIPE;
 	}
 
diff --git a/net/sctp/socket.c b/net/sctp/socket.c
index ef1d90f..75bb437 100644
--- a/net/sctp/socket.c
+++ b/net/sctp/socket.c
@@ -1556,7 +1556,7 @@ static int sctp_error(struct sock *sk, int flags, int err)
 	if (err == -EPIPE)
 		err = sock_error(sk) ? : -EPIPE;
 	if (err == -EPIPE && !(flags & MSG_NOSIGNAL))
-		send_sig(SIGPIPE, current, 0);
+		io_send_sig(SIGPIPE);
 	return err;
 }
 
diff --git a/net/unix/af_unix.c b/net/unix/af_unix.c
index a4631477..a1d5cf8 100644
--- a/net/unix/af_unix.c
+++ b/net/unix/af_unix.c
@@ -1909,7 +1909,7 @@ pipe_err_free:
 	kfree_skb(skb);
 pipe_err:
 	if (sent == 0 && !(msg->msg_flags&MSG_NOSIGNAL))
-		send_sig(SIGPIPE, current, 0);
+		io_send_sig(SIGPIPE);
 	err = -EPIPE;
 out_err:
 	scm_destroy(&scm);
@@ -2026,7 +2026,7 @@ err_unlock:
 err:
 	kfree_skb(newskb);
 	if (send_sigpipe && !(flags & MSG_NOSIGNAL))
-		send_sig(SIGPIPE, current, 0);
+		io_send_sig(SIGPIPE);
 	if (!init_scm)
 		scm_destroy(&scm);
 	return err;
diff --git a/net/x25/af_x25.c b/net/x25/af_x25.c
index a750f33..102dd03 100644
--- a/net/x25/af_x25.c
+++ b/net/x25/af_x25.c
@@ -1103,7 +1103,7 @@ static int x25_sendmsg(struct socket *sock, struct msghdr *msg, size_t len)
 
 	rc = -EPIPE;
 	if (sk->sk_shutdown & SEND_SHUTDOWN) {
-		send_sig(SIGPIPE, current, 0);
+		io_send_sig(SIGPIPE);
 		goto out;
 	}
 
-- 
2.5.0


-- 
"Thought is the essence of where you are now."

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply related	[flat|nested] 133+ messages in thread

* [PATCH 01/13] signals: distinguish signals sent due to i/o via io_send_sig()
@ 2016-01-11 22:06   ` Benjamin LaHaise
  0 siblings, 0 replies; 133+ messages in thread
From: Benjamin LaHaise @ 2016-01-11 22:06 UTC (permalink / raw)
  To: linux-aio, linux-fsdevel, linux-kernel, linux-api, linux-mm
  Cc: Alexander Viro, Andrew Morton, Linus Torvalds

In preparation for thread based aio support, make the callers of
send_sig() that are sending a signal as a direct consequence of a
read or write operation (typically SIGPIPE or SIGXFSZ) use a
separate helper, io_send_sig().  This will make it possible for
the thread based aio operations to direct these signals to the
process that actually submitted the aio request.
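
The routing this enables — an aio worker thread delivering SIGPIPE and
friends to the task that submitted the request rather than to itself —
can be modeled in plain userspace C.  Everything below (struct task,
current_task, aio_signal_target, send_sig_model) is an illustrative
stand-in for the kernel's internals, not code from this series:

```c
#include <assert.h>
#include <stddef.h>

/* Minimal model of a task that can receive a signal. */
struct task { int pending_sig; };

static struct task *current_task;      /* stands in for the kernel's `current` */
static struct task *aio_signal_target; /* set while running work for a submitter */

/* Stand-in for send_sig(sig, task, priv): just record the signal. */
static int send_sig_model(int sig, struct task *t, int priv)
{
	(void)priv;
	t->pending_sig = sig;
	return 0;
}

/*
 * io_send_sig() model: by default the signal goes to the running task,
 * but when an aio worker has recorded the submitting task, the signal
 * is redirected there instead.
 */
static int io_send_sig(int sig)
{
	struct task *t = aio_signal_target ? aio_signal_target : current_task;
	return send_sig_model(sig, t, 0);
}
```

In the synchronous case the helper behaves exactly like
send_sig(sig, current, 0); only when a worker has registered a
submitter does the destination change, which is why converting the
callers first is a behavior-neutral step.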

Signed-off-by: Benjamin LaHaise <ben.lahaise@solacesystems.com>
Signed-off-by: Benjamin LaHaise <bcrl@kvack.org>
---
 drivers/gpu/drm/drm_lock.c     |  2 +-
 drivers/gpu/drm/ttm/ttm_lock.c |  6 +++---
 fs/attr.c                      |  2 +-
 fs/binfmt_flat.c               |  2 +-
 fs/fuse/dev.c                  |  2 +-
 fs/pipe.c                      |  4 ++--
 fs/splice.c                    |  8 ++++----
 include/linux/sched.h          |  1 +
 kernel/auditsc.c               |  6 +++---
 kernel/signal.c                | 14 ++++++++++++++
 mm/filemap.c                   |  6 ++++--
 net/atm/common.c               |  4 ++--
 net/ax25/af_ax25.c             |  2 +-
 net/caif/caif_socket.c         |  2 +-
 net/core/stream.c              |  2 +-
 net/decnet/af_decnet.c         |  2 +-
 net/irda/af_irda.c             |  4 ++--
 net/netrom/af_netrom.c         |  2 +-
 net/rose/af_rose.c             |  2 +-
 net/sctp/socket.c              |  2 +-
 net/unix/af_unix.c             |  4 ++--
 net/x25/af_x25.c               |  2 +-
 22 files changed, 49 insertions(+), 32 deletions(-)

diff --git a/drivers/gpu/drm/drm_lock.c b/drivers/gpu/drm/drm_lock.c
index daa2ff1..3565563 100644
--- a/drivers/gpu/drm/drm_lock.c
+++ b/drivers/gpu/drm/drm_lock.c
@@ -83,7 +83,7 @@ int drm_legacy_lock(struct drm_device *dev, void *data,
 		__set_current_state(TASK_INTERRUPTIBLE);
 		if (!master->lock.hw_lock) {
 			/* Device has been unregistered */
-			send_sig(SIGTERM, current, 0);
+			io_send_sig(SIGTERM);
 			ret = -EINTR;
 			break;
 		}
diff --git a/drivers/gpu/drm/ttm/ttm_lock.c b/drivers/gpu/drm/ttm/ttm_lock.c
index f154fb1..816be91 100644
--- a/drivers/gpu/drm/ttm/ttm_lock.c
+++ b/drivers/gpu/drm/ttm/ttm_lock.c
@@ -68,7 +68,7 @@ static bool __ttm_read_lock(struct ttm_lock *lock)
 
 	spin_lock(&lock->lock);
 	if (unlikely(lock->kill_takers)) {
-		send_sig(lock->signal, current, 0);
+		io_send_sig(lock->signal);
 		spin_unlock(&lock->lock);
 		return false;
 	}
@@ -101,7 +101,7 @@ static bool __ttm_read_trylock(struct ttm_lock *lock, bool *locked)
 
 	spin_lock(&lock->lock);
 	if (unlikely(lock->kill_takers)) {
-		send_sig(lock->signal, current, 0);
+		io_send_sig(lock->signal);
 		spin_unlock(&lock->lock);
 		return false;
 	}
@@ -151,7 +151,7 @@ static bool __ttm_write_lock(struct ttm_lock *lock)
 
 	spin_lock(&lock->lock);
 	if (unlikely(lock->kill_takers)) {
-		send_sig(lock->signal, current, 0);
+		io_send_sig(lock->signal);
 		spin_unlock(&lock->lock);
 		return false;
 	}
diff --git a/fs/attr.c b/fs/attr.c
index 6530ced..0c63049 100644
--- a/fs/attr.c
+++ b/fs/attr.c
@@ -118,7 +118,7 @@ int inode_newsize_ok(const struct inode *inode, loff_t offset)
 
 	return 0;
 out_sig:
-	send_sig(SIGXFSZ, current, 0);
+	io_send_sig(SIGXFSZ);
 out_big:
 	return -EFBIG;
 }
diff --git a/fs/binfmt_flat.c b/fs/binfmt_flat.c
index f723cd3..51cf839 100644
--- a/fs/binfmt_flat.c
+++ b/fs/binfmt_flat.c
@@ -373,7 +373,7 @@ calc_reloc(unsigned long r, struct lib_info *p, int curid, int internalp)
 
 failed:
 	printk(", killing %s!\n", current->comm);
-	send_sig(SIGSEGV, current, 0);
+	io_send_sig(SIGSEGV);
 
 	return RELOC_FAILED;
 }
diff --git a/fs/fuse/dev.c b/fs/fuse/dev.c
index ebb5e37..20ffc52 100644
--- a/fs/fuse/dev.c
+++ b/fs/fuse/dev.c
@@ -1391,7 +1391,7 @@ static ssize_t fuse_dev_splice_read(struct file *in, loff_t *ppos,
 	pipe_lock(pipe);
 
 	if (!pipe->readers) {
-		send_sig(SIGPIPE, current, 0);
+		io_send_sig(SIGPIPE);
 		if (!ret)
 			ret = -EPIPE;
 		goto out_unlock;
diff --git a/fs/pipe.c b/fs/pipe.c
index 42cf8dd..e55ed9a 100644
--- a/fs/pipe.c
+++ b/fs/pipe.c
@@ -351,7 +351,7 @@ pipe_write(struct kiocb *iocb, struct iov_iter *from)
 	__pipe_lock(pipe);
 
 	if (!pipe->readers) {
-		send_sig(SIGPIPE, current, 0);
+		io_send_sig(SIGPIPE);
 		ret = -EPIPE;
 		goto out;
 	}
@@ -386,7 +386,7 @@ pipe_write(struct kiocb *iocb, struct iov_iter *from)
 		int bufs;
 
 		if (!pipe->readers) {
-			send_sig(SIGPIPE, current, 0);
+			io_send_sig(SIGPIPE);
 			if (!ret)
 				ret = -EPIPE;
 			break;
diff --git a/fs/splice.c b/fs/splice.c
index 4cf700d..336db78 100644
--- a/fs/splice.c
+++ b/fs/splice.c
@@ -193,7 +193,7 @@ ssize_t splice_to_pipe(struct pipe_inode_info *pipe,
 
 	for (;;) {
 		if (!pipe->readers) {
-			send_sig(SIGPIPE, current, 0);
+			io_send_sig(SIGPIPE);
 			if (!ret)
 				ret = -EPIPE;
 			break;
@@ -1769,7 +1769,7 @@ static int opipe_prep(struct pipe_inode_info *pipe, unsigned int flags)
 
 	while (pipe->nrbufs >= pipe->buffers) {
 		if (!pipe->readers) {
-			send_sig(SIGPIPE, current, 0);
+			io_send_sig(SIGPIPE);
 			ret = -EPIPE;
 			break;
 		}
@@ -1820,7 +1820,7 @@ retry:
 
 	do {
 		if (!opipe->readers) {
-			send_sig(SIGPIPE, current, 0);
+			io_send_sig(SIGPIPE);
 			if (!ret)
 				ret = -EPIPE;
 			break;
@@ -1924,7 +1924,7 @@ static int link_pipe(struct pipe_inode_info *ipipe,
 
 	do {
 		if (!opipe->readers) {
-			send_sig(SIGPIPE, current, 0);
+			io_send_sig(SIGPIPE);
 			if (!ret)
 				ret = -EPIPE;
 			break;
diff --git a/include/linux/sched.h b/include/linux/sched.h
index edad7a4..6376d58 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -2502,6 +2502,7 @@ extern __must_check bool do_notify_parent(struct task_struct *, int);
 extern void __wake_up_parent(struct task_struct *p, struct task_struct *parent);
 extern void force_sig(int, struct task_struct *);
 extern int send_sig(int, struct task_struct *, int);
+extern int io_send_sig(int signal);
 extern int zap_other_threads(struct task_struct *p);
 extern struct sigqueue *sigqueue_alloc(void);
 extern void sigqueue_free(struct sigqueue *);
diff --git a/kernel/auditsc.c b/kernel/auditsc.c
index b86cc04..8b4a3ea 100644
--- a/kernel/auditsc.c
+++ b/kernel/auditsc.c
@@ -1025,7 +1025,7 @@ static int audit_log_single_execve_arg(struct audit_context *context,
 	 * any.
 	 */
 	if (WARN_ON_ONCE(len < 0 || len > MAX_ARG_STRLEN - 1)) {
-		send_sig(SIGKILL, current, 0);
+		io_send_sig(SIGKILL);
 		return -1;
 	}
 
@@ -1043,7 +1043,7 @@ static int audit_log_single_execve_arg(struct audit_context *context,
 		 */
 		if (ret) {
 			WARN_ON(1);
-			send_sig(SIGKILL, current, 0);
+			io_send_sig(SIGKILL);
 			return -1;
 		}
 		buf[to_send] = '\0';
@@ -1107,7 +1107,7 @@ static int audit_log_single_execve_arg(struct audit_context *context,
 			ret = 0;
 		if (ret) {
 			WARN_ON(1);
-			send_sig(SIGKILL, current, 0);
+			io_send_sig(SIGKILL);
 			return -1;
 		}
 		buf[to_send] = '\0';
diff --git a/kernel/signal.c b/kernel/signal.c
index f3f1f7a..7c14cb4 100644
--- a/kernel/signal.c
+++ b/kernel/signal.c
@@ -1422,6 +1422,20 @@ int send_sig_info(int sig, struct siginfo *info, struct task_struct *p)
 	return do_send_sig_info(sig, info, p, false);
 }
 
+/* io_send_sig: send a signal caused by an i/o operation
+ *
+ * Use this helper when a signal is being sent to the task that is responsible
+ * for the initiated i/o operation.  Most commonly this is used to send signals
+ * like SIGPIPE or SIGXFSZ that are the result of attempting a read or write
+ * operation.  This is used by aio to direct a signal to the correct task in
+ * the case of async operations.
+ */
+int io_send_sig(int sig)
+{
+	return send_sig(sig, current, 0);
+}
+EXPORT_SYMBOL(io_send_sig);
+
 #define __si_special(priv) \
 	((priv) ? SEND_SIG_PRIV : SEND_SIG_NOINFO)
 
diff --git a/mm/filemap.c b/mm/filemap.c
index 1bb0076..089ccd85 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -2343,7 +2343,7 @@ inline ssize_t generic_write_checks(struct kiocb *iocb, struct iov_iter *from)
 
 	if (limit != RLIM_INFINITY) {
 		if (iocb->ki_pos >= limit) {
-			send_sig(SIGXFSZ, current, 0);
+			io_send_sig(SIGXFSZ);
 			return -EFBIG;
 		}
 		iov_iter_truncate(from, limit - (unsigned long)pos);
@@ -2354,8 +2354,10 @@ inline ssize_t generic_write_checks(struct kiocb *iocb, struct iov_iter *from)
 	 */
 	if (unlikely(pos + iov_iter_count(from) > MAX_NON_LFS &&
 				!(file->f_flags & O_LARGEFILE))) {
-		if (pos >= MAX_NON_LFS)
+		if (pos >= MAX_NON_LFS) {
+			io_send_sig(SIGXFSZ);
 			return -EFBIG;
+		}
 		iov_iter_truncate(from, MAX_NON_LFS - (unsigned long)pos);
 	}
 
diff --git a/net/atm/common.c b/net/atm/common.c
index 49a872d..3eef736 100644
--- a/net/atm/common.c
+++ b/net/atm/common.c
@@ -591,7 +591,7 @@ int vcc_sendmsg(struct socket *sock, struct msghdr *m, size_t size)
 	    test_bit(ATM_VF_CLOSE, &vcc->flags) ||
 	    !test_bit(ATM_VF_READY, &vcc->flags)) {
 		error = -EPIPE;
-		send_sig(SIGPIPE, current, 0);
+		io_send_sig(SIGPIPE);
 		goto out;
 	}
 	if (!size) {
@@ -620,7 +620,7 @@ int vcc_sendmsg(struct socket *sock, struct msghdr *m, size_t size)
 		    test_bit(ATM_VF_CLOSE, &vcc->flags) ||
 		    !test_bit(ATM_VF_READY, &vcc->flags)) {
 			error = -EPIPE;
-			send_sig(SIGPIPE, current, 0);
+			io_send_sig(SIGPIPE);
 			break;
 		}
 		prepare_to_wait(sk_sleep(sk), &wait, TASK_INTERRUPTIBLE);
diff --git a/net/ax25/af_ax25.c b/net/ax25/af_ax25.c
index fbd0acf..8dfd84c 100644
--- a/net/ax25/af_ax25.c
+++ b/net/ax25/af_ax25.c
@@ -1457,7 +1457,7 @@ static int ax25_sendmsg(struct socket *sock, struct msghdr *msg, size_t len)
 	}
 
 	if (sk->sk_shutdown & SEND_SHUTDOWN) {
-		send_sig(SIGPIPE, current, 0);
+		io_send_sig(SIGPIPE);
 		err = -EPIPE;
 		goto out;
 	}
diff --git a/net/caif/caif_socket.c b/net/caif/caif_socket.c
index aa209b1..ba8d8e2 100644
--- a/net/caif/caif_socket.c
+++ b/net/caif/caif_socket.c
@@ -663,7 +663,7 @@ static int caif_stream_sendmsg(struct socket *sock, struct msghdr *msg,
 
 pipe_err:
 	if (sent == 0 && !(msg->msg_flags&MSG_NOSIGNAL))
-		send_sig(SIGPIPE, current, 0);
+		io_send_sig(SIGPIPE);
 	err = -EPIPE;
 out_err:
 	return sent ? : err;
diff --git a/net/core/stream.c b/net/core/stream.c
index b96f7a7..6b24f6d 100644
--- a/net/core/stream.c
+++ b/net/core/stream.c
@@ -182,7 +182,7 @@ int sk_stream_error(struct sock *sk, int flags, int err)
 	if (err == -EPIPE)
 		err = sock_error(sk) ? : -EPIPE;
 	if (err == -EPIPE && !(flags & MSG_NOSIGNAL))
-		send_sig(SIGPIPE, current, 0);
+		io_send_sig(SIGPIPE);
 	return err;
 }
 EXPORT_SYMBOL(sk_stream_error);
diff --git a/net/decnet/af_decnet.c b/net/decnet/af_decnet.c
index 13d6b1a..47ca404 100644
--- a/net/decnet/af_decnet.c
+++ b/net/decnet/af_decnet.c
@@ -1954,7 +1954,7 @@ static int dn_sendmsg(struct socket *sock, struct msghdr *msg, size_t size)
 	if (sk->sk_shutdown & SEND_SHUTDOWN) {
 		err = -EPIPE;
 		if (!(flags & MSG_NOSIGNAL))
-			send_sig(SIGPIPE, current, 0);
+			io_send_sig(SIGPIPE);
 		goto out_err;
 	}
 
diff --git a/net/irda/af_irda.c b/net/irda/af_irda.c
index 923abd6..f9c6b55 100644
--- a/net/irda/af_irda.c
+++ b/net/irda/af_irda.c
@@ -1539,7 +1539,7 @@ static int irda_sendmsg_dgram(struct socket *sock, struct msghdr *msg,
 	lock_sock(sk);
 
 	if (sk->sk_shutdown & SEND_SHUTDOWN) {
-		send_sig(SIGPIPE, current, 0);
+		io_send_sig(SIGPIPE);
 		err = -EPIPE;
 		goto out;
 	}
@@ -1622,7 +1622,7 @@ static int irda_sendmsg_ultra(struct socket *sock, struct msghdr *msg,
 
 	err = -EPIPE;
 	if (sk->sk_shutdown & SEND_SHUTDOWN) {
-		send_sig(SIGPIPE, current, 0);
+		io_send_sig(SIGPIPE);
 		goto out;
 	}
 
diff --git a/net/netrom/af_netrom.c b/net/netrom/af_netrom.c
index ed212ff..b5eaecc 100644
--- a/net/netrom/af_netrom.c
+++ b/net/netrom/af_netrom.c
@@ -1044,7 +1044,7 @@ static int nr_sendmsg(struct socket *sock, struct msghdr *msg, size_t len)
 	}
 
 	if (sk->sk_shutdown & SEND_SHUTDOWN) {
-		send_sig(SIGPIPE, current, 0);
+		io_send_sig(SIGPIPE);
 		err = -EPIPE;
 		goto out;
 	}
diff --git a/net/rose/af_rose.c b/net/rose/af_rose.c
index 129d357..954725c 100644
--- a/net/rose/af_rose.c
+++ b/net/rose/af_rose.c
@@ -1065,7 +1065,7 @@ static int rose_sendmsg(struct socket *sock, struct msghdr *msg, size_t len)
 		return -EADDRNOTAVAIL;
 
 	if (sk->sk_shutdown & SEND_SHUTDOWN) {
-		send_sig(SIGPIPE, current, 0);
+		io_send_sig(SIGPIPE);
 		return -EPIPE;
 	}
 
diff --git a/net/sctp/socket.c b/net/sctp/socket.c
index ef1d90f..75bb437 100644
--- a/net/sctp/socket.c
+++ b/net/sctp/socket.c
@@ -1556,7 +1556,7 @@ static int sctp_error(struct sock *sk, int flags, int err)
 	if (err == -EPIPE)
 		err = sock_error(sk) ? : -EPIPE;
 	if (err == -EPIPE && !(flags & MSG_NOSIGNAL))
-		send_sig(SIGPIPE, current, 0);
+		io_send_sig(SIGPIPE);
 	return err;
 }
 
diff --git a/net/unix/af_unix.c b/net/unix/af_unix.c
index a4631477..a1d5cf8 100644
--- a/net/unix/af_unix.c
+++ b/net/unix/af_unix.c
@@ -1909,7 +1909,7 @@ pipe_err_free:
 	kfree_skb(skb);
 pipe_err:
 	if (sent == 0 && !(msg->msg_flags&MSG_NOSIGNAL))
-		send_sig(SIGPIPE, current, 0);
+		io_send_sig(SIGPIPE);
 	err = -EPIPE;
 out_err:
 	scm_destroy(&scm);
@@ -2026,7 +2026,7 @@ err_unlock:
 err:
 	kfree_skb(newskb);
 	if (send_sigpipe && !(flags & MSG_NOSIGNAL))
-		send_sig(SIGPIPE, current, 0);
+		io_send_sig(SIGPIPE);
 	if (!init_scm)
 		scm_destroy(&scm);
 	return err;
diff --git a/net/x25/af_x25.c b/net/x25/af_x25.c
index a750f33..102dd03 100644
--- a/net/x25/af_x25.c
+++ b/net/x25/af_x25.c
@@ -1103,7 +1103,7 @@ static int x25_sendmsg(struct socket *sock, struct msghdr *msg, size_t len)
 
 	rc = -EPIPE;
 	if (sk->sk_shutdown & SEND_SHUTDOWN) {
-		send_sig(SIGPIPE, current, 0);
+		io_send_sig(SIGPIPE);
 		goto out;
 	}
 
-- 
2.5.0


-- 
"Thought is the essence of where you are now."

--
To unsubscribe, send a message with 'unsubscribe linux-aio' in
the body to majordomo@kvack.org.  For more info on Linux AIO,
see: http://www.kvack.org/aio/
Don't email: <a href=mailto:"aart@kvack.org">aart@kvack.org</a>

^ permalink raw reply related	[flat|nested] 133+ messages in thread

* [PATCH 02/13] aio: add aio_get_mm() helper
@ 2016-01-11 22:06   ` Benjamin LaHaise
  0 siblings, 0 replies; 133+ messages in thread
From: Benjamin LaHaise @ 2016-01-11 22:06 UTC (permalink / raw)
  To: linux-aio, linux-fsdevel, linux-kernel, linux-api, linux-mm
  Cc: Alexander Viro, Andrew Morton, Linus Torvalds

For various async operations, it is necessary to have a way of finding
the address space to use for accessing user memory.  Add the helper
struct mm_struct *aio_get_mm(struct kiocb *) to address this use-case.
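
The helper's trick — recognizing an aio-owned kiocb by its completion
callback, then recovering the enclosing context via container_of — can be
sketched as a userspace model.  The simplified structs and the manual
offsetof() arithmetic below are stand-ins for the kernel's types and its
container_of() macro, not code from this series:

```c
#include <assert.h>
#include <stddef.h>

/* Simplified stand-ins for the kernel structures involved. */
struct mm_struct { int id; };
struct kiocb { void (*ki_complete)(struct kiocb *, long, long); };
struct kioctx { struct mm_struct *mm; };
struct aio_kiocb {
	struct kiocb	common;	/* embedded, so container_of can recover us */
	struct kioctx	*ki_ctx;
};

/* The aio completion callback acts as the type tag. */
static void aio_complete(struct kiocb *k, long res, long res2)
{
	(void)k; (void)res; (void)res2;
}

/*
 * aio_get_mm() model: only requests whose completion routine is
 * aio_complete() are known to be embedded in an aio_kiocb, so only
 * for those can the submitter's mm be recovered.
 */
static struct mm_struct *aio_get_mm(struct kiocb *req)
{
	if (req->ki_complete == aio_complete) {
		struct aio_kiocb *iocb = (struct aio_kiocb *)
			((char *)req - offsetof(struct aio_kiocb, common));
		return iocb->ki_ctx->mm;
	}
	return NULL;	/* not submitted via aio: no stored mm */
}
```

A worker thread would use the returned mm to access the submitter's
address space; for a kiocb that came from a synchronous syscall the
helper returns NULL and the caller simply uses its own context.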

Signed-off-by: Benjamin LaHaise <ben.lahaise@solacesystems.com>
Signed-off-by: Benjamin LaHaise <bcrl@kvack.org>
---
 fs/aio.c            | 15 +++++++++++++++
 include/linux/aio.h |  2 ++
 2 files changed, 17 insertions(+)

diff --git a/fs/aio.c b/fs/aio.c
index e0d5398..2cd5071 100644
--- a/fs/aio.c
+++ b/fs/aio.c
@@ -154,6 +154,7 @@ struct kioctx {
 	struct file		*aio_ring_file;
 
 	unsigned		id;
+	struct mm_struct	*mm;
 };
 
 /*
@@ -202,6 +203,8 @@ static struct vfsmount *aio_mnt;
 static const struct file_operations aio_ring_fops;
 static const struct address_space_operations aio_ctx_aops;
 
+static void aio_complete(struct kiocb *kiocb, long res, long res2);
+
 static struct file *aio_private_file(struct kioctx *ctx, loff_t nr_pages)
 {
 	struct qstr this = QSTR_INIT("[aio]", 5);
@@ -568,6 +571,17 @@ static int kiocb_cancel(struct aio_kiocb *kiocb)
 	return cancel(&kiocb->common);
 }
 
+struct mm_struct *aio_get_mm(struct kiocb *req)
+{
+	if (req->ki_complete == aio_complete) {
+		struct aio_kiocb *iocb;
+
+		iocb = container_of(req, struct aio_kiocb, common);
+		return iocb->ki_ctx->mm;
+	}
+	return NULL;
+}
+
 static void free_ioctx(struct work_struct *work)
 {
 	struct kioctx *ctx = container_of(work, struct kioctx, free_work);
@@ -719,6 +733,7 @@ static struct kioctx *ioctx_alloc(unsigned nr_events)
 		return ERR_PTR(-ENOMEM);
 
 	ctx->max_reqs = nr_events;
+	ctx->mm = mm;
 
 	spin_lock_init(&ctx->ctx_lock);
 	spin_lock_init(&ctx->completion_lock);
diff --git a/include/linux/aio.h b/include/linux/aio.h
index 9eb42db..c5791d4 100644
--- a/include/linux/aio.h
+++ b/include/linux/aio.h
@@ -17,6 +17,7 @@ extern void exit_aio(struct mm_struct *mm);
 extern long do_io_submit(aio_context_t ctx_id, long nr,
 			 struct iocb __user *__user *iocbpp, bool compat);
 void kiocb_set_cancel_fn(struct kiocb *req, kiocb_cancel_fn *cancel);
+struct mm_struct *aio_get_mm(struct kiocb *req);
 #else
 static inline void exit_aio(struct mm_struct *mm) { }
 static inline long do_io_submit(aio_context_t ctx_id, long nr,
@@ -24,6 +25,7 @@ static inline long do_io_submit(aio_context_t ctx_id, long nr,
 				bool compat) { return 0; }
 static inline void kiocb_set_cancel_fn(struct kiocb *req,
 				       kiocb_cancel_fn *cancel) { }
+static inline struct mm_struct *aio_get_mm(struct kiocb *req) { return NULL; }
 #endif /* CONFIG_AIO */
 
 /* for sysctl: */
-- 
2.5.0


-- 
"Thought is the essence of where you are now."

^ permalink raw reply related	[flat|nested] 133+ messages in thread

see: http://www.kvack.org/aio/
Don't email: aart@kvack.org


* [PATCH 03/13] aio: for async operations, make the iter argument persistent
@ 2016-01-11 22:06   ` Benjamin LaHaise
  0 siblings, 0 replies; 133+ messages in thread
From: Benjamin LaHaise @ 2016-01-11 22:06 UTC (permalink / raw)
  To: linux-aio, linux-fsdevel, linux-kernel, linux-api, linux-mm
  Cc: Alexander Viro, Andrew Morton, Linus Torvalds

When implementing async read/write operations, having to duplicate the
iter argument before passing it to another thread leads to duplicated
code.  There is no reason the iter and iovec used by async operations
issued by the aio core need to live on the stack when an aio_kiocb is
already allocated for each operation, so put them into the aio_kiocb
instead.

Signed-off-by: Benjamin LaHaise <ben.lahaise@solacesystems.com>
Signed-off-by: Benjamin LaHaise <bcrl@kvack.org>
---
 fs/aio.c | 41 +++++++++++++++++++++--------------------
 1 file changed, 21 insertions(+), 20 deletions(-)

diff --git a/fs/aio.c b/fs/aio.c
index 2cd5071..fc453ca 100644
--- a/fs/aio.c
+++ b/fs/aio.c
@@ -187,6 +187,10 @@ struct aio_kiocb {
 	 * this is the underlying eventfd context to deliver events to.
 	 */
 	struct eventfd_ctx	*ki_eventfd;
+
+	struct iov_iter		ki_iter;
+	struct iovec		*ki_iovec;
+	struct iovec		ki_inline_vecs[UIO_FASTIOV];
 };
 
 /*------ sysctl variables----*/
@@ -1026,6 +1030,7 @@ static inline struct aio_kiocb *aio_get_req(struct kioctx *ctx)
 	percpu_ref_get(&ctx->reqs);
 
 	req->ki_ctx = ctx;
+	req->ki_iovec = req->ki_inline_vecs;
 	return req;
 out_put:
 	put_reqs_available(ctx, 1);
@@ -1038,6 +1043,8 @@ static void kiocb_free(struct aio_kiocb *req)
 		fput(req->common.ki_filp);
 	if (req->ki_eventfd != NULL)
 		eventfd_ctx_put(req->ki_eventfd);
+	if (req->ki_iovec != req->ki_inline_vecs)
+		kfree(req->ki_iovec);
 	kmem_cache_free(kiocb_cachep, req);
 }
 
@@ -1417,16 +1424,14 @@ static int aio_setup_vectored_rw(int rw, char __user *buf, size_t len,
  * aio_run_iocb:
  *	Performs the initial checks and io submission.
  */
-static ssize_t aio_run_iocb(struct kiocb *req, unsigned opcode,
+static ssize_t aio_run_iocb(struct aio_kiocb *req, unsigned opcode,
 			    char __user *buf, size_t len, bool compat)
 {
-	struct file *file = req->ki_filp;
+	struct file *file = req->common.ki_filp;
 	ssize_t ret;
 	int rw;
 	fmode_t mode;
 	rw_iter_op *iter_op;
-	struct iovec inline_vecs[UIO_FASTIOV], *iovec = inline_vecs;
-	struct iov_iter iter;
 
 	switch (opcode) {
 	case IOCB_CMD_PREAD:
@@ -1451,43 +1456,39 @@ rw_common:
 
 		if (opcode == IOCB_CMD_PREADV || opcode == IOCB_CMD_PWRITEV)
 			ret = aio_setup_vectored_rw(rw, buf, len,
-						&iovec, compat, &iter);
+						    &req->ki_iovec, compat,
+						    &req->ki_iter);
 		else {
-			ret = import_single_range(rw, buf, len, iovec, &iter);
-			iovec = NULL;
+			ret = import_single_range(rw, buf, len, req->ki_iovec,
+						  &req->ki_iter);
 		}
 		if (!ret)
-			ret = rw_verify_area(rw, file, &req->ki_pos,
-					     iov_iter_count(&iter));
-		if (ret < 0) {
-			kfree(iovec);
+			ret = rw_verify_area(rw, file, &req->common.ki_pos,
+					     iov_iter_count(&req->ki_iter));
+		if (ret < 0)
 			return ret;
-		}
-
-		len = ret;
 
 		if (rw == WRITE)
 			file_start_write(file);
 
-		ret = iter_op(req, &iter);
+		ret = iter_op(&req->common, &req->ki_iter);
 
 		if (rw == WRITE)
 			file_end_write(file);
-		kfree(iovec);
 		break;
 
 	case IOCB_CMD_FDSYNC:
 		if (!file->f_op->aio_fsync)
 			return -EINVAL;
 
-		ret = file->f_op->aio_fsync(req, 1);
+		ret = file->f_op->aio_fsync(&req->common, 1);
 		break;
 
 	case IOCB_CMD_FSYNC:
 		if (!file->f_op->aio_fsync)
 			return -EINVAL;
 
-		ret = file->f_op->aio_fsync(req, 0);
+		ret = file->f_op->aio_fsync(&req->common, 0);
 		break;
 
 	default:
@@ -1504,7 +1505,7 @@ rw_common:
 			     ret == -ERESTARTNOHAND ||
 			     ret == -ERESTART_RESTARTBLOCK))
 			ret = -EINTR;
-		aio_complete(req, ret, 0);
+		aio_complete(&req->common, ret, 0);
 	}
 
 	return 0;
@@ -1571,7 +1572,7 @@ static int io_submit_one(struct kioctx *ctx, struct iocb __user *user_iocb,
 	req->ki_user_iocb = user_iocb;
 	req->ki_user_data = iocb->aio_data;
 
-	ret = aio_run_iocb(&req->common, iocb->aio_lio_opcode,
+	ret = aio_run_iocb(req, iocb->aio_lio_opcode,
 			   (char __user *)(unsigned long)iocb->aio_buf,
 			   iocb->aio_nbytes,
 			   compat);
-- 
2.5.0



* [PATCH 04/13] signals: add and use aio_get_task() to direct signals sent via io_send_sig()
@ 2016-01-11 22:07   ` Benjamin LaHaise
  0 siblings, 0 replies; 133+ messages in thread
From: Benjamin LaHaise @ 2016-01-11 22:07 UTC (permalink / raw)
  To: linux-aio, linux-fsdevel, linux-kernel, linux-api, linux-mm
  Cc: Alexander Viro, Andrew Morton, Linus Torvalds

When a signal is triggered by an i/o, io_send_sig() needs to deliver
the signal to the task that issued the i/o.  Prepare for thread-based
aio by annotating task_struct with a struct kiocb pointer that enables
io_send_sig() to direct these signals to the submitter of the aio.

Signed-off-by: Benjamin LaHaise <ben.lahaise@solacesystems.com>
Signed-off-by: Benjamin LaHaise <bcrl@kvack.org>
---
 fs/aio.c              | 16 ++++++++++++++++
 include/linux/aio.h   |  3 +++
 include/linux/sched.h |  5 +++++
 kernel/signal.c       |  8 +++++++-
 4 files changed, 31 insertions(+), 1 deletion(-)

diff --git a/fs/aio.c b/fs/aio.c
index fc453ca..55c8ff5 100644
--- a/fs/aio.c
+++ b/fs/aio.c
@@ -191,6 +191,9 @@ struct aio_kiocb {
 	struct iov_iter		ki_iter;
 	struct iovec		*ki_iovec;
 	struct iovec		ki_inline_vecs[UIO_FASTIOV];
+
+	/* Fields used for threaded aio helper. */
+	struct task_struct	*ki_submit_task;
 };
 
 /*------ sysctl variables----*/
@@ -586,6 +589,17 @@ struct mm_struct *aio_get_mm(struct kiocb *req)
 	return NULL;
 }
 
+struct task_struct *aio_get_task(struct kiocb *req)
+{
+	if (req->ki_complete == aio_complete) {
+		struct aio_kiocb *iocb;
+
+		iocb = container_of(req, struct aio_kiocb, common);
+		return iocb->ki_submit_task;
+	}
+	return current;
+}
+
 static void free_ioctx(struct work_struct *work)
 {
 	struct kioctx *ctx = container_of(work, struct kioctx, free_work);
@@ -1045,6 +1059,8 @@ static void kiocb_free(struct aio_kiocb *req)
 		eventfd_ctx_put(req->ki_eventfd);
 	if (req->ki_iovec != req->ki_inline_vecs)
 		kfree(req->ki_iovec);
+	if (req->ki_submit_task)
+		put_task_struct(req->ki_submit_task);
 	kmem_cache_free(kiocb_cachep, req);
 }
 
diff --git a/include/linux/aio.h b/include/linux/aio.h
index c5791d4..9a62e8a 100644
--- a/include/linux/aio.h
+++ b/include/linux/aio.h
@@ -18,6 +18,7 @@ extern long do_io_submit(aio_context_t ctx_id, long nr,
 			 struct iocb __user *__user *iocbpp, bool compat);
 void kiocb_set_cancel_fn(struct kiocb *req, kiocb_cancel_fn *cancel);
 struct mm_struct *aio_get_mm(struct kiocb *req);
+struct task_struct *aio_get_task(struct kiocb *req);
 #else
 static inline void exit_aio(struct mm_struct *mm) { }
 static inline long do_io_submit(aio_context_t ctx_id, long nr,
@@ -26,6 +27,8 @@ static inline long do_io_submit(aio_context_t ctx_id, long nr,
 static inline void kiocb_set_cancel_fn(struct kiocb *req,
 				       kiocb_cancel_fn *cancel) { }
 static inline struct mm_struct *aio_get_mm(struct kiocb *req) { return NULL; }
+static inline struct task_struct *aio_get_task(struct kiocb *req)
+{ return current; }
 #endif /* CONFIG_AIO */
 
 /* for sysctl: */
diff --git a/include/linux/sched.h b/include/linux/sched.h
index 6376d58..bdbf11b 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1633,6 +1633,11 @@ struct task_struct {
 /* journalling filesystem info */
 	void *journal_info;
 
+/* threaded aio info */
+#if IS_ENABLED(CONFIG_AIO)
+	struct kiocb *kiocb;
+#endif
+
 /* stacked block device info */
 	struct bio_list *bio_list;
 
diff --git a/kernel/signal.c b/kernel/signal.c
index 7c14cb4..5da9180 100644
--- a/kernel/signal.c
+++ b/kernel/signal.c
@@ -34,6 +34,7 @@
 #include <linux/compat.h>
 #include <linux/cn_proc.h>
 #include <linux/compiler.h>
+#include <linux/aio.h>
 
 #define CREATE_TRACE_POINTS
 #include <trace/events/signal.h>
@@ -1432,7 +1433,12 @@ int send_sig_info(int sig, struct siginfo *info, struct task_struct *p)
  */
 int io_send_sig(int sig)
 {
-	return send_sig(sig, current, 0);
+	struct task_struct *task = current;
+#if IS_ENABLED(CONFIG_AIO)
+	if (task->kiocb)
+		task = aio_get_task(task->kiocb);
+#endif
+	return send_sig(sig, task, 0);
 }
 EXPORT_SYMBOL(io_send_sig);
 
-- 
2.5.0




* [PATCH 05/13] fs: make do_loop_readv_writev() non-static
@ 2016-01-11 22:07   ` Benjamin LaHaise
  0 siblings, 0 replies; 133+ messages in thread
From: Benjamin LaHaise @ 2016-01-11 22:07 UTC (permalink / raw)
  To: linux-aio, linux-fsdevel, linux-kernel, linux-api, linux-mm
  Cc: Alexander Viro, Andrew Morton, Linus Torvalds

The threaded aio helper code needs to be able to call
do_loop_readv_writev() to perform i/o on file_operations that do not have
read_iter or write_iter methods.  Make do_loop_readv_writev() non-static
and move its prototype into fs/internal.h.

Signed-off-by: Benjamin LaHaise <ben.lahaise@solacesystems.com>
Signed-off-by: Benjamin LaHaise <bcrl@kvack.org>
---
 fs/internal.h   | 6 ++++++
 fs/read_write.c | 5 +----
 2 files changed, 7 insertions(+), 4 deletions(-)

diff --git a/fs/internal.h b/fs/internal.h
index 71859c4d..57b6010 100644
--- a/fs/internal.h
+++ b/fs/internal.h
@@ -16,6 +16,9 @@ struct path;
 struct mount;
 struct shrink_control;
 
+typedef ssize_t (*io_fn_t)(struct file *, char __user *, size_t, loff_t *);
+typedef ssize_t (*iter_fn_t)(struct kiocb *, struct iov_iter *);
+
 /*
  * block_dev.c
  */
@@ -135,6 +138,9 @@ extern long prune_dcache_sb(struct super_block *sb, struct shrink_control *sc);
  * read_write.c
  */
 extern int rw_verify_area(int, struct file *, const loff_t *, size_t);
+extern ssize_t do_loop_readv_writev(struct file *filp, struct iov_iter *iter,
+				    loff_t *ppos, io_fn_t fn);
+
 
 /*
  * pipe.c
diff --git a/fs/read_write.c b/fs/read_write.c
index 819ef3f..36344ff 100644
--- a/fs/read_write.c
+++ b/fs/read_write.c
@@ -21,9 +21,6 @@
 #include <asm/uaccess.h>
 #include <asm/unistd.h>
 
-typedef ssize_t (*io_fn_t)(struct file *, char __user *, size_t, loff_t *);
-typedef ssize_t (*iter_fn_t)(struct kiocb *, struct iov_iter *);
-
 const struct file_operations generic_ro_fops = {
 	.llseek		= generic_file_llseek,
 	.read_iter	= generic_file_read_iter,
@@ -668,7 +665,7 @@ static ssize_t do_iter_readv_writev(struct file *filp, struct iov_iter *iter,
 }
 
 /* Do it by hand, with file-ops */
-static ssize_t do_loop_readv_writev(struct file *filp, struct iov_iter *iter,
+ssize_t do_loop_readv_writev(struct file *filp, struct iov_iter *iter,
 		loff_t *ppos, io_fn_t fn)
 {
 	ssize_t ret = 0;
-- 
2.5.0




* [PATCH 06/13] aio: add queue_work() based threaded aio support
  2016-01-11 22:06 ` Benjamin LaHaise
@ 2016-01-11 22:07   ` Benjamin LaHaise
  -1 siblings, 0 replies; 133+ messages in thread
From: Benjamin LaHaise @ 2016-01-11 22:07 UTC (permalink / raw)
  To: linux-aio, linux-fsdevel, linux-kernel, linux-api, linux-mm
  Cc: Alexander Viro, Andrew Morton, Linus Torvalds

Add support for performing asynchronous reads and writes via kernel
threads by way of the queue_work() functionality.  This enables fully
asynchronous and cancellable reads and writes for any file or device in
the kernel.  Cancellation is implemented by sending a SIGKILL to the
kernel thread executing the async operation.  So long as the read or
write operation can be interrupted by signals, the AIO request can be
cancelled.

This functionality is currently disabled by default until the DoS
implications of having user-controlled kernel thread creation are fully
understood.  When compiled into the kernel, this functionality can be
enabled by setting the fs.aio-auto-threads sysctl to 1.  It is expected
that the feature will be enabled by default in a future kernel version.

Signed-off-by: Benjamin LaHaise <ben.lahaise@solacesystems.com>
Signed-off-by: Benjamin LaHaise <bcrl@kvack.org>
---
 fs/aio.c            | 236 ++++++++++++++++++++++++++++++++++++++++++++++++++++
 include/linux/aio.h |   4 +
 include/linux/fs.h  |   2 +
 init/Kconfig        |  13 +++
 kernel/sysctl.c     |   9 ++
 5 files changed, 264 insertions(+)

diff --git a/fs/aio.c b/fs/aio.c
index 55c8ff5..88af450 100644
--- a/fs/aio.c
+++ b/fs/aio.c
@@ -48,6 +48,7 @@
 
 #define AIO_RING_MAGIC			0xa10a10a1
 #define AIO_RING_COMPAT_FEATURES	1
+#define AIO_RING_COMPAT_THREADED	2
 #define AIO_RING_INCOMPAT_FEATURES	0
 struct aio_ring {
 	unsigned	id;	/* kernel internal index number */
@@ -157,6 +158,9 @@ struct kioctx {
 	struct mm_struct	*mm;
 };
 
+struct aio_kiocb;
+typedef long (*aio_thread_work_fn_t)(struct aio_kiocb *iocb);
+
 /*
  * We use ki_cancel == KIOCB_CANCELLED to indicate that a kiocb has been either
  * cancelled or completed (this makes a certain amount of sense because
@@ -194,12 +198,21 @@ struct aio_kiocb {
 
 	/* Fields used for threaded aio helper. */
 	struct task_struct	*ki_submit_task;
+#if IS_ENABLED(CONFIG_AIO_THREAD)
+	struct task_struct	*ki_cancel_task;
+	unsigned long		ki_rlimit_fsize;
+	aio_thread_work_fn_t	ki_work_fn;
+	struct work_struct	ki_work;
+#endif
 };
 
 /*------ sysctl variables----*/
 static DEFINE_SPINLOCK(aio_nr_lock);
 unsigned long aio_nr;		/* current system wide number of aio requests */
 unsigned long aio_max_nr = 0x10000; /* system wide maximum number of aio requests */
+#if IS_ENABLED(CONFIG_AIO_THREAD)
+unsigned long aio_auto_threads;	/* Currently disabled by default */
+#endif
 /*----end sysctl variables---*/
 
 static struct kmem_cache	*kiocb_cachep;
@@ -212,6 +225,15 @@ static const struct address_space_operations aio_ctx_aops;
 
 static void aio_complete(struct kiocb *kiocb, long res, long res2);
 
+static bool aio_may_use_threads(void)
+{
+#if IS_ENABLED(CONFIG_AIO_THREAD)
+	return !!(aio_auto_threads & 1);
+#else
+	return false;
+#endif
+}
+
 static struct file *aio_private_file(struct kioctx *ctx, loff_t nr_pages)
 {
 	struct qstr this = QSTR_INIT("[aio]", 5);
@@ -528,6 +550,8 @@ static int aio_setup_ring(struct kioctx *ctx)
 	ring->head = ring->tail = 0;
 	ring->magic = AIO_RING_MAGIC;
 	ring->compat_features = AIO_RING_COMPAT_FEATURES;
+	if (aio_may_use_threads())
+		ring->compat_features |= AIO_RING_COMPAT_THREADED;
 	ring->incompat_features = AIO_RING_INCOMPAT_FEATURES;
 	ring->header_length = sizeof(struct aio_ring);
 	kunmap_atomic(ring);
@@ -1436,6 +1460,202 @@ static int aio_setup_vectored_rw(int rw, char __user *buf, size_t len,
 				len, UIO_FASTIOV, iovec, iter);
 }
 
+#if IS_ENABLED(CONFIG_AIO_THREAD)
+/* aio_thread_queue_iocb_cancel_early:
+ *	Early stage cancellation helper function for threaded aios.  This
+ *	is used prior to the iocb being assigned to a worker thread.
+ */
+static int aio_thread_queue_iocb_cancel_early(struct kiocb *iocb)
+{
+	return 0;
+}
+
+/* aio_thread_queue_iocb_cancel:
+ *	Late stage cancellation method for threaded aios.  Once an iocb is
+ *	assigned to a worker thread, we use a fatal signal to interrupt an
+ *	in-progress operation.
+ */
+static int aio_thread_queue_iocb_cancel(struct kiocb *kiocb)
+{
+	struct aio_kiocb *iocb = container_of(kiocb, struct aio_kiocb, common);
+
+	if (iocb->ki_cancel_task) {
+		force_sig(SIGKILL, iocb->ki_cancel_task);
+		return 0;
+	}
+	return -EAGAIN;
+}
+
+/* aio_thread_fn:
+ *	Entry point for worker to perform threaded aio.  Handles issues
+ *	arising due to cancellation using signals.
+ */
+static void aio_thread_fn(struct work_struct *work)
+{
+	struct aio_kiocb *iocb = container_of(work, struct aio_kiocb, ki_work);
+	kiocb_cancel_fn *old_cancel;
+	long ret;
+
+	iocb->ki_cancel_task = current;
+	current->kiocb = &iocb->common;		/* For io_send_sig(). */
+	WARN_ON(atomic_read(&current->signal->sigcnt) != 1);
+
+	/* Check for early stage cancellation and switch to late stage
+	 * cancellation if it has not already occurred.
+	 */
+	old_cancel = cmpxchg(&iocb->ki_cancel,
+			     aio_thread_queue_iocb_cancel_early,
+			     aio_thread_queue_iocb_cancel);
+	if (old_cancel != KIOCB_CANCELLED)
+		ret = iocb->ki_work_fn(iocb);
+	else
+		ret = -EINTR;
+
+	current->kiocb = NULL;
+	if (unlikely(ret == -ERESTARTSYS || ret == -ERESTARTNOINTR ||
+		     ret == -ERESTARTNOHAND || ret == -ERESTART_RESTARTBLOCK))
+		ret = -EINTR;
+
+	/* Completion serializes cancellation by taking ctx_lock, so
+	 * aio_complete() will not return until after force_sig() in
+	 * aio_thread_queue_iocb_cancel().  This should ensure that
+	 * the signal is pending before being flushed in this thread.
+	 */
+	aio_complete(&iocb->common, ret, 0);
+	if (fatal_signal_pending(current))
+		flush_signals(current);
+}
+
+#define AIO_THREAD_NEED_TASK	0x0001	/* Need aio_kiocb->ki_submit_task */
+
+/* aio_thread_queue_iocb
+ *	Queues an aio_kiocb for dispatch to a worker thread.  Prepares the
+ *	aio_kiocb for cancellation.  The caller must provide a function to
+ *	execute the operation in work_fn.  The flags may be provided as an
+ *	OR'ed set of AIO_THREAD_xxx flags.
+ */
+static ssize_t aio_thread_queue_iocb(struct aio_kiocb *iocb,
+				     aio_thread_work_fn_t work_fn,
+				     unsigned flags)
+{
+	INIT_WORK(&iocb->ki_work, aio_thread_fn);
+	iocb->ki_work_fn = work_fn;
+	if (flags & AIO_THREAD_NEED_TASK) {
+		iocb->ki_submit_task = current;
+		get_task_struct(iocb->ki_submit_task);
+	}
+
+	/* Cancellation needs to be always available for operations performed
+	 * using helper threads.  Prior to the iocb being assigned to a worker
+	 * thread, we need to record that a cancellation has occurred.  We
+	 * can do this by having a minimal helper function that is recorded in
+	 * ki_cancel.
+	 */
+	kiocb_set_cancel_fn(&iocb->common, aio_thread_queue_iocb_cancel_early);
+	queue_work(system_long_wq, &iocb->ki_work);
+	return -EIOCBQUEUED;
+}
+
+static long aio_thread_op_read_iter(struct aio_kiocb *iocb)
+{
+	struct file *filp;
+	long ret;
+
+	use_mm(iocb->ki_ctx->mm);
+	filp = iocb->common.ki_filp;
+
+	if (filp->f_op->read_iter) {
+		struct kiocb sync_kiocb;
+
+		init_sync_kiocb(&sync_kiocb, filp);
+		sync_kiocb.ki_pos = iocb->common.ki_pos;
+		ret = filp->f_op->read_iter(&sync_kiocb, &iocb->ki_iter);
+	} else if (filp->f_op->read)
+		ret = do_loop_readv_writev(filp, &iocb->ki_iter,
+					   &iocb->common.ki_pos,
+					   filp->f_op->read);
+	else
+		ret = -EINVAL;
+	unuse_mm(iocb->ki_ctx->mm);
+	return ret;
+}
+
+ssize_t generic_async_read_iter_non_direct(struct kiocb *iocb,
+					   struct iov_iter *iter)
+{
+	if ((iocb->ki_flags & IOCB_DIRECT) ||
+	    (iocb->ki_complete != aio_complete))
+		return iocb->ki_filp->f_op->read_iter(iocb, iter);
+	return generic_async_read_iter(iocb, iter);
+}
+EXPORT_SYMBOL(generic_async_read_iter_non_direct);
+
+ssize_t generic_async_read_iter(struct kiocb *iocb, struct iov_iter *iter)
+{
+	struct aio_kiocb *req;
+
+	req = container_of(iocb, struct aio_kiocb, common);
+	if (iter != &req->ki_iter)
+		return -EINVAL;
+
+	return aio_thread_queue_iocb(req, aio_thread_op_read_iter,
+				     AIO_THREAD_NEED_TASK);
+}
+EXPORT_SYMBOL(generic_async_read_iter);
+
+static long aio_thread_op_write_iter(struct aio_kiocb *iocb)
+{
+	u64 saved_rlim_fsize;
+	struct file *filp;
+	long ret;
+
+	use_mm(iocb->ki_ctx->mm);
+	filp = iocb->common.ki_filp;
+	saved_rlim_fsize = rlimit(RLIMIT_FSIZE);
+	current->signal->rlim[RLIMIT_FSIZE].rlim_cur = iocb->ki_rlimit_fsize;
+
+	if (filp->f_op->write_iter) {
+		struct kiocb sync_kiocb;
+
+		init_sync_kiocb(&sync_kiocb, filp);
+		sync_kiocb.ki_pos = iocb->common.ki_pos;
+		ret = filp->f_op->write_iter(&sync_kiocb, &iocb->ki_iter);
+	} else if (filp->f_op->write)
+		ret = do_loop_readv_writev(filp, &iocb->ki_iter,
+					   &iocb->common.ki_pos,
+					   (io_fn_t)filp->f_op->write);
+	else
+		ret = -EINVAL;
+	current->signal->rlim[RLIMIT_FSIZE].rlim_cur = saved_rlim_fsize;
+	unuse_mm(iocb->ki_ctx->mm);
+	return ret;
+}
+
+ssize_t generic_async_write_iter_non_direct(struct kiocb *iocb,
+					    struct iov_iter *iter)
+{
+	if ((iocb->ki_flags & IOCB_DIRECT) ||
+	    (iocb->ki_complete != aio_complete))
+		return iocb->ki_filp->f_op->write_iter(iocb, iter);
+	return generic_async_write_iter(iocb, iter);
+}
+EXPORT_SYMBOL(generic_async_write_iter_non_direct);
+
+ssize_t generic_async_write_iter(struct kiocb *iocb, struct iov_iter *iter)
+{
+	struct aio_kiocb *req;
+
+	req = container_of(iocb, struct aio_kiocb, common);
+	if (iter != &req->ki_iter)
+		return -EINVAL;
+	req->ki_rlimit_fsize = rlimit(RLIMIT_FSIZE);
+
+	return aio_thread_queue_iocb(req, aio_thread_op_write_iter,
+				     AIO_THREAD_NEED_TASK);
+}
+EXPORT_SYMBOL(generic_async_write_iter);
+#endif /* IS_ENABLED(CONFIG_AIO_THREAD) */
+
 /*
  * aio_run_iocb:
  *	Performs the initial checks and io submission.
@@ -1454,6 +1674,14 @@ static ssize_t aio_run_iocb(struct aio_kiocb *req, unsigned opcode,
 	case IOCB_CMD_PREADV:
 		mode	= FMODE_READ;
 		rw	= READ;
+		iter_op	= file->f_op->async_read_iter;
+		if (iter_op)
+			goto rw_common;
+		if ((aio_may_use_threads()) &&
+		    (file->f_op->read_iter || file->f_op->read)) {
+			iter_op = generic_async_read_iter;
+			goto rw_common;
+		}
 		iter_op	= file->f_op->read_iter;
 		goto rw_common;
 
@@ -1461,6 +1689,14 @@ static ssize_t aio_run_iocb(struct aio_kiocb *req, unsigned opcode,
 	case IOCB_CMD_PWRITEV:
 		mode	= FMODE_WRITE;
 		rw	= WRITE;
+		iter_op	= file->f_op->async_write_iter;
+		if (iter_op)
+			goto rw_common;
+		if ((aio_may_use_threads()) &&
+		    (file->f_op->write_iter || file->f_op->write)) {
+			iter_op = generic_async_write_iter;
+			goto rw_common;
+		}
 		iter_op	= file->f_op->write_iter;
 		goto rw_common;
 rw_common:
diff --git a/include/linux/aio.h b/include/linux/aio.h
index 9a62e8a..7486f19 100644
--- a/include/linux/aio.h
+++ b/include/linux/aio.h
@@ -19,6 +19,9 @@ extern long do_io_submit(aio_context_t ctx_id, long nr,
 void kiocb_set_cancel_fn(struct kiocb *req, kiocb_cancel_fn *cancel);
 struct mm_struct *aio_get_mm(struct kiocb *req);
 struct task_struct *aio_get_task(struct kiocb *req);
+struct iov_iter;
+ssize_t generic_async_read_iter(struct kiocb *iocb, struct iov_iter *iter);
+ssize_t generic_async_write_iter(struct kiocb *iocb, struct iov_iter *iter);
 #else
 static inline void exit_aio(struct mm_struct *mm) { }
 static inline long do_io_submit(aio_context_t ctx_id, long nr,
@@ -34,5 +37,6 @@ static inline struct task_struct *aio_get_task(struct kiocb *req)
 /* for sysctl: */
 extern unsigned long aio_nr;
 extern unsigned long aio_max_nr;
+extern unsigned long aio_auto_threads;
 
 #endif /* __LINUX__AIO_H */
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 3aa5142..b3dc406 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -1604,6 +1604,8 @@ struct file_operations {
 	ssize_t (*write) (struct file *, const char __user *, size_t, loff_t *);
 	ssize_t (*read_iter) (struct kiocb *, struct iov_iter *);
 	ssize_t (*write_iter) (struct kiocb *, struct iov_iter *);
+	ssize_t (*async_read_iter) (struct kiocb *, struct iov_iter *);
+	ssize_t (*async_write_iter) (struct kiocb *, struct iov_iter *);
 	int (*iterate) (struct file *, struct dir_context *);
 	unsigned int (*poll) (struct file *, struct poll_table_struct *);
 	long (*unlocked_ioctl) (struct file *, unsigned int, unsigned long);
diff --git a/init/Kconfig b/init/Kconfig
index 235c7a2..33fb8b2 100644
--- a/init/Kconfig
+++ b/init/Kconfig
@@ -1575,6 +1575,19 @@ config AIO
 	  by some high performance threaded applications. Disabling
 	  this option saves about 7k.
 
+config AIO_THREAD
+	bool "Support kernel thread based AIO" if EXPERT
+	depends on AIO
+	default y
+	help
+	   This option enables using kernel thread based AIO which implements
+	   asynchronous operations using the kernel's queue_work() mechanism.
+	   The automatic use of threads for async operations is currently
+	   disabled by default until the security implications for usage
+	   are completely understood.  If this option is enabled, the
+	   functionality can be activated at runtime by setting the
+	   fs.aio-auto-threads sysctl to one.
+
 config ADVISE_SYSCALLS
 	bool "Enable madvise/fadvise syscalls" if EXPERT
 	default y
diff --git a/kernel/sysctl.c b/kernel/sysctl.c
index dc6858d..b5e3977 100644
--- a/kernel/sysctl.c
+++ b/kernel/sysctl.c
@@ -1677,6 +1677,15 @@ static struct ctl_table fs_table[] = {
 		.mode		= 0644,
 		.proc_handler	= proc_doulongvec_minmax,
 	},
+#if IS_ENABLED(CONFIG_AIO_THREAD)
+	{
+		.procname	= "aio-auto-threads",
+		.data		= &aio_auto_threads,
+		.maxlen		= sizeof(aio_auto_threads),
+		.mode		= 0644,
+		.proc_handler	= proc_doulongvec_minmax,
+	},
+#endif
 #endif /* CONFIG_AIO */
 #ifdef CONFIG_INOTIFY_USER
 	{
-- 
2.5.0


-- 
"Thought is the essence of where you are now."

^ permalink raw reply related	[flat|nested] 133+ messages in thread


* [PATCH 06/13] aio: add queue_work() based threaded aio support
@ 2016-01-11 22:07   ` Benjamin LaHaise
  0 siblings, 0 replies; 133+ messages in thread
From: Benjamin LaHaise @ 2016-01-11 22:07 UTC (permalink / raw)
  To: linux-aio, linux-fsdevel, linux-kernel, linux-api, linux-mm
  Cc: Alexander Viro, Andrew Morton, Linus Torvalds

Add support for performing asynchronous reads and writes via kernel
threads by way of the queue_work() functionality.  This enables fully
asynchronous and cancellable reads and writes for any file or device in
the kernel.  Cancellation is implemented by sending a SIGKILL to the
kernel thread executing the async operation.  So long as the read or
write operation can be interrupted by signals, the AIO request can be
cancelled.

This functionality is currently disabled by default until the DoS
implications of having user controlled kernel thread creation are fully
understood.  When compiled into the kernel, this functionality can be
enabled by setting the fs.aio-auto-threads sysctl to 1.  It is expected
that the feature will be enabled by default in a future kernel version.

Signed-off-by: Benjamin LaHaise <ben.lahaise@solacesystems.com>
Signed-off-by: Benjamin LaHaise <bcrl@kvack.org>
---
 fs/aio.c            | 236 ++++++++++++++++++++++++++++++++++++++++++++++++++++
 include/linux/aio.h |   4 +
 include/linux/fs.h  |   2 +
 init/Kconfig        |  13 +++
 kernel/sysctl.c     |   9 ++
 5 files changed, 264 insertions(+)

diff --git a/fs/aio.c b/fs/aio.c
index 55c8ff5..88af450 100644
--- a/fs/aio.c
+++ b/fs/aio.c
@@ -48,6 +48,7 @@
 
 #define AIO_RING_MAGIC			0xa10a10a1
 #define AIO_RING_COMPAT_FEATURES	1
+#define AIO_RING_COMPAT_THREADED	2
 #define AIO_RING_INCOMPAT_FEATURES	0
 struct aio_ring {
 	unsigned	id;	/* kernel internal index number */
@@ -157,6 +158,9 @@ struct kioctx {
 	struct mm_struct	*mm;
 };
 
+struct aio_kiocb;
+typedef long (*aio_thread_work_fn_t)(struct aio_kiocb *iocb);
+
 /*
  * We use ki_cancel == KIOCB_CANCELLED to indicate that a kiocb has been either
  * cancelled or completed (this makes a certain amount of sense because
@@ -194,12 +198,21 @@ struct aio_kiocb {
 
 	/* Fields used for threaded aio helper. */
 	struct task_struct	*ki_submit_task;
+#if IS_ENABLED(CONFIG_AIO_THREAD)
+	struct task_struct	*ki_cancel_task;
+	unsigned long		ki_rlimit_fsize;
+	aio_thread_work_fn_t	ki_work_fn;
+	struct work_struct	ki_work;
+#endif
 };
 
 /*------ sysctl variables----*/
 static DEFINE_SPINLOCK(aio_nr_lock);
 unsigned long aio_nr;		/* current system wide number of aio requests */
 unsigned long aio_max_nr = 0x10000; /* system wide maximum number of aio requests */
+#if IS_ENABLED(CONFIG_AIO_THREAD)
+unsigned long aio_auto_threads;	/* Currently disabled by default */
+#endif
 /*----end sysctl variables---*/
 
 static struct kmem_cache	*kiocb_cachep;
@@ -212,6 +225,15 @@ static const struct address_space_operations aio_ctx_aops;
 
 static void aio_complete(struct kiocb *kiocb, long res, long res2);
 
+static bool aio_may_use_threads(void)
+{
+#if IS_ENABLED(CONFIG_AIO_THREAD)
+	return !!(aio_auto_threads & 1);
+#else
+	return false;
+#endif
+}
+
 static struct file *aio_private_file(struct kioctx *ctx, loff_t nr_pages)
 {
 	struct qstr this = QSTR_INIT("[aio]", 5);
@@ -528,6 +550,8 @@ static int aio_setup_ring(struct kioctx *ctx)
 	ring->head = ring->tail = 0;
 	ring->magic = AIO_RING_MAGIC;
 	ring->compat_features = AIO_RING_COMPAT_FEATURES;
+	if (aio_may_use_threads())
+		ring->compat_features |= AIO_RING_COMPAT_THREADED;
 	ring->incompat_features = AIO_RING_INCOMPAT_FEATURES;
 	ring->header_length = sizeof(struct aio_ring);
 	kunmap_atomic(ring);
@@ -1436,6 +1460,202 @@ static int aio_setup_vectored_rw(int rw, char __user *buf, size_t len,
 				len, UIO_FASTIOV, iovec, iter);
 }
 
+#if IS_ENABLED(CONFIG_AIO_THREAD)
+/* aio_thread_queue_iocb_cancel_early:
+ *	Early stage cancellation helper function for threaded aios.  This
+ *	is used prior to the iocb being assigned to a worker thread.
+ */
+static int aio_thread_queue_iocb_cancel_early(struct kiocb *iocb)
+{
+	return 0;
+}
+
+/* aio_thread_queue_iocb_cancel:
+ *	Late stage cancellation method for threaded aios.  Once an iocb is
+ *	assigned to a worker thread, we use a fatal signal to interrupt an
+ *	in-progress operation.
+ */
+static int aio_thread_queue_iocb_cancel(struct kiocb *kiocb)
+{
+	struct aio_kiocb *iocb = container_of(kiocb, struct aio_kiocb, common);
+
+	if (iocb->ki_cancel_task) {
+		force_sig(SIGKILL, iocb->ki_cancel_task);
+		return 0;
+	}
+	return -EAGAIN;
+}
+
+/* aio_thread_fn:
+ *	Entry point for worker to perform threaded aio.  Handles issues
+ *	arising due to cancellation using signals.
+ */
+static void aio_thread_fn(struct work_struct *work)
+{
+	struct aio_kiocb *iocb = container_of(work, struct aio_kiocb, ki_work);
+	kiocb_cancel_fn *old_cancel;
+	long ret;
+
+	iocb->ki_cancel_task = current;
+	current->kiocb = &iocb->common;		/* For io_send_sig(). */
+	WARN_ON(atomic_read(&current->signal->sigcnt) != 1);
+
+	/* Check for early stage cancellation and switch to late stage
+	 * cancellation if it has not already occurred.
+	 */
+	old_cancel = cmpxchg(&iocb->ki_cancel,
+			     aio_thread_queue_iocb_cancel_early,
+			     aio_thread_queue_iocb_cancel);
+	if (old_cancel != KIOCB_CANCELLED)
+		ret = iocb->ki_work_fn(iocb);
+	else
+		ret = -EINTR;
+
+	current->kiocb = NULL;
+	if (unlikely(ret == -ERESTARTSYS || ret == -ERESTARTNOINTR ||
+		     ret == -ERESTARTNOHAND || ret == -ERESTART_RESTARTBLOCK))
+		ret = -EINTR;
+
+	/* Completion serializes cancellation by taking ctx_lock, so
+	 * aio_complete() will not return until after force_sig() in
+	 * aio_thread_queue_iocb_cancel().  This should ensure that
+	 * the signal is pending before being flushed in this thread.
+	 */
+	aio_complete(&iocb->common, ret, 0);
+	if (fatal_signal_pending(current))
+		flush_signals(current);
+}
+
+#define AIO_THREAD_NEED_TASK	0x0001	/* Need aio_kiocb->ki_submit_task */
+
+/* aio_thread_queue_iocb
+ *	Queues an aio_kiocb for dispatch to a worker thread.  Prepares the
+ *	aio_kiocb for cancellation.  The caller must provide a function to
+ *	execute the operation in work_fn.  The flags may be provided as an
+ *	ored set AIO_THREAD_xxx.
+ */
+static ssize_t aio_thread_queue_iocb(struct aio_kiocb *iocb,
+				     aio_thread_work_fn_t work_fn,
+				     unsigned flags)
+{
+	INIT_WORK(&iocb->ki_work, aio_thread_fn);
+	iocb->ki_work_fn = work_fn;
+	if (flags & AIO_THREAD_NEED_TASK) {
+		iocb->ki_submit_task = current;
+		get_task_struct(iocb->ki_submit_task);
+	}
+
+	/* Cancellation needs to be always available for operations performed
+	 * using helper threads.  Prior to the iocb being assigned to a worker
+	 * thread, we need to record that a cancellation has occurred.  We
+	 * can do this by having a minimal helper function that is recorded in
+	 * ki_cancel.
+	 */
+	kiocb_set_cancel_fn(&iocb->common, aio_thread_queue_iocb_cancel_early);
+	queue_work(system_long_wq, &iocb->ki_work);
+	return -EIOCBQUEUED;
+}
+
+static long aio_thread_op_read_iter(struct aio_kiocb *iocb)
+{
+	struct file *filp;
+	long ret;
+
+	use_mm(iocb->ki_ctx->mm);
+	filp = iocb->common.ki_filp;
+
+	if (filp->f_op->read_iter) {
+		struct kiocb sync_kiocb;
+
+		init_sync_kiocb(&sync_kiocb, filp);
+		sync_kiocb.ki_pos = iocb->common.ki_pos;
+		ret = filp->f_op->read_iter(&sync_kiocb, &iocb->ki_iter);
+	} else if (filp->f_op->read)
+		ret = do_loop_readv_writev(filp, &iocb->ki_iter,
+					   &iocb->common.ki_pos,
+					   filp->f_op->read);
+	else
+		ret = -EINVAL;
+	unuse_mm(iocb->ki_ctx->mm);
+	return ret;
+}
+
+ssize_t generic_async_read_iter_non_direct(struct kiocb *iocb,
+					   struct iov_iter *iter)
+{
+	if ((iocb->ki_flags & IOCB_DIRECT) ||
+	    (iocb->ki_complete != aio_complete))
+		return iocb->ki_filp->f_op->read_iter(iocb, iter);
+	return generic_async_read_iter(iocb, iter);
+}
+EXPORT_SYMBOL(generic_async_read_iter_non_direct);
+
+ssize_t generic_async_read_iter(struct kiocb *iocb, struct iov_iter *iter)
+{
+	struct aio_kiocb *req;
+
+	req = container_of(iocb, struct aio_kiocb, common);
+	if (iter != &req->ki_iter)
+		return -EINVAL;
+
+	return aio_thread_queue_iocb(req, aio_thread_op_read_iter,
+				     AIO_THREAD_NEED_TASK);
+}
+EXPORT_SYMBOL(generic_async_read_iter);
+
+static long aio_thread_op_write_iter(struct aio_kiocb *iocb)
+{
+	u64 saved_rlim_fsize;
+	struct file *filp;
+	long ret;
+
+	use_mm(iocb->ki_ctx->mm);
+	filp = iocb->common.ki_filp;
+	saved_rlim_fsize = rlimit(RLIMIT_FSIZE);
+	current->signal->rlim[RLIMIT_FSIZE].rlim_cur = iocb->ki_rlimit_fsize;
+
+	if (filp->f_op->write_iter) {
+		struct kiocb sync_kiocb;
+
+		init_sync_kiocb(&sync_kiocb, filp);
+		sync_kiocb.ki_pos = iocb->common.ki_pos;
+		ret = filp->f_op->write_iter(&sync_kiocb, &iocb->ki_iter);
+	} else if (filp->f_op->write)
+		ret = do_loop_readv_writev(filp, &iocb->ki_iter,
+					   &iocb->common.ki_pos,
+					   (io_fn_t)filp->f_op->write);
+	else
+		ret = -EINVAL;
+	current->signal->rlim[RLIMIT_FSIZE].rlim_cur = saved_rlim_fsize;
+	unuse_mm(iocb->ki_ctx->mm);
+	return ret;
+}
+
+ssize_t generic_async_write_iter_non_direct(struct kiocb *iocb,
+					    struct iov_iter *iter)
+{
+	if ((iocb->ki_flags & IOCB_DIRECT) ||
+	    (iocb->ki_complete != aio_complete))
+		return iocb->ki_filp->f_op->write_iter(iocb, iter);
+	return generic_async_write_iter(iocb, iter);
+}
+EXPORT_SYMBOL(generic_async_write_iter_non_direct);
+
+ssize_t generic_async_write_iter(struct kiocb *iocb, struct iov_iter *iter)
+{
+	struct aio_kiocb *req;
+
+	req = container_of(iocb, struct aio_kiocb, common);
+	if (iter != &req->ki_iter)
+		return -EINVAL;
+	req->ki_rlimit_fsize = rlimit(RLIMIT_FSIZE);
+
+	return aio_thread_queue_iocb(req, aio_thread_op_write_iter,
+				     AIO_THREAD_NEED_TASK);
+}
+EXPORT_SYMBOL(generic_async_write_iter);
+#endif /* IS_ENABLED(CONFIG_AIO_THREAD) */
+
 /*
  * aio_run_iocb:
  *	Performs the initial checks and io submission.
@@ -1454,6 +1674,14 @@ static ssize_t aio_run_iocb(struct aio_kiocb *req, unsigned opcode,
 	case IOCB_CMD_PREADV:
 		mode	= FMODE_READ;
 		rw	= READ;
+		iter_op	= file->f_op->async_read_iter;
+		if (iter_op)
+			goto rw_common;
+		if ((aio_may_use_threads()) &&
+		    (file->f_op->read_iter || file->f_op->read)) {
+			iter_op = generic_async_read_iter;
+			goto rw_common;
+		}
 		iter_op	= file->f_op->read_iter;
 		goto rw_common;
 
@@ -1461,6 +1689,14 @@ static ssize_t aio_run_iocb(struct aio_kiocb *req, unsigned opcode,
 	case IOCB_CMD_PWRITEV:
 		mode	= FMODE_WRITE;
 		rw	= WRITE;
+		iter_op	= file->f_op->async_write_iter;
+		if (iter_op)
+			goto rw_common;
+		if ((aio_may_use_threads()) &&
+		    (file->f_op->write_iter || file->f_op->write)) {
+			iter_op = generic_async_write_iter;
+			goto rw_common;
+		}
 		iter_op	= file->f_op->write_iter;
 		goto rw_common;
 rw_common:
diff --git a/include/linux/aio.h b/include/linux/aio.h
index 9a62e8a..7486f19 100644
--- a/include/linux/aio.h
+++ b/include/linux/aio.h
@@ -19,6 +19,9 @@ extern long do_io_submit(aio_context_t ctx_id, long nr,
 void kiocb_set_cancel_fn(struct kiocb *req, kiocb_cancel_fn *cancel);
 struct mm_struct *aio_get_mm(struct kiocb *req);
 struct task_struct *aio_get_task(struct kiocb *req);
+struct iov_iter;
+ssize_t generic_async_read_iter(struct kiocb *iocb, struct iov_iter *iter);
+ssize_t generic_async_write_iter(struct kiocb *iocb, struct iov_iter *iter);
 #else
 static inline void exit_aio(struct mm_struct *mm) { }
 static inline long do_io_submit(aio_context_t ctx_id, long nr,
@@ -34,5 +37,6 @@ static inline struct task_struct *aio_get_task(struct kiocb *req)
 /* for sysctl: */
 extern unsigned long aio_nr;
 extern unsigned long aio_max_nr;
+extern unsigned long aio_auto_threads;
 
 #endif /* __LINUX__AIO_H */
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 3aa5142..b3dc406 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -1604,6 +1604,8 @@ struct file_operations {
 	ssize_t (*write) (struct file *, const char __user *, size_t, loff_t *);
 	ssize_t (*read_iter) (struct kiocb *, struct iov_iter *);
 	ssize_t (*write_iter) (struct kiocb *, struct iov_iter *);
+	ssize_t (*async_read_iter) (struct kiocb *, struct iov_iter *);
+	ssize_t (*async_write_iter) (struct kiocb *, struct iov_iter *);
 	int (*iterate) (struct file *, struct dir_context *);
 	unsigned int (*poll) (struct file *, struct poll_table_struct *);
 	long (*unlocked_ioctl) (struct file *, unsigned int, unsigned long);
diff --git a/init/Kconfig b/init/Kconfig
index 235c7a2..33fb8b2 100644
--- a/init/Kconfig
+++ b/init/Kconfig
@@ -1575,6 +1575,19 @@ config AIO
 	  by some high performance threaded applications. Disabling
 	  this option saves about 7k.
 
+config AIO_THREAD
+	bool "Support kernel thread based AIO" if EXPERT
+	depends on AIO
+	default y
+	help
+	   This option enables using kernel thread based AIO which implements
+	   asynchronous operations using the kernel's queue_work() mechanism.
+	   The automatic use of threads for async operations is currently
+	   disabled by default until its security implications are fully
+	   understood.  When this option is enabled, the functionality can
+	   be turned on at runtime by setting the fs.aio-auto-threads sysctl
+	   to one.
+
 config ADVISE_SYSCALLS
 	bool "Enable madvise/fadvise syscalls" if EXPERT
 	default y
diff --git a/kernel/sysctl.c b/kernel/sysctl.c
index dc6858d..b5e3977 100644
--- a/kernel/sysctl.c
+++ b/kernel/sysctl.c
@@ -1677,6 +1677,15 @@ static struct ctl_table fs_table[] = {
 		.mode		= 0644,
 		.proc_handler	= proc_doulongvec_minmax,
 	},
+#if IS_ENABLED(CONFIG_AIO_THREAD)
+	{
+		.procname	= "aio-auto-threads",
+		.data		= &aio_auto_threads,
+		.maxlen		= sizeof(aio_auto_threads),
+		.mode		= 0644,
+		.proc_handler	= proc_doulongvec_minmax,
+	},
+#endif
 #endif /* CONFIG_AIO */
 #ifdef CONFIG_INOTIFY_USER
 	{
-- 
2.5.0


-- 
"Thought is the essence of where you are now."

--
To unsubscribe, send a message with 'unsubscribe linux-aio' in
the body to majordomo@kvack.org.  For more info on Linux AIO,
see: http://www.kvack.org/aio/
Don't email: <a href="mailto:aart@kvack.org">aart@kvack.org</a>

^ permalink raw reply related	[flat|nested] 133+ messages in thread

* [PATCH 07/13] aio: enabled thread based async fsync
  2016-01-11 22:06 ` Benjamin LaHaise
@ 2016-01-11 22:07   ` Benjamin LaHaise
  -1 siblings, 0 replies; 133+ messages in thread
From: Benjamin LaHaise @ 2016-01-11 22:07 UTC (permalink / raw)
  To: linux-aio, linux-fsdevel, linux-kernel, linux-api, linux-mm
  Cc: Alexander Viro, Andrew Morton, Linus Torvalds

Enable fully asynchronous fsync and fdatasync operations in aio using
the aio thread queuing mechanism.

Signed-off-by: Benjamin LaHaise <ben.lahaise@solacesystems.com>
Signed-off-by: Benjamin LaHaise <bcrl@kvack.org>
---
 fs/aio.c | 41 +++++++++++++++++++++++++++++++----------
 1 file changed, 31 insertions(+), 10 deletions(-)

diff --git a/fs/aio.c b/fs/aio.c
index 88af450..576b780 100644
--- a/fs/aio.c
+++ b/fs/aio.c
@@ -224,8 +224,9 @@ static const struct file_operations aio_ring_fops;
 static const struct address_space_operations aio_ctx_aops;
 
 static void aio_complete(struct kiocb *kiocb, long res, long res2);
+ssize_t aio_fsync(struct kiocb *iocb, int datasync);
 
-static bool aio_may_use_threads(void)
+static __always_inline bool aio_may_use_threads(void)
 {
 #if IS_ENABLED(CONFIG_AIO_THREAD)
 	return !!(aio_auto_threads & 1);
@@ -1654,6 +1655,26 @@ ssize_t generic_async_write_iter(struct kiocb *iocb, struct iov_iter *iter)
 				     AIO_THREAD_NEED_TASK);
 }
 EXPORT_SYMBOL(generic_async_write_iter);
+
+static long aio_thread_op_fsync(struct aio_kiocb *iocb)
+{
+	return vfs_fsync(iocb->common.ki_filp, 0);
+}
+
+static long aio_thread_op_fdatasync(struct aio_kiocb *iocb)
+{
+	return vfs_fsync(iocb->common.ki_filp, 1);
+}
+
+ssize_t aio_fsync(struct kiocb *iocb, int datasync)
+{
+	struct aio_kiocb *req;
+
+	req = container_of(iocb, struct aio_kiocb, common);
+
+	return aio_thread_queue_iocb(req, datasync ? aio_thread_op_fdatasync
+						   : aio_thread_op_fsync, 0);
+}
 #endif /* IS_ENABLED(CONFIG_AIO_THREAD) */
 
 /*
@@ -1664,7 +1685,7 @@ static ssize_t aio_run_iocb(struct aio_kiocb *req, unsigned opcode,
 			    char __user *buf, size_t len, bool compat)
 {
 	struct file *file = req->common.ki_filp;
-	ssize_t ret;
+	ssize_t ret = -EINVAL;
 	int rw;
 	fmode_t mode;
 	rw_iter_op *iter_op;
@@ -1730,17 +1751,17 @@ rw_common:
 		break;
 
 	case IOCB_CMD_FDSYNC:
-		if (!file->f_op->aio_fsync)
-			return -EINVAL;
-
-		ret = file->f_op->aio_fsync(&req->common, 1);
+		if (file->f_op->aio_fsync)
+			ret = file->f_op->aio_fsync(&req->common, 1);
+		else if (file->f_op->fsync && (aio_may_use_threads()))
+			ret = aio_fsync(&req->common, 1);
 		break;
 
 	case IOCB_CMD_FSYNC:
-		if (!file->f_op->aio_fsync)
-			return -EINVAL;
-
-		ret = file->f_op->aio_fsync(&req->common, 0);
+		if (file->f_op->aio_fsync)
+			ret = file->f_op->aio_fsync(&req->common, 0);
+		else if (file->f_op->fsync && (aio_may_use_threads()))
+			ret = aio_fsync(&req->common, 0);
 		break;
 
 	default:
-- 
2.5.0


-- 
"Thought is the essence of where you are now."

^ permalink raw reply related	[flat|nested] 133+ messages in thread

* [PATCH 08/13] aio: add support for aio poll via aio thread helper
  2016-01-11 22:06 ` Benjamin LaHaise
@ 2016-01-11 22:07   ` Benjamin LaHaise
  -1 siblings, 0 replies; 133+ messages in thread
From: Benjamin LaHaise @ 2016-01-11 22:07 UTC (permalink / raw)
  To: linux-aio, linux-fsdevel, linux-kernel, linux-api, linux-mm
  Cc: Alexander Viro, Andrew Morton, Linus Torvalds

Applications that require a unified event loop occasionally need to
interface with libraries or other code that expect notification via poll
when a file descriptor becomes ready for read or write.  Add support for
the aio poll operation to enable these use cases by way of the thread
based aio helpers.

Signed-off-by: Benjamin LaHaise <ben.lahaise@solacesystems.com>
Signed-off-by: Benjamin LaHaise <bcrl@kvack.org>
---
 fs/aio.c                     | 46 ++++++++++++++++++++++++++++++++++++++++++++
 include/uapi/linux/aio_abi.h |  2 +-
 2 files changed, 47 insertions(+), 1 deletion(-)

diff --git a/fs/aio.c b/fs/aio.c
index 576b780..4384df4 100644
--- a/fs/aio.c
+++ b/fs/aio.c
@@ -200,6 +200,7 @@ struct aio_kiocb {
 	struct task_struct	*ki_submit_task;
 #if IS_ENABLED(CONFIG_AIO_THREAD)
 	struct task_struct	*ki_cancel_task;
+	unsigned long		ki_data;
 	unsigned long		ki_rlimit_fsize;
 	aio_thread_work_fn_t	ki_work_fn;
 	struct work_struct	ki_work;
@@ -225,6 +226,7 @@ static const struct address_space_operations aio_ctx_aops;
 
 static void aio_complete(struct kiocb *kiocb, long res, long res2);
 ssize_t aio_fsync(struct kiocb *iocb, int datasync);
+long aio_poll(struct aio_kiocb *iocb);
 
 static __always_inline bool aio_may_use_threads(void)
 {
@@ -1675,6 +1677,45 @@ ssize_t aio_fsync(struct kiocb *iocb, int datasync)
 	return aio_thread_queue_iocb(req, datasync ? aio_thread_op_fdatasync
 						   : aio_thread_op_fsync, 0);
 }
+
+static long aio_thread_op_poll(struct aio_kiocb *iocb)
+{
+	struct file *file = iocb->common.ki_filp;
+	short events = iocb->ki_data;
+	struct poll_wqueues table;
+	unsigned int mask;
+	ssize_t ret = 0;
+
+	poll_initwait(&table);
+	events |= POLLERR | POLLHUP;
+
+	for (;;) {
+		mask = DEFAULT_POLLMASK;
+		if (file->f_op && file->f_op->poll) {
+			table.pt._key = events;
+			mask = file->f_op->poll(file, &table.pt);
+		}
+		/* Mask out unneeded events. */
+		mask &= events;
+		ret = mask;
+		if (mask)
+			break;
+
+		ret = -EINTR;
+		if (signal_pending(current))
+			break;
+
+		poll_schedule_timeout(&table, TASK_INTERRUPTIBLE, NULL, 0);
+	}
+
+	poll_freewait(&table);
+	return ret;
+}
+
+long aio_poll(struct aio_kiocb *req)
+{
+	return aio_thread_queue_iocb(req, aio_thread_op_poll, 0);
+}
 #endif /* IS_ENABLED(CONFIG_AIO_THREAD) */
 
 /*
@@ -1764,6 +1805,11 @@ rw_common:
 			ret = aio_fsync(&req->common, 0);
 		break;
 
+	case IOCB_CMD_POLL:
+		if (aio_may_use_threads())
+			ret = aio_poll(req);
+		break;
+
 	default:
 		pr_debug("EINVAL: no operation provided\n");
 		return -EINVAL;
diff --git a/include/uapi/linux/aio_abi.h b/include/uapi/linux/aio_abi.h
index bb2554f..7639fb1 100644
--- a/include/uapi/linux/aio_abi.h
+++ b/include/uapi/linux/aio_abi.h
@@ -39,8 +39,8 @@ enum {
 	IOCB_CMD_FDSYNC = 3,
 	/* These two are experimental.
 	 * IOCB_CMD_PREADX = 4,
-	 * IOCB_CMD_POLL = 5,
 	 */
+	IOCB_CMD_POLL = 5,
 	IOCB_CMD_NOOP = 6,
 	IOCB_CMD_PREADV = 7,
 	IOCB_CMD_PWRITEV = 8,
-- 
2.5.0


-- 
"Thought is the essence of where you are now."

^ permalink raw reply related	[flat|nested] 133+ messages in thread

* [PATCH 09/13] aio: add support for async openat()
  2016-01-11 22:06 ` Benjamin LaHaise
  (?)
@ 2016-01-11 22:07   ` Benjamin LaHaise
  -1 siblings, 0 replies; 133+ messages in thread
From: Benjamin LaHaise @ 2016-01-11 22:07 UTC (permalink / raw)
  To: linux-aio, linux-fsdevel, linux-kernel, linux-api, linux-mm
  Cc: Alexander Viro, Andrew Morton, Linus Torvalds

Another blocking operation used by applications that want aio
functionality is opening files that are not resident in memory.
Using the thread-based aio helper, add support for IOCB_CMD_OPENAT.

Signed-off-by: Benjamin LaHaise <ben.lahaise@solacesystems.com>
Signed-off-by: Benjamin LaHaise <bcrl@kvack.org>
---
 fs/aio.c                     | 120 +++++++++++++++++++++++++++++++++++++------
 include/uapi/linux/aio_abi.h |   2 +
 2 files changed, 107 insertions(+), 15 deletions(-)

diff --git a/fs/aio.c b/fs/aio.c
index 4384df4..346786b 100644
--- a/fs/aio.c
+++ b/fs/aio.c
@@ -40,6 +40,8 @@
 #include <linux/ramfs.h>
 #include <linux/percpu-refcount.h>
 #include <linux/mount.h>
+#include <linux/fdtable.h>
+#include <linux/fs_struct.h>
 
 #include <asm/kmap_types.h>
 #include <asm/uaccess.h>
@@ -204,6 +206,9 @@ struct aio_kiocb {
 	unsigned long		ki_rlimit_fsize;
 	aio_thread_work_fn_t	ki_work_fn;
 	struct work_struct	ki_work;
+	struct fs_struct	*ki_fs;
+	struct files_struct	*ki_files;
+	const struct cred	*ki_cred;
 #endif
 };
 
@@ -227,6 +232,7 @@ static const struct address_space_operations aio_ctx_aops;
 static void aio_complete(struct kiocb *kiocb, long res, long res2);
 ssize_t aio_fsync(struct kiocb *iocb, int datasync);
 long aio_poll(struct aio_kiocb *iocb);
+long aio_openat(struct aio_kiocb *req);
 
 static __always_inline bool aio_may_use_threads(void)
 {
@@ -1496,6 +1502,9 @@ static int aio_thread_queue_iocb_cancel(struct kiocb *kiocb)
 static void aio_thread_fn(struct work_struct *work)
 {
 	struct aio_kiocb *iocb = container_of(work, struct aio_kiocb, ki_work);
+	struct files_struct *old_files = current->files;
+	const struct cred *old_cred = current_cred();
+	struct fs_struct *old_fs = current->fs;
 	kiocb_cancel_fn *old_cancel;
 	long ret;
 
@@ -1503,6 +1512,13 @@ static void aio_thread_fn(struct work_struct *work)
 	current->kiocb = &iocb->common;		/* For io_send_sig(). */
 	WARN_ON(atomic_read(&current->signal->sigcnt) != 1);
 
+	if (iocb->ki_fs)
+		current->fs = iocb->ki_fs;
+	if (iocb->ki_files)
+		current->files = iocb->ki_files;
+	if (iocb->ki_cred)
+		current->cred = iocb->ki_cred;
+
 	/* Check for early stage cancellation and switch to late stage
 	 * cancellation if it has not already occurred.
 	 */
@@ -1519,6 +1535,19 @@ static void aio_thread_fn(struct work_struct *work)
 		     ret == -ERESTARTNOHAND || ret == -ERESTART_RESTARTBLOCK))
 		ret = -EINTR;
 
+	if (iocb->ki_cred) {
+		current->cred = old_cred;
+		put_cred(iocb->ki_cred);
+	}
+	if (iocb->ki_files) {
+		current->files = old_files;
+		put_files_struct(iocb->ki_files);
+	}
+	if (iocb->ki_fs) {
+		exit_fs(current);
+		current->fs = old_fs;
+	}
+
 	/* Completion serializes cancellation by taking ctx_lock, so
 	 * aio_complete() will not return until after force_sig() in
 	 * aio_thread_queue_iocb_cancel().  This should ensure that
@@ -1530,6 +1559,9 @@ static void aio_thread_fn(struct work_struct *work)
 }
 
 #define AIO_THREAD_NEED_TASK	0x0001	/* Need aio_kiocb->ki_submit_task */
+#define AIO_THREAD_NEED_FS	0x0002	/* Need aio_kiocb->ki_fs */
+#define AIO_THREAD_NEED_FILES	0x0004	/* Need aio_kiocb->ki_files */
+#define AIO_THREAD_NEED_CRED	0x0008	/* Need aio_kiocb->ki_cred */
 
 /* aio_thread_queue_iocb
  *	Queues an aio_kiocb for dispatch to a worker thread.  Prepares the
@@ -1547,6 +1579,20 @@ static ssize_t aio_thread_queue_iocb(struct aio_kiocb *iocb,
 		iocb->ki_submit_task = current;
 		get_task_struct(iocb->ki_submit_task);
 	}
+	if (flags & AIO_THREAD_NEED_FS) {
+		struct fs_struct *fs = current->fs;
+
+		iocb->ki_fs = fs;
+		spin_lock(&fs->lock);
+		fs->users++;
+		spin_unlock(&fs->lock);
+	}
+	if (flags & AIO_THREAD_NEED_FILES) {
+		iocb->ki_files = current->files;
+		atomic_inc(&iocb->ki_files->count);
+	}
+	if (flags & AIO_THREAD_NEED_CRED)
+		iocb->ki_cred = get_current_cred();
 
 	/* Cancellation needs to be always available for operations performed
 	 * using helper threads.  Prior to the iocb being assigned to a worker
@@ -1716,22 +1762,54 @@ long aio_poll(struct aio_kiocb *req)
 {
 	return aio_thread_queue_iocb(req, aio_thread_op_poll, 0);
 }
+
+static long aio_thread_op_openat(struct aio_kiocb *req)
+{
+	u64 buf, offset;
+	long ret;
+	u32 fd;
+
+	use_mm(req->ki_ctx->mm);
+	if (unlikely(__get_user(fd, &req->ki_user_iocb->aio_fildes)))
+		ret = -EFAULT;
+	else if (unlikely(__get_user(buf, &req->ki_user_iocb->aio_buf)))
+		ret = -EFAULT;
+	else if (unlikely(__get_user(offset, &req->ki_user_iocb->aio_offset)))
+		ret = -EFAULT;
+	else {
+		ret = do_sys_open((s32)fd,
+				  (const char __user *)(long)buf,
+				  (int)offset,
+				  (unsigned short)(offset >> 32));
+	}
+	unuse_mm(req->ki_ctx->mm);
+	return ret;
+}
+
+long aio_openat(struct aio_kiocb *req)
+{
+	return aio_thread_queue_iocb(req, aio_thread_op_openat,
+				     AIO_THREAD_NEED_TASK |
+				     AIO_THREAD_NEED_FILES |
+				     AIO_THREAD_NEED_CRED);
+}
 #endif /* IS_ENABLED(CONFIG_AIO_THREAD) */
 
 /*
  * aio_run_iocb:
  *	Performs the initial checks and io submission.
  */
-static ssize_t aio_run_iocb(struct aio_kiocb *req, unsigned opcode,
-			    char __user *buf, size_t len, bool compat)
+static ssize_t aio_run_iocb(struct aio_kiocb *req, struct iocb *user_iocb,
+			    bool compat)
 {
 	struct file *file = req->common.ki_filp;
 	ssize_t ret = -EINVAL;
+	char __user *buf;
 	int rw;
 	fmode_t mode;
 	rw_iter_op *iter_op;
 
-	switch (opcode) {
+	switch (user_iocb->aio_lio_opcode) {
 	case IOCB_CMD_PREAD:
 	case IOCB_CMD_PREADV:
 		mode	= FMODE_READ;
@@ -1768,12 +1846,17 @@ rw_common:
 		if (!iter_op)
 			return -EINVAL;
 
-		if (opcode == IOCB_CMD_PREADV || opcode == IOCB_CMD_PWRITEV)
-			ret = aio_setup_vectored_rw(rw, buf, len,
+		buf = (char __user *)(unsigned long)user_iocb->aio_buf;
+		if (user_iocb->aio_lio_opcode == IOCB_CMD_PREADV ||
+		    user_iocb->aio_lio_opcode == IOCB_CMD_PWRITEV)
+			ret = aio_setup_vectored_rw(rw, buf,
+						    user_iocb->aio_nbytes,
 						    &req->ki_iovec, compat,
 						    &req->ki_iter);
 		else {
-			ret = import_single_range(rw, buf, len, req->ki_iovec,
+			ret = import_single_range(rw, buf,
+						  user_iocb->aio_nbytes,
+						  req->ki_iovec,
 						  &req->ki_iter);
 		}
 		if (!ret)
@@ -1810,6 +1893,11 @@ rw_common:
 			ret = aio_poll(req);
 		break;
 
+	case IOCB_CMD_OPENAT:
+		if (aio_may_use_threads())
+			ret = aio_openat(req);
+		break;
+
 	default:
 		pr_debug("EINVAL: no operation provided\n");
 		return -EINVAL;
@@ -1856,14 +1944,19 @@ static int io_submit_one(struct kioctx *ctx, struct iocb __user *user_iocb,
 	if (unlikely(!req))
 		return -EAGAIN;
 
-	req->common.ki_filp = fget(iocb->aio_fildes);
-	if (unlikely(!req->common.ki_filp)) {
-		ret = -EBADF;
-		goto out_put_req;
+	if (iocb->aio_lio_opcode == IOCB_CMD_OPENAT)
+		req->common.ki_filp = NULL;
+	else {
+		req->common.ki_filp = fget(iocb->aio_fildes);
+		if (unlikely(!req->common.ki_filp)) {
+			ret = -EBADF;
+			goto out_put_req;
+		}
 	}
 	req->common.ki_pos = iocb->aio_offset;
 	req->common.ki_complete = aio_complete;
-	req->common.ki_flags = iocb_flags(req->common.ki_filp);
+	if (req->common.ki_filp)
+		req->common.ki_flags = iocb_flags(req->common.ki_filp);
 
 	if (iocb->aio_flags & IOCB_FLAG_RESFD) {
 		/*
@@ -1891,10 +1984,7 @@ static int io_submit_one(struct kioctx *ctx, struct iocb __user *user_iocb,
 	req->ki_user_iocb = user_iocb;
 	req->ki_user_data = iocb->aio_data;
 
-	ret = aio_run_iocb(req, iocb->aio_lio_opcode,
-			   (char __user *)(unsigned long)iocb->aio_buf,
-			   iocb->aio_nbytes,
-			   compat);
+	ret = aio_run_iocb(req, iocb, compat);
 	if (ret)
 		goto out_put_req;
 
diff --git a/include/uapi/linux/aio_abi.h b/include/uapi/linux/aio_abi.h
index 7639fb1..0e16988 100644
--- a/include/uapi/linux/aio_abi.h
+++ b/include/uapi/linux/aio_abi.h
@@ -44,6 +44,8 @@ enum {
 	IOCB_CMD_NOOP = 6,
 	IOCB_CMD_PREADV = 7,
 	IOCB_CMD_PWRITEV = 8,
+
+	IOCB_CMD_OPENAT = 9,
 };
 
 /*
-- 
2.5.0


-- 
"Thought is the essence of where you are now."

^ permalink raw reply related	[flat|nested] 133+ messages in thread

* [PATCH 10/13] aio: add async unlinkat functionality
  2016-01-11 22:06 ` Benjamin LaHaise
  (?)
@ 2016-01-11 22:07   ` Benjamin LaHaise
  -1 siblings, 0 replies; 133+ messages in thread
From: Benjamin LaHaise @ 2016-01-11 22:07 UTC (permalink / raw)
  To: linux-aio, linux-fsdevel, linux-kernel, linux-api, linux-mm
  Cc: Alexander Viro, Andrew Morton, Linus Torvalds

Enable asynchronous deletion of files by adding support for an aio
unlinkat operation.

Signed-off-by: Benjamin LaHaise <ben.lahaise@solacesystems.com>
Signed-off-by: Benjamin LaHaise <bcrl@kvack.org>
---
 fs/aio.c                     | 42 +++++++++++++++++++++++++++++++++---------
 fs/namei.c                   |  2 +-
 include/linux/fs.h           |  1 +
 include/uapi/linux/aio_abi.h |  1 +
 4 files changed, 36 insertions(+), 10 deletions(-)

diff --git a/fs/aio.c b/fs/aio.c
index 346786b..3a70492 100644
--- a/fs/aio.c
+++ b/fs/aio.c
@@ -232,7 +232,11 @@ static const struct address_space_operations aio_ctx_aops;
 static void aio_complete(struct kiocb *kiocb, long res, long res2);
 ssize_t aio_fsync(struct kiocb *iocb, int datasync);
 long aio_poll(struct aio_kiocb *iocb);
-long aio_openat(struct aio_kiocb *req);
+
+typedef long (*do_foo_at_t)(int fd, const char *filename, int flags, int mode);
+long aio_do_openat(int fd, const char *filename, int flags, int mode);
+long aio_do_unlinkat(int fd, const char *filename, int flags, int mode);
+long aio_foo_at(struct aio_kiocb *req, do_foo_at_t do_foo_at);
 
 static __always_inline bool aio_may_use_threads(void)
 {
@@ -1763,7 +1767,19 @@ long aio_poll(struct aio_kiocb *req)
 	return aio_thread_queue_iocb(req, aio_thread_op_poll, 0);
 }
 
-static long aio_thread_op_openat(struct aio_kiocb *req)
+long aio_do_openat(int fd, const char *filename, int flags, int mode)
+{
+	return do_sys_open(fd, filename, flags, mode);
+}
+
+long aio_do_unlinkat(int fd, const char *filename, int flags, int mode)
+{
+	if (flags || mode)
+		return -EINVAL;
+	return do_unlinkat(fd, filename);
+}
+
+static long aio_thread_op_foo_at(struct aio_kiocb *req)
 {
 	u64 buf, offset;
 	long ret;
@@ -1777,18 +1793,21 @@ static long aio_thread_op_openat(struct aio_kiocb *req)
 	else if (unlikely(__get_user(offset, &req->ki_user_iocb->aio_offset)))
 		ret = -EFAULT;
 	else {
-		ret = do_sys_open((s32)fd,
-				  (const char __user *)(long)buf,
-				  (int)offset,
-				  (unsigned short)(offset >> 32));
+		do_foo_at_t do_foo_at = (void *)req->ki_data;
+
+		ret = do_foo_at((s32)fd,
+				(const char __user *)(long)buf,
+				(int)offset,
+				(unsigned short)(offset >> 32));
 	}
 	unuse_mm(req->ki_ctx->mm);
 	return ret;
 }
 
-long aio_openat(struct aio_kiocb *req)
+long aio_foo_at(struct aio_kiocb *req, do_foo_at_t do_foo_at)
 {
-	return aio_thread_queue_iocb(req, aio_thread_op_openat,
+	req->ki_data = (unsigned long)(void *)do_foo_at;
+	return aio_thread_queue_iocb(req, aio_thread_op_foo_at,
 				     AIO_THREAD_NEED_TASK |
 				     AIO_THREAD_NEED_FILES |
 				     AIO_THREAD_NEED_CRED);
@@ -1895,7 +1914,12 @@ rw_common:
 
 	case IOCB_CMD_OPENAT:
 		if (aio_may_use_threads())
-			ret = aio_openat(req);
+			ret = aio_foo_at(req, aio_do_openat);
+		break;
+
+	case IOCB_CMD_UNLINKAT:
+		if (aio_may_use_threads())
+			ret = aio_foo_at(req, aio_do_unlinkat);
 		break;
 
 	default:
diff --git a/fs/namei.c b/fs/namei.c
index 0c3974c..84ecc7e 100644
--- a/fs/namei.c
+++ b/fs/namei.c
@@ -3828,7 +3828,7 @@ EXPORT_SYMBOL(vfs_unlink);
  * writeout happening, and we don't want to prevent access to the directory
  * while waiting on the I/O.
  */
-static long do_unlinkat(int dfd, const char __user *pathname)
+long do_unlinkat(int dfd, const char __user *pathname)
 {
 	int error;
 	struct filename *name;
diff --git a/include/linux/fs.h b/include/linux/fs.h
index b3dc406..9051771 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -1509,6 +1509,7 @@ extern int vfs_symlink(struct inode *, struct dentry *, const char *);
 extern int vfs_link(struct dentry *, struct inode *, struct dentry *, struct inode **);
 extern int vfs_rmdir(struct inode *, struct dentry *);
 extern int vfs_unlink(struct inode *, struct dentry *, struct inode **);
+extern long do_unlinkat(int dfd, const char __user *pathname);
 extern int vfs_rename(struct inode *, struct dentry *, struct inode *, struct dentry *, struct inode **, unsigned int);
 extern int vfs_whiteout(struct inode *, struct dentry *);
 
diff --git a/include/uapi/linux/aio_abi.h b/include/uapi/linux/aio_abi.h
index 0e16988..63a0d41 100644
--- a/include/uapi/linux/aio_abi.h
+++ b/include/uapi/linux/aio_abi.h
@@ -46,6 +46,7 @@ enum {
 	IOCB_CMD_PWRITEV = 8,
 
 	IOCB_CMD_OPENAT = 9,
+	IOCB_CMD_UNLINKAT = 10,
 };
 
 /*
-- 
2.5.0


-- 
"Thought is the essence of where you are now."

^ permalink raw reply related	[flat|nested] 133+ messages in thread

* [PATCH 10/13] aio: add async unlinkat functionality
@ 2016-01-11 22:07   ` Benjamin LaHaise
  0 siblings, 0 replies; 133+ messages in thread
From: Benjamin LaHaise @ 2016-01-11 22:07 UTC (permalink / raw)
  To: linux-aio, linux-fsdevel, linux-kernel, linux-api, linux-mm
  Cc: Alexander Viro, Andrew Morton, Linus Torvalds

Enable asynchronous deletion of files by adding support for an aio
unlinkat operation.

Signed-off-by: Benjamin LaHaise <ben.lahaise@solacesystems.com>
Signed-off-by: Benjamin LaHaise <bcrl@kvack.org>
---
 fs/aio.c                     | 42 +++++++++++++++++++++++++++++++++---------
 fs/namei.c                   |  2 +-
 include/linux/fs.h           |  1 +
 include/uapi/linux/aio_abi.h |  1 +
 4 files changed, 36 insertions(+), 10 deletions(-)

diff --git a/fs/aio.c b/fs/aio.c
index 346786b..3a70492 100644
--- a/fs/aio.c
+++ b/fs/aio.c
@@ -232,7 +232,11 @@ static const struct address_space_operations aio_ctx_aops;
 static void aio_complete(struct kiocb *kiocb, long res, long res2);
 ssize_t aio_fsync(struct kiocb *iocb, int datasync);
 long aio_poll(struct aio_kiocb *iocb);
-long aio_openat(struct aio_kiocb *req);
+
+typedef long (*do_foo_at_t)(int fd, const char *filename, int flags, int mode);
+long aio_do_openat(int fd, const char *filename, int flags, int mode);
+long aio_do_unlinkat(int fd, const char *filename, int flags, int mode);
+long aio_foo_at(struct aio_kiocb *req, do_foo_at_t do_foo_at);
 
 static __always_inline bool aio_may_use_threads(void)
 {
@@ -1763,7 +1767,19 @@ long aio_poll(struct aio_kiocb *req)
 	return aio_thread_queue_iocb(req, aio_thread_op_poll, 0);
 }
 
-static long aio_thread_op_openat(struct aio_kiocb *req)
+long aio_do_openat(int fd, const char *filename, int flags, int mode)
+{
+	return do_sys_open(fd, filename, flags, mode);
+}
+
+long aio_do_unlinkat(int fd, const char *filename, int flags, int mode)
+{
+	if (flags || mode)
+		return -EINVAL;
+	return do_unlinkat(fd, filename);
+}
+
+static long aio_thread_op_foo_at(struct aio_kiocb *req)
 {
 	u64 buf, offset;
 	long ret;
@@ -1777,18 +1793,21 @@ static long aio_thread_op_openat(struct aio_kiocb *req)
 	else if (unlikely(__get_user(offset, &req->ki_user_iocb->aio_offset)))
 		ret = -EFAULT;
 	else {
-		ret = do_sys_open((s32)fd,
-				  (const char __user *)(long)buf,
-				  (int)offset,
-				  (unsigned short)(offset >> 32));
+		do_foo_at_t do_foo_at = (void *)req->ki_data;
+
+		ret = do_foo_at((s32)fd,
+				(const char __user *)(long)buf,
+				(int)offset,
+				(unsigned short)(offset >> 32));
 	}
 	unuse_mm(req->ki_ctx->mm);
 	return ret;
 }
 
-long aio_openat(struct aio_kiocb *req)
+long aio_foo_at(struct aio_kiocb *req, do_foo_at_t do_foo_at)
 {
-	return aio_thread_queue_iocb(req, aio_thread_op_openat,
+	req->ki_data = (unsigned long)(void *)do_foo_at;
+	return aio_thread_queue_iocb(req, aio_thread_op_foo_at,
 				     AIO_THREAD_NEED_TASK |
 				     AIO_THREAD_NEED_FILES |
 				     AIO_THREAD_NEED_CRED);
@@ -1895,7 +1914,12 @@ rw_common:
 
 	case IOCB_CMD_OPENAT:
 		if (aio_may_use_threads())
-			ret = aio_openat(req);
+			ret = aio_foo_at(req, aio_do_openat);
+		break;
+
+	case IOCB_CMD_UNLINKAT:
+		if (aio_may_use_threads())
+			ret = aio_foo_at(req, aio_do_unlinkat);
 		break;
 
 	default:
diff --git a/fs/namei.c b/fs/namei.c
index 0c3974c..84ecc7e 100644
--- a/fs/namei.c
+++ b/fs/namei.c
@@ -3828,7 +3828,7 @@ EXPORT_SYMBOL(vfs_unlink);
  * writeout happening, and we don't want to prevent access to the directory
  * while waiting on the I/O.
  */
-static long do_unlinkat(int dfd, const char __user *pathname)
+long do_unlinkat(int dfd, const char __user *pathname)
 {
 	int error;
 	struct filename *name;
diff --git a/include/linux/fs.h b/include/linux/fs.h
index b3dc406..9051771 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -1509,6 +1509,7 @@ extern int vfs_symlink(struct inode *, struct dentry *, const char *);
 extern int vfs_link(struct dentry *, struct inode *, struct dentry *, struct inode **);
 extern int vfs_rmdir(struct inode *, struct dentry *);
 extern int vfs_unlink(struct inode *, struct dentry *, struct inode **);
+extern long do_unlinkat(int dfd, const char __user *pathname);
 extern int vfs_rename(struct inode *, struct dentry *, struct inode *, struct dentry *, struct inode **, unsigned int);
 extern int vfs_whiteout(struct inode *, struct dentry *);
 
diff --git a/include/uapi/linux/aio_abi.h b/include/uapi/linux/aio_abi.h
index 0e16988..63a0d41 100644
--- a/include/uapi/linux/aio_abi.h
+++ b/include/uapi/linux/aio_abi.h
@@ -46,6 +46,7 @@ enum {
 	IOCB_CMD_PWRITEV = 8,
 
 	IOCB_CMD_OPENAT = 9,
+	IOCB_CMD_UNLINKAT = 10,
 };
 
 /*
-- 
2.5.0


-- 
"Thought is the essence of where you are now."

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply related	[flat|nested] 133+ messages in thread

* [PATCH 11/13] mm: enable __do_page_cache_readahead() to include present pages
  2016-01-11 22:06 ` Benjamin LaHaise
@ 2016-01-11 22:07   ` Benjamin LaHaise
  -1 siblings, 0 replies; 133+ messages in thread
From: Benjamin LaHaise @ 2016-01-11 22:07 UTC (permalink / raw)
  To: linux-aio, linux-fsdevel, linux-kernel, linux-api, linux-mm
  Cc: Alexander Viro, Andrew Morton, Linus Torvalds

For the upcoming AIO readahead operation it is necessary to know that
all the pages in a readahead request have had reads issued for them or
that the read was satisfied from cache.  Add a parameter to
__do_page_cache_readahead() to instruct it to count these pages in the
return value.

Signed-off-by: Benjamin LaHaise <ben.lahaise@solacesystems.com>
Signed-off-by: Benjamin LaHaise <bcrl@kvack.org>
---
 mm/internal.h  |  4 ++--
 mm/readahead.c | 13 +++++++++----
 2 files changed, 11 insertions(+), 6 deletions(-)

diff --git a/mm/internal.h b/mm/internal.h
index 38e24b8..7599068 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -43,7 +43,7 @@ static inline void set_page_count(struct page *page, int v)
 
 extern int __do_page_cache_readahead(struct address_space *mapping,
 		struct file *filp, pgoff_t offset, unsigned long nr_to_read,
-		unsigned long lookahead_size);
+		unsigned long lookahead_size, int report_present);
 
 /*
  * Submit IO for the read-ahead request in file_ra_state.
@@ -52,7 +52,7 @@ static inline unsigned long ra_submit(struct file_ra_state *ra,
 		struct address_space *mapping, struct file *filp)
 {
 	return __do_page_cache_readahead(mapping, filp,
-					ra->start, ra->size, ra->async_size);
+					ra->start, ra->size, ra->async_size, 0);
 }
 
 /*
diff --git a/mm/readahead.c b/mm/readahead.c
index ba22d7f..afd3abe 100644
--- a/mm/readahead.c
+++ b/mm/readahead.c
@@ -151,12 +151,13 @@ out:
  */
 int __do_page_cache_readahead(struct address_space *mapping, struct file *filp,
 			pgoff_t offset, unsigned long nr_to_read,
-			unsigned long lookahead_size)
+			unsigned long lookahead_size, int report_present)
 {
 	struct inode *inode = mapping->host;
 	struct page *page;
 	unsigned long end_index;	/* The last page we want to read */
 	LIST_HEAD(page_pool);
+	int present = 0;
 	int page_idx;
 	int ret = 0;
 	loff_t isize = i_size_read(inode);
@@ -178,8 +179,10 @@ int __do_page_cache_readahead(struct address_space *mapping, struct file *filp,
 		rcu_read_lock();
 		page = radix_tree_lookup(&mapping->page_tree, page_offset);
 		rcu_read_unlock();
-		if (page && !radix_tree_exceptional_entry(page))
+		if (page && !radix_tree_exceptional_entry(page)) {
+			present++;
 			continue;
+		}
 
 		page = page_cache_alloc_readahead(mapping);
 		if (!page)
@@ -199,6 +202,8 @@ int __do_page_cache_readahead(struct address_space *mapping, struct file *filp,
 	if (ret)
 		read_pages(mapping, filp, &page_pool, ret);
 	BUG_ON(!list_empty(&page_pool));
+	if (report_present)
+		ret += present;
 out:
 	return ret;
 }
@@ -222,7 +227,7 @@ int force_page_cache_readahead(struct address_space *mapping, struct file *filp,
 		if (this_chunk > nr_to_read)
 			this_chunk = nr_to_read;
 		err = __do_page_cache_readahead(mapping, filp,
-						offset, this_chunk, 0);
+						offset, this_chunk, 0, 0);
 		if (err < 0)
 			return err;
 
@@ -441,7 +446,7 @@ ondemand_readahead(struct address_space *mapping,
 	 * standalone, small random read
 	 * Read as is, and do not pollute the readahead state.
 	 */
-	return __do_page_cache_readahead(mapping, filp, offset, req_size, 0);
+	return __do_page_cache_readahead(mapping, filp, offset, req_size, 0, 0);
 
 initial_readahead:
 	ra->start = offset;
-- 
2.5.0


-- 
"Thought is the essence of where you are now."

^ permalink raw reply related	[flat|nested] 133+ messages in thread

* [PATCH 12/13] aio: add support for aio readahead
  2016-01-11 22:06 ` Benjamin LaHaise
@ 2016-01-11 22:07   ` Benjamin LaHaise
  -1 siblings, 0 replies; 133+ messages in thread
From: Benjamin LaHaise @ 2016-01-11 22:07 UTC (permalink / raw)
  To: linux-aio, linux-fsdevel, linux-kernel, linux-api, linux-mm
  Cc: Alexander Viro, Andrew Morton, Linus Torvalds

Introduce an asynchronous operation to populate the page cache with
pages at a given offset and length.  This operation is conceptually
similar to performing an asynchronous read except that it does not
actually copy the data from the page cache into userspace; rather, it
performs readahead and notifies userspace when all pages have been read.

The motivation for this came from investigating a performance
degradation when reading from disk.  In the case of a heavily
loaded system, the copy_to_user() performed for an asynchronous read was
temporally quite distant from when the data was actually used.  By only
reading the data into the kernel's page cache, the cache pollution
caused by copying the data into userspace is avoided, and overall system
performance is improved.

Signed-off-by: Benjamin LaHaise <ben.lahaise@solacesystems.com>
Signed-off-by: Benjamin LaHaise <bcrl@kvack.org>
---
 fs/aio.c                     | 141 +++++++++++++++++++++++++++++++++++++++++++
 include/uapi/linux/aio_abi.h |   1 +
 2 files changed, 142 insertions(+)

diff --git a/fs/aio.c b/fs/aio.c
index 3a70492..5cb3d74 100644
--- a/fs/aio.c
+++ b/fs/aio.c
@@ -42,6 +42,7 @@
 #include <linux/mount.h>
 #include <linux/fdtable.h>
 #include <linux/fs_struct.h>
+#include <../mm/internal.h>
 
 #include <asm/kmap_types.h>
 #include <asm/uaccess.h>
@@ -238,6 +239,8 @@ long aio_do_openat(int fd, const char *filename, int flags, int mode);
 long aio_do_unlinkat(int fd, const char *filename, int flags, int mode);
 long aio_foo_at(struct aio_kiocb *req, do_foo_at_t do_foo_at);
 
+long aio_readahead(struct aio_kiocb *iocb, unsigned long len);
+
 static __always_inline bool aio_may_use_threads(void)
 {
 #if IS_ENABLED(CONFIG_AIO_THREAD)
@@ -1812,6 +1815,137 @@ long aio_foo_at(struct aio_kiocb *req, do_foo_at_t do_foo_at)
 				     AIO_THREAD_NEED_FILES |
 				     AIO_THREAD_NEED_CRED);
 }
+
+static int aio_ra_filler(void *data, struct page *page)
+{
+	struct file *file = data;
+
+	return file->f_mapping->a_ops->readpage(file, page);
+}
+
+static long aio_ra_wait_on_pages(struct file *file, pgoff_t start,
+				 unsigned long nr)
+{
+	struct address_space *mapping = file->f_mapping;
+	unsigned long i;
+
+	/* Wait on pages starting at the end to hopefully avoid too many
+	 * wakeups.
+	 */
+	for (i = nr; i-- > 0; ) {
+		pgoff_t index = start + i;
+		struct page *page;
+
+		/* First do the quick check to see if the page is present and
+		 * uptodate.
+		 */
+		rcu_read_lock();
+		page = radix_tree_lookup(&mapping->page_tree, index);
+		rcu_read_unlock();
+
+		if (page && !radix_tree_exceptional_entry(page) &&
+		    PageUptodate(page)) {
+			continue;
+		}
+
+		page = read_cache_page(mapping, index, aio_ra_filler, file);
+		if (IS_ERR(page))
+			return PTR_ERR(page);
+		page_cache_release(page);
+	}
+	return 0;
+}
+
+static long aio_thread_op_readahead(struct aio_kiocb *iocb)
+{
+	pgoff_t start, end, nr, offset;
+	long ret = 0;
+
+	start = iocb->common.ki_pos >> PAGE_CACHE_SHIFT;
+	end = (iocb->common.ki_pos + iocb->ki_data - 1) >> PAGE_CACHE_SHIFT;
+	nr = end - start + 1;
+
+	for (offset = 0; offset < nr; ) {
+		pgoff_t chunk = nr - offset;
+		unsigned long max_chunk = (2 * 1024 * 1024) / PAGE_CACHE_SIZE;
+
+		if (chunk > max_chunk)
+			chunk = max_chunk;
+
+		ret = __do_page_cache_readahead(iocb->common.ki_filp->f_mapping,
+						iocb->common.ki_filp,
+						start + offset, chunk, 0, 1);
+		if (ret <= 0)
+			break;
+		offset += ret;
+	}
+
+	if (!offset && ret < 0)
+		return ret;
+
+	if (offset > 0) {
+		ret = aio_ra_wait_on_pages(iocb->common.ki_filp, start, offset);
+		if (ret < 0)
+			return ret;
+	}
+
+	if (offset == nr)
+		return iocb->ki_data;
+	if (offset > 0)
+		return ((start + offset) << PAGE_CACHE_SHIFT) -
+			iocb->common.ki_pos;
+	return 0;
+}
+
+long aio_readahead(struct aio_kiocb *iocb, unsigned long len)
+{
+	struct address_space *mapping = iocb->common.ki_filp->f_mapping;
+	pgoff_t index, end;
+	loff_t epos, isize;
+	int do_io = 0;
+
+	if (!mapping || !mapping->a_ops)
+		return -EBADF;
+	if (!mapping->a_ops->readpage && !mapping->a_ops->readpages)
+		return -EBADF;
+	if (!len)
+		return 0;
+
+	epos = iocb->common.ki_pos + len;
+	if (epos < 0)
+		return -EINVAL;
+	isize = i_size_read(mapping->host);
+	if (isize < epos) {
+		epos = isize - iocb->common.ki_pos;
+		if (epos <= 0)
+			return 0;
+		if ((unsigned long)epos != epos)
+			return -EINVAL;
+		len = epos;
+	}
+
+	index = iocb->common.ki_pos >> PAGE_CACHE_SHIFT;
+	end = (iocb->common.ki_pos + len - 1) >> PAGE_CACHE_SHIFT;
+	iocb->ki_data = len;
+	if (end < index)
+		return -EINVAL;
+
+	do {
+		struct page *page;
+
+		rcu_read_lock();
+		page = radix_tree_lookup(&mapping->page_tree, index);
+		rcu_read_unlock();
+
+		if (!page || radix_tree_exceptional_entry(page) ||
+		    !PageUptodate(page))
+			do_io = 1;
+	} while (!do_io && (index++ < end));
+
+	if (do_io)
+		return aio_thread_queue_iocb(iocb, aio_thread_op_readahead, 0);
+	return len;
+}
 #endif /* IS_ENABLED(CONFIG_AIO_THREAD) */
 
 /*
@@ -1922,6 +2056,13 @@ rw_common:
 			ret = aio_foo_at(req, aio_do_unlinkat);
 		break;
 
+	case IOCB_CMD_READAHEAD:
+		if (user_iocb->aio_buf)
+			return -EINVAL;
+		if (aio_may_use_threads())
+			ret = aio_readahead(req, user_iocb->aio_nbytes);
+		break;
+
 	default:
 		pr_debug("EINVAL: no operation provided\n");
 		return -EINVAL;
diff --git a/include/uapi/linux/aio_abi.h b/include/uapi/linux/aio_abi.h
index 63a0d41..4def682 100644
--- a/include/uapi/linux/aio_abi.h
+++ b/include/uapi/linux/aio_abi.h
@@ -47,6 +47,7 @@ enum {
 
 	IOCB_CMD_OPENAT = 9,
 	IOCB_CMD_UNLINKAT = 10,
+	IOCB_CMD_READAHEAD = 12,
 };
 
 /*
-- 
2.5.0


-- 
"Thought is the essence of where you are now."

^ permalink raw reply related	[flat|nested] 133+ messages in thread

* [PATCH 13/13] aio: add support for aio renameat operation
  2016-01-11 22:06 ` Benjamin LaHaise
@ 2016-01-11 22:08   ` Benjamin LaHaise
  -1 siblings, 0 replies; 133+ messages in thread
From: Benjamin LaHaise @ 2016-01-11 22:08 UTC (permalink / raw)
  To: linux-aio, linux-fsdevel, linux-kernel, linux-api, linux-mm
  Cc: Alexander Viro, Andrew Morton, Linus Torvalds

Add support for an aio renameat operation that implements an
asynchronous renameat2().

Signed-off-by: Benjamin LaHaise <ben.lahaise@solacesystems.com>
Signed-off-by: Benjamin LaHaise <bcrl@kvack.org>
---
 fs/aio.c                     | 63 ++++++++++++++++++++++++++++++++++++++++++++
 include/uapi/linux/aio_abi.h |  9 +++++++
 2 files changed, 72 insertions(+)

diff --git a/fs/aio.c b/fs/aio.c
index 5cb3d74..aaecadf 100644
--- a/fs/aio.c
+++ b/fs/aio.c
@@ -240,6 +240,7 @@ long aio_do_unlinkat(int fd, const char *filename, int flags, int mode);
 long aio_foo_at(struct aio_kiocb *req, do_foo_at_t do_foo_at);
 
 long aio_readahead(struct aio_kiocb *iocb, unsigned long len);
+long aio_renameat(struct aio_kiocb *iocb, struct iocb *user_iocb);
 
 static __always_inline bool aio_may_use_threads(void)
 {
@@ -1946,6 +1947,63 @@ long aio_readahead(struct aio_kiocb *iocb, unsigned long len)
 		return aio_thread_queue_iocb(iocb, aio_thread_op_readahead, 0);
 	return len;
 }
+
+static long aio_thread_op_renameat(struct aio_kiocb *iocb)
+{
+	const void * __user user_info = (void * __user)iocb->common.private;
+	struct renameat_info info;
+	const char * __user old;
+	const char * __user new;
+	int olddir, newdir;
+	unsigned flags;
+	long ret;
+
+	use_mm(aio_get_mm(&iocb->common));
+	if (unlikely(copy_from_user(&info, user_info, sizeof(info)))) {
+		ret = -EFAULT;
+		goto done;
+	}
+
+	old = (const char * __user)(unsigned long)info.oldpath;
+	new = (const char * __user)(unsigned long)info.newpath;
+	olddir = info.olddirfd;
+	newdir = info.newdirfd;
+	flags = info.flags;
+
+	if (((unsigned long)old != info.oldpath) ||
+	    ((unsigned long)new != info.newpath) ||
+	    (olddir != info.olddirfd) ||
+	    (newdir != info.newdirfd) ||
+	    (flags != info.flags))
+		ret = -EINVAL;
+	else
+		ret = sys_renameat2(olddir, old, newdir, new, flags);
+done:
+	unuse_mm(aio_get_mm(&iocb->common));
+	return ret;
+}
+
+long aio_renameat(struct aio_kiocb *iocb, struct iocb *user_iocb)
+{
+	const void * __user user_info;
+
+	if (user_iocb->aio_nbytes != sizeof(struct renameat_info))
+		return -EINVAL;
+	if (user_iocb->aio_offset)
+		return -EINVAL;
+
+	user_info = (const void * __user)user_iocb->aio_buf;
+	if (unlikely(!access_ok(VERIFY_READ, user_info,
+				sizeof(struct renameat_info))))
+		return -EFAULT;
+
+	iocb->common.private = (void *)user_info;
+	return aio_thread_queue_iocb(iocb, aio_thread_op_renameat,
+				     AIO_THREAD_NEED_TASK |
+				     AIO_THREAD_NEED_FS |
+				     AIO_THREAD_NEED_FILES |
+				     AIO_THREAD_NEED_CRED);
+}
 #endif /* IS_ENABLED(CONFIG_AIO_THREAD) */
 
 /*
@@ -2063,6 +2121,11 @@ rw_common:
 			ret = aio_readahead(req, user_iocb->aio_nbytes);
 		break;
 
+	case IOCB_CMD_RENAMEAT:
+		if (aio_may_use_threads())
+			ret = aio_renameat(req, user_iocb);
+		break;
+
 	default:
 		pr_debug("EINVAL: no operation provided\n");
 		return -EINVAL;
diff --git a/include/uapi/linux/aio_abi.h b/include/uapi/linux/aio_abi.h
index 4def682..9417abd 100644
--- a/include/uapi/linux/aio_abi.h
+++ b/include/uapi/linux/aio_abi.h
@@ -48,6 +48,7 @@ enum {
 	IOCB_CMD_OPENAT = 9,
 	IOCB_CMD_UNLINKAT = 10,
 	IOCB_CMD_READAHEAD = 12,
+	IOCB_CMD_RENAMEAT = 13,
 };
 
 /*
@@ -108,6 +109,14 @@ struct iocb {
 	__u32	aio_resfd;
 }; /* 64 bytes */
 
+struct renameat_info {
+	__s64	olddirfd;
+	__u64	oldpath;
+	__s64	newdirfd;
+	__u64	newpath;
+	__u64	flags;
+};
+
 #undef IFBIG
 #undef IFLITTLE
 
-- 
2.5.0


-- 
"Thought is the essence of where you are now."


* Re: [PATCH 09/13] aio: add support for async openat()
  2016-01-11 22:07   ` Benjamin LaHaise
@ 2016-01-12  0:22     ` Linus Torvalds
  -1 siblings, 0 replies; 133+ messages in thread
From: Linus Torvalds @ 2016-01-12  0:22 UTC (permalink / raw)
  To: Benjamin LaHaise, Ingo Molnar
  Cc: linux-aio, linux-fsdevel, Linux Kernel Mailing List, Linux API,
	linux-mm, Alexander Viro, Andrew Morton

On Mon, Jan 11, 2016 at 2:07 PM, Benjamin LaHaise <bcrl@kvack.org> wrote:
> Another blocking operation used by applications that want aio
> functionality is that of opening files that are not resident in memory.
> Using the thread based aio helper, add support for IOCB_CMD_OPENAT.

So I think this is ridiculously ugly.

AIO is a horrible ad-hoc design, with the main excuse being "other,
less gifted people, made that design, and we are implementing it for
compatibility because database people - who seldom have any shred of
taste - actually use it".

But AIO was always really really ugly.

Now you introduce the notion of doing almost arbitrary system calls
asynchronously in threads, but then you use that ass-backwards nasty
interface to do so.

Why?

If you want to do arbitrary asynchronous system calls, just *do* it.
But do _that_, not "let's extend this horrible interface in arbitrary
random ways one special system call at a time".

In other words, why is the interface not simply: "do arbitrary system
call X with arguments A, B, C, D asynchronously using a kernel
thread".

That's something that a lot of people might use. In fact, if they can
avoid the nasty AIO interface, maybe they'll even use it for things
like read() and write().

So I really think it would be a nice thing to allow some kind of
arbitrary "queue up asynchronous system call" model.

But I do not think the AIO model should be the model used for that,
even if I think there might be some shared infrastructure.

So I would seriously suggest:

 - how about we add a true "asynchronous system call" interface

 - make it be a list of system calls with a futex completion for each
list entry, so that you can easily wait for the end result that way.

 - maybe (and this is where it gets really iffy) you could even pass
in the result of one system call to the next, so that you can do
things like

       fd = openat(..)
       ret = read(fd, ..)

   asynchronously and then just wait for the read() to complete.

and let us *not* tie this to the aio interface.

In fact, if we do it well, we can go the other way, and try to
implement the nasty AIO interface on top of the generic "just do
things asynchronously".

And I actually think many of your kernel thread parts are good for a
generic implementation. That whole "AIO_THREAD_NEED_CRED" etc logic
all makes sense, although I do suspect you could just make it
unconditional. The cost of a few atomics shouldn't be excessive when
we're talking "use a thread to do op X".

What do you think? Do you think it might be possible to aim for a
generic "do system call asynchronously" model instead?

I'm adding Ingo to the cc, because I think Ingo had a "run this list
of system calls" patch at one point - in order to avoid system call
overhead. I don't think that was very interesting (because system call
overhead is seldom all that noticeable for any interesting system
calls), but with the "let's do the list asynchronously" addition it
might be much more intriguing. Ingo, do I remember correctly that it
was you? I might be confused about who wrote that patch, and I can't
find it now.

               Linus


* Re: [PATCH 07/13] aio: enabled thread based async fsync
  2016-01-11 22:07   ` Benjamin LaHaise
@ 2016-01-12  1:11     ` Dave Chinner
  -1 siblings, 0 replies; 133+ messages in thread
From: Dave Chinner @ 2016-01-12  1:11 UTC (permalink / raw)
  To: Benjamin LaHaise
  Cc: linux-aio, linux-fsdevel, linux-kernel, linux-api, linux-mm,
	Alexander Viro, Andrew Morton, Linus Torvalds

On Mon, Jan 11, 2016 at 05:07:23PM -0500, Benjamin LaHaise wrote:
> Enable a fully asynchronous fsync and fdatasync operations in aio using
> the aio thread queuing mechanism.
> 
> Signed-off-by: Benjamin LaHaise <ben.lahaise@solacesystems.com>
> Signed-off-by: Benjamin LaHaise <bcrl@kvack.org>

Insufficient. Needs the range to be passed through and call
vfs_fsync_range(), as I implemented here:

https://lkml.org/lkml/2015/10/28/878

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com


* Re: [PATCH 09/13] aio: add support for async openat()
  2016-01-12  0:22     ` Linus Torvalds
  (?)
@ 2016-01-12  1:17       ` Benjamin LaHaise
  -1 siblings, 0 replies; 133+ messages in thread
From: Benjamin LaHaise @ 2016-01-12  1:17 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Ingo Molnar, linux-aio, linux-fsdevel, Linux Kernel Mailing List,
	Linux API, linux-mm, Alexander Viro, Andrew Morton

On Mon, Jan 11, 2016 at 04:22:28PM -0800, Linus Torvalds wrote:
> On Mon, Jan 11, 2016 at 2:07 PM, Benjamin LaHaise <bcrl@kvack.org> wrote:
> > Another blocking operation used by applications that want aio
> > functionality is that of opening files that are not resident in memory.
> > Using the thread based aio helper, add support for IOCB_CMD_OPENAT.
> 
> So I think this is ridiculously ugly.
> 
> AIO is a horrible ad-hoc design, with the main excuse being "other,
> less gifted people, made that design, and we are implementing it for
> compatibility because database people - who seldom have any shred of
> taste - actually use it".
> 
> But AIO was always really really ugly.
> 
> Now you introduce the notion of doing almost arbitrary system calls
> asynchronously in threads, but then you use that ass-backwards nasty
> interface to do so.
> 
> Why?

Understood, but there are some reasons behind this.  The core aio submit
mechanism is modeled after the lio_listio() call in posix.  While the
cost of performing syscalls has decreased substantially over the last 10
years, the cost of context switches has not.  Some AIO operations really
want to do part of the work in the context of the original submitter.
That was/is a critical piece of the async readahead
functionality in this series -- without being able to do a quick return
to the caller when all the cached data is already resident in the
kernel, there is a significant performance degradation in my tests.  For
other operations which are going to do blocking i/o anyways, the cost of
the context switch often becomes noise.

The async readahead also fills a hole in the proposed extensions
to preadv()/pwritev() -- they need some way to trigger and know when a
readahead operation has completed.  One needs a completion queue of some
sort to figure out which operation has completed in a reasonably
efficient manner.  The futex doesn't really have the ability to do this.

Thread dispatching is another problem the applications I work on
encounter, and AIO helps in this particular area because a thread that
is running hot can simply check the AIO event ring buffer for new events
in its main event loop.  Userspace fundamentally *cannot* do a good job of
dispatching work to threads.  The code I've see other developers come up
with ends up doing things like epoll() in one thread followed by
dispatching the receieved events to different threads.  This ends up
making multiple expensive syscalls (since locking and cross CPU bouncing
is required) when the kernel could just direct things to the right
thread in the first place.

There are a lot of requirements bringing additional complexity that start
to surface once you look at how some of these applications are actually
written.

> If you want to do arbitrary asynchronous system calls, just *do* it.
> But do _that_, not "let's extend this horrible interface in arbitrary
> random ways one special system call at a time".
> 
> In other words, why is the interface not simply: "do arbitrary system
> call X with arguments A, B, C, D asynchronously using a kernel
> thread".

We've had a few proposals to do this, none of which have really managed 
to tackle all the problems that arose.  If we go down this path, we will 
end up needing a table of what syscalls can actually be performed 
asynchronously, and flags indicating what bits of context those syscalls
require.  This does end up looking a bit like how AIO does things
depending on how hard you squint.

I'm not opposed to reworking how AIO dispatches things.  If we're willing 
to relax some constraints (like the hard enforced limits on the number
of AIOs in flight), things can be substantially simplified.  Again,
worries about things like memory usage today are vastly different than
they were back in the early '00s, so the decisions that make sense now
will certainly change the design.

Cancellation is also a concern.  Cancellation is not something that can
be sacrificed.  Without some mechanism to cancel operations that are in
flight, there is no way for a process to cleanly exit.  This patch
series nicely proves that signals work very well for cancellation, and
fit in with a lot of the code we already have.  This implies we would
need to treat threads doing async operations differently from normal
threads.  What happens with the pid namespace?

> That's something that a lot of people might use. In fact, if they can
> avoid the nasty AIO interface, maybe they'll even use it for things
> like read() and write().
> 
> So I really think it would be a nice thing to allow some kind of
> arbitrary "queue up asynchronous system call" model.
> 
> But I do not think the AIO model should be the model used for that,
> even if I think there might be some shared infrastructure.
> 
> So I would seriously suggest:
> 
>  - how about we add a true "asynchronous system call" interface
> 
>  - make it be a list of system calls with a futex completion for each
> list entry, so that you can easily wait for the end result that way.
> 
>  - maybe (and this is where it gets really iffy) you could even pass
> in the result of one system call to the next, so that you can do
> things like
> 
>        fd = openat(..)
>        ret = read(fd, ..)
> 
>    asynchronously and then just wait for the read() to complete.
> 
> and let us *not* tie this to the aio interface.
> 
> In fact, if we do it well, we can go the other way, and try to
> implement the nasty AIO interface on top of the generic "just do
> things asynchronously".
> 
> And I actually think many of your kernel thread parts are good for a
> generic implementation. That whole "AIO_THREAD_NEED_CRED" etc logic
> all makes sense, although I do suspect you could just make it
> unconditional. The cost of a few atomics shouldn't be excessive when
> we're talking "use a thread to do op X".
> 
> What do you think? Do you think it might be possible to aim for a
> generic "do system call asynchronously" model instead?

Maybe it's not too bad to do -- the syscall() primitive is reasonably 
well defined and is supported across architectures, but we're going to 
need new wrappers for *every* syscall supported.  Odds are the work will 
have to be done incrementally to weed out which syscalls are safe and 
which are not, but there is certainly no reason we can't reuse syscall 
numbers and the same argument layout.

Chaining things becomes messy.  There are some cases where that works,
but at least on the applications I've worked on, there tends to be a
fair amount of logic that needs to be run before you can figure out what
and where the next operation is.  The canonical example I can think of
is the case where one is retrieving data from disk.  The first operation
is a read into some table to find out where data is located, the next
operation is a search (binary search in the case I'm thinking of) in the
data that was just read to figure out which record actually contains the
data the app cares about, followed by a read to actually fetch the data
the user actually requires.

And it gets more complicated: different disk i/os need to be issued with
different priorities (something that was not included in what I just
posted today, but is work I plan to propose for merging in the future).
In some cases the priority is known beforehand, but in other cases it
needs to be adjusted dynamically depending on information fetched (users
don't like it if huge i/os completely starve their smaller i/os for
significant amounts of time).

> I'm adding Ingo the to cc, because I think Ingo had a "run this list
> of system calls" patch at one point - in order to avoid system call
> overhead. I don't think that was very interesting (because system call
> overhead is seldom all that noticeable for any interesting system
> calls), but with the "let's do the list asynchronously" addition it
> might be much more intriguing. Ingo, do I remember correctly that it
> was you? I might be confused about who wrote that patch, and I can't
> find it now.

I'd certainly be interested in hearing more ideas concerning
requirements.

Sorry for the giant wall of text...  Nothing is simple! =-)

		-ben

>                Linus

-- 
"Thought is the essence of where you are now."


is the case where one is retrieving data from disk.  The first operation
is a read into some table to find out where the data is located, the next
operation is a search (a binary search in the case I'm thinking of) in the
data that was just read to figure out which record actually contains the
data the app cares about, followed by a final read to fetch the data the
user actually requires.

And it gets more complicated: different disk i/os need to be issued with
different priorities (something that was not included in what I just
posted today, but is work I plan to propose for merging in the future).
In some cases the priority is known beforehand, but in other cases it
needs to be adjusted dynamically depending on information fetched (users
don't like it if huge i/os completely starve their smaller i/os for
significant amounts of time).

> I'm adding Ingo the to cc, because I think Ingo had a "run this list
> of system calls" patch at one point - in order to avoid system call
> overhead. I don't think that was very interesting (because system call
> overhead is seldom all that noticeable for any interesting system
> calls), but with the "let's do the list asynchronously" addition it
> might be much more intriguing. Ingo, do I remember correctly that it
> was you? I might be confused about who wrote that patch, and I can't
> find it now.

I'd certainly be interested in hearing more ideas concerning
requirements.

Sorry for the giant wall of text...  Nothing is simple! =-)

		-ben

>                Linus

-- 
"Thought is the essence of where you are now."

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: dont@kvack.org

^ permalink raw reply	[flat|nested] 133+ messages in thread

* Re: [PATCH 07/13] aio: enabled thread based async fsync
  2016-01-12  1:11     ` Dave Chinner
@ 2016-01-12  1:20       ` Linus Torvalds
  -1 siblings, 0 replies; 133+ messages in thread
From: Linus Torvalds @ 2016-01-12  1:20 UTC (permalink / raw)
  To: Dave Chinner
  Cc: Benjamin LaHaise, linux-aio, linux-fsdevel,
	Linux Kernel Mailing List, Linux API, linux-mm, Alexander Viro,
	Andrew Morton

On Mon, Jan 11, 2016 at 5:11 PM, Dave Chinner <david@fromorbit.com> wrote:
>
> Insufficient. Needs the range to be passed through and call
> vfs_fsync_range(), as I implemented here:

And I think that's insufficient *also*.

What you actually want is "sync_file_range()", with the full set of arguments.

Yes, really. Sometimes you want to start the writeback, sometimes you
want to wait for it. Sometimes you want both.

For example, if you are doing your own manual write-behind logic, it
is not sufficient for "wait for data". What you want is "start IO on
new data" followed by "wait for old data to have been written out".

I think this only strengthens my "stop with the idiotic
special-case-AIO magic already" argument.  If we want something more
generic than the usual aio, then we should go all in. Not "let's make
more limited special cases".

              Linus

^ permalink raw reply	[flat|nested] 133+ messages in thread

* Re: [PATCH 07/13] aio: enabled thread based async fsync
  2016-01-12  1:11     ` Dave Chinner
  (?)
@ 2016-01-12  1:30       ` Benjamin LaHaise
  -1 siblings, 0 replies; 133+ messages in thread
From: Benjamin LaHaise @ 2016-01-12  1:30 UTC (permalink / raw)
  To: Dave Chinner
  Cc: linux-aio, linux-fsdevel, linux-kernel, linux-api, linux-mm,
	Alexander Viro, Andrew Morton, Linus Torvalds

On Tue, Jan 12, 2016 at 12:11:28PM +1100, Dave Chinner wrote:
> On Mon, Jan 11, 2016 at 05:07:23PM -0500, Benjamin LaHaise wrote:
> > Enable a fully asynchronous fsync and fdatasync operations in aio using
> > the aio thread queuing mechanism.
> > 
> > Signed-off-by: Benjamin LaHaise <ben.lahaise@solacesystems.com>
> > Signed-off-by: Benjamin LaHaise <bcrl@kvack.org>
> 
> Insufficient. Needs the range to be passed through and call
> vfs_fsync_range(), as I implemented here:

Noted.

> https://lkml.org/lkml/2015/10/28/878

Please at least Cc the aio list in the future on aio patches, as I do
not have the time to read linux-kernel these days unless prodded to do
so...

		-ben

> Cheers,
> 
> Dave.
> -- 
> Dave Chinner
> david@fromorbit.com

-- 
"Thought is the essence of where you are now."

^ permalink raw reply	[flat|nested] 133+ messages in thread

* Re: [PATCH 09/13] aio: add support for async openat()
  2016-01-12  0:22     ` Linus Torvalds
  (?)
@ 2016-01-12  1:45       ` Chris Mason
  -1 siblings, 0 replies; 133+ messages in thread
From: Chris Mason @ 2016-01-12  1:45 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Benjamin LaHaise, Ingo Molnar, linux-aio, linux-fsdevel,
	Linux Kernel Mailing List, Linux API, linux-mm, Alexander Viro,
	Andrew Morton

On Mon, Jan 11, 2016 at 04:22:28PM -0800, Linus Torvalds wrote:
> On Mon, Jan 11, 2016 at 2:07 PM, Benjamin LaHaise <bcrl@kvack.org> wrote:
> > Another blocking operation used by applications that want aio
> > functionality is that of opening files that are not resident in memory.
> > Using the thread based aio helper, add support for IOCB_CMD_OPENAT.
> 
> So I think this is ridiculously ugly.
> 
> AIO is a horrible ad-hoc design, with the main excuse being "other,
> less gifted people, made that design, and we are implementing it for
> compatibility because database people - who seldom have any shred of
> taste - actually use it".
> 
> But AIO was always really really ugly.
> 
> Now you introduce the notion of doing almost arbitrary system calls
> asynchronously in threads, but then you use that ass-backwards nasty
> interface to do so.

[ ... ]

> I'm adding Ingo the to cc, because I think Ingo had a "run this list
> of system calls" patch at one point - in order to avoid system call
> overhead. I don't think that was very interesting (because system call
> overhead is seldom all that noticeable for any interesting system
> calls), but with the "let's do the list asynchronously" addition it
> might be much more intriguing. Ingo, do I remember correctly that it
> was you? I might be confused about who wrote that patch, and I can't
> find it now.

Zach Brown and Ingo traded a bunch of ideas.  There were chicklets and
syslets?  A little searching suggests acall was a slightly different
iteration, but the patches didn't make it off oss.oracle.com:

https://lwn.net/Articles/316806/

-chris

^ permalink raw reply	[flat|nested] 133+ messages in thread

* Re: [PATCH 07/13] aio: enabled thread based async fsync
  2016-01-12  1:20       ` Linus Torvalds
  (?)
@ 2016-01-12  2:25         ` Dave Chinner
  -1 siblings, 0 replies; 133+ messages in thread
From: Dave Chinner @ 2016-01-12  2:25 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Benjamin LaHaise, linux-aio, linux-fsdevel,
	Linux Kernel Mailing List, Linux API, linux-mm, Alexander Viro,
	Andrew Morton

On Mon, Jan 11, 2016 at 05:20:42PM -0800, Linus Torvalds wrote:
> On Mon, Jan 11, 2016 at 5:11 PM, Dave Chinner <david@fromorbit.com> wrote:
> >
> > Insufficient. Needs the range to be passed through and call
> > vfs_fsync_range(), as I implemented here:
> 
> And I think that's insufficient *also*.
> 
> What you actually want is "sync_file_range()", with the full set of arguments.

That's a different interface.  The aio fsync interface has been
exposed to userspace for years; we just haven't implemented it in
the kernel.  That's a major difference from everything else being
proposed in this patch set, especially this one.

FYI sync_file_range() is definitely not a fsync/fdatasync
replacement as it does not guarantee data durability in any way.
i.e. you can call sync_file_range, have it wait for data to be
written, return to userspace, then lose power and lose the data that
sync_file_range said it wrote. That's because sync_file_range()
does not:

	a) write the metadata needed to reference the data to disk;
	   and
	b) flush volatile storage caches after data and metadata is
	   written.

Hence sync_file_range is useless to applications that need to
guarantee data durability. Not to mention that most AIO applications
use direct IO, and so have no use for fine-grained control over page
cache writeback semantics. They only require a) and b) above, so
implementing the AIO fsync primitive is exactly what they want.

> Yes, really. Sometimes you want to start the writeback, sometimes you
> want to wait for it. Sometimes you want both.

Without durability guarantees such application level optimisations
are pretty much worthless.

> I think this only strengthens my "stop with the idiotic
> special-case-AIO magic already" argument.  If we want something more
> generic than the usual aio, then we should go all in. Not "let's make
> more limited special cases".

No, I don't think this specific case does, because the AIO fsync
interface already exists....

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 133+ messages in thread

* Re: [PATCH 07/13] aio: enabled thread based async fsync
@ 2016-01-12  2:25         ` Dave Chinner
  0 siblings, 0 replies; 133+ messages in thread
From: Dave Chinner @ 2016-01-12  2:25 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Benjamin LaHaise, linux-aio, linux-fsdevel,
	Linux Kernel Mailing List, Linux API, linux-mm, Alexander Viro,
	Andrew Morton

On Mon, Jan 11, 2016 at 05:20:42PM -0800, Linus Torvalds wrote:
> On Mon, Jan 11, 2016 at 5:11 PM, Dave Chinner <david@fromorbit.com> wrote:
> >
> > Insufficient. Needs the range to be passed through and call
> > vfs_fsync_range(), as I implemented here:
> 
> And I think that's insufficient *also*.
> 
> What you actually want is "sync_file_range()", with the full set of arguments.

That's a different interface. the aio fsync interface has been
exposed to userspace for years, we just haven't implemented it in
the kernel. That's a major difference to everything else being
proposed in this patch set, especially this one.

FYI sync_file_range() is definitely not a fsync/fdatasync
replacement as it does not guarantee data durability in any way.
i.e. you can call sync_file_range, have it wait for data to be
written, return to userspace, then lose power and lose the data that
sync_file_range said it wrote. That's because sync_file_range()
does not:

	a) write the metadata needed to reference the data to disk;
	   and
	b) flush volatile storage caches after data and metadata is
	   written.

Hence sync_file_range is useless to applications that need to
guarantee data durability. Not to mention that most AIO applications
use direct IO, and so have no use for fine grained control over page
cache writeback semantics. They only require a) and b) above, so
implementing the AIO fsync primitive is exactly what they want.

> Yes, really. Sometimes you want to start the writeback, sometimes you
> want to wait for it. Sometimes you want both.

Without durability guarantees such application level optimisations
are pretty much worthless.

> I think this only strengthens my "stop with the idiotic
> special-case-AIO magic already" argument.  If we want something more
> generic than the usual aio, then we should go all in. Not "let's make
> more limited special cases".

No, I don't think this specific case does, because the AIO fsync
interface already exists....

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 133+ messages in thread

* Re: [PATCH 07/13] aio: enabled thread based async fsync
  2016-01-12  2:25         ` Dave Chinner
@ 2016-01-12  2:38           ` Linus Torvalds
  -1 siblings, 0 replies; 133+ messages in thread
From: Linus Torvalds @ 2016-01-12  2:38 UTC (permalink / raw)
  To: Dave Chinner
  Cc: Benjamin LaHaise, linux-aio, linux-fsdevel,
	Linux Kernel Mailing List, Linux API, linux-mm, Alexander Viro,
	Andrew Morton

On Mon, Jan 11, 2016 at 6:25 PM, Dave Chinner <david@fromorbit.com> wrote:
>
> That's a different interface.

So is openat. So is readahead.

My point is that this idiotic "let's expose special cases" must end.
It's broken. It inevitably only exposes a subset of what different
people would want.

Making "aio_read()" and friends a special interface had historical
reasons for it. But expanding willy-nilly on that model does not.

               Linus


* Re: [PATCH 07/13] aio: enabled thread based async fsync
  2016-01-12  2:38           ` Linus Torvalds
@ 2016-01-12  3:37             ` Dave Chinner
  -1 siblings, 0 replies; 133+ messages in thread
From: Dave Chinner @ 2016-01-12  3:37 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Benjamin LaHaise, linux-aio, linux-fsdevel,
	Linux Kernel Mailing List, Linux API, linux-mm, Alexander Viro,
	Andrew Morton

On Mon, Jan 11, 2016 at 06:38:15PM -0800, Linus Torvalds wrote:
> On Mon, Jan 11, 2016 at 6:25 PM, Dave Chinner <david@fromorbit.com> wrote:
> >
> > That's a different interface.
> 
> So is openat. So is readahead.
>
> My point is that this idiotic "let's expose special cases" must end.
> It's broken. It inevitably only exposes a subset of what different
> people would want.
> 
> Making "aio_read()" and friends a special interface had historical
> reasons for it. But expanding willy-nilly on that model does not.

Yes, I heard you the first time, but you haven't acknowledged that
the aio fsync interface is indeed different because it already
exists. What's the problem with implementing an AIO call that we've
advertised as supported for many years now that people are asking us
to implement it?

As for a generic async syscall interface, why not just add
IOCB_CMD_SYSCALL that encodes the syscall number and parameters
into the iovec structure and let the existing aio subsystem handle
demultiplexing it and handing them off to threads/workqueues/etc?
That way we get contexts, events, signals, completions,
cancellations, etc. from the existing infrastructure, and there's
really only a dispatch/collection layer that needs to be added.

If we then provide the userspace interface via the libaio library to
call the async syscalls with an AIO context handle, then there's
little more that needs to be done to support just about everything
as an async syscall...

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com


* Re: [PATCH 07/13] aio: enabled thread based async fsync
  2016-01-12  3:37             ` Dave Chinner
@ 2016-01-12  4:03               ` Linus Torvalds
  -1 siblings, 0 replies; 133+ messages in thread
From: Linus Torvalds @ 2016-01-12  4:03 UTC (permalink / raw)
  To: Dave Chinner
  Cc: Benjamin LaHaise, linux-aio, linux-fsdevel,
	Linux Kernel Mailing List, Linux API, linux-mm, Alexander Viro,
	Andrew Morton

On Mon, Jan 11, 2016 at 7:37 PM, Dave Chinner <david@fromorbit.com> wrote:
>
> Yes, I heard you the first time, but you haven't acknowledged that
> the aio fsync interface is indeed different because it already
> exists. What's the problem with implementing an AIO call that we've
> advertised as supported for many years now that people are asking us
> to implement it?

Oh, I don't disagree with that. I think it should be exposed, my point
was that that too was not enough.

I don't see why you argue. You said "that's not enough". And I just
said that your expansion wasn't sufficient either, and that I think we
should strive to expand things even more.

And preferably not in some ad-hoc manner. Expand it to *everything* we can do.

> As for a generic async syscall interface, why not just add
> IOCB_CMD_SYSCALL that encodes the syscall number and parameters
> into the iovec structure and let the existing aio subsystem handle
> demultiplexing it and handing them off to threads/workqueues/etc?

That would likely be the simplest approach, yes.

There's a few arguments against it, though:

 - doing the indirect system call thing does end up being
architecture-specific, so now you do need the AIO code to call into
some arch wrapper.

   Not a huge deal, since the arch wrapper will be pretty simple (and
we can have a default one that just returns ENOSYS, so that we don't
have to synchronize all architectures)

 - the aio interface really is horrible crap. Really really.

   For example, the whole "send signal as a completion model" is so
f*cking broken that I really don't want to extend the aio interface
too much. I think it's unfixable.

So I really think we'd be *much* better off with a new interface
entirely - preferably one that allows the old aio interfaces to fall
out fairly naturally.

Ben mentioned lio_listio() as a reason for why he wanted to extend the
AIO interface, but I think it works the other way around: yes, we
should look at lio_listio(), but we should look at it mainly as a way
to ask ourselves: "can we implement a new asynchronous system call
submission model that would also make it possible to implement
lio_listio() as a user space wrapper around it".

For example, if we had an actual _good_ way to queue up things, you
could probably make that "struct sigevent" completion for lio_listio()
just be another asynchronous system call at the end of the list - a
system call that sends the completion signal.  And the aiocb_list[]
itself? Maybe those could just be done as normal (individual) aio
calls (so that you end up having the aiocb that you can wait on with
aio_suspend() etc).

But then people who do *not* want the crazy aiocb, and do *not* want
some SIGIO or whatever, could just fire off asynchronous system calls
without that cruddy interface.

So my argument is really that I think it would be better to at least
look into maybe creating something less crapulent, and striving to
make it easy to make the old legacy interfaces be just wrappers around
a more capable model.

And hey, it may be that in the end nobody cares enough, and the right
thing (or at least the prudent thing) to do is to just pile the crap
on deeper and higher, and just add a single IOCB_CMD_SYSCALL
indirection entry.

So I'm not dismissing that as a solution - I just don't think it's a
particularly clean one.

It does have the advantage of likely being a fairly simple hack. But
it smells like a hack.

                Linus


* Re: [PATCH 07/13] aio: enabled thread based async fsync
  2016-01-12  4:03               ` Linus Torvalds
@ 2016-01-12  4:48                 ` Linus Torvalds
  -1 siblings, 0 replies; 133+ messages in thread
From: Linus Torvalds @ 2016-01-12  4:48 UTC (permalink / raw)
  To: Dave Chinner
  Cc: Benjamin LaHaise, linux-aio, linux-fsdevel,
	Linux Kernel Mailing List, Linux API, linux-mm, Alexander Viro,
	Andrew Morton

On Mon, Jan 11, 2016 at 8:03 PM, Linus Torvalds
<torvalds@linux-foundation.org> wrote:
>
> So my argument is really that I think it would be better to at least
> look into maybe creating something less crapulent, and striving to
> make it easy to make the old legacy interfaces be just wrappers around
> a more capable model.

Hmm. Thinking more about this makes me worry about all the system call
versioning and extra work done by libc.

At least glibc has traditionally decided to munge and extend on kernel
system call interfaces, to the point where even fairly core data
structures (like "struct stat") may not always look the same to the
kernel as they do to user space.

So with that worry, I have to admit that maybe a limited interface -
rather than allowing arbitrary generic async system calls - might have
advantages. Less room for mismatches.

I'll have to think about this some more.

                  Linus


* Re: [PATCH 09/13] aio: add support for async openat()
  2016-01-12  0:22     ` Linus Torvalds
  (?)
@ 2016-01-12  9:53       ` Ingo Molnar
  -1 siblings, 0 replies; 133+ messages in thread
From: Ingo Molnar @ 2016-01-12  9:53 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Benjamin LaHaise, linux-aio, linux-fsdevel,
	Linux Kernel Mailing List, Linux API, linux-mm, Alexander Viro,
	Andrew Morton


* Linus Torvalds <torvalds@linux-foundation.org> wrote:

> What do you think? Do you think it might be possible to aim for a generic "do 
> system call asynchronously" model instead?
> 
> I'm adding Ingo to the cc, because I think Ingo had a "run this list of system 
> calls" patch at one point - in order to avoid system call overhead. I don't 
> think that was very interesting (because system call overhead is seldom all that 
> noticeable for any interesting system calls), but with the "let's do the list 
> asynchronously" addition it might be much more intriguing. Ingo, do I remember 
> correctly that it was you? I might be confused about who wrote that patch, and I 
> can't find it now.

Yeah, it was the whole 'syslets' and 'threadlets' stuff - I had both implemented 
and prototyped into a 'list directory entries asynchronously' testcase.

Threadlets was pretty close to what you are suggesting now. Here's a very good (as 
usual!) writeup from LWN:

  https://lwn.net/Articles/223899/

Thanks,

	Ingo


* Re: [PATCH 07/13] aio: enabled thread based async fsync
  2016-01-12  4:48                 ` Linus Torvalds
  (?)
@ 2016-01-12 22:50                   ` Benjamin LaHaise
  -1 siblings, 0 replies; 133+ messages in thread
From: Benjamin LaHaise @ 2016-01-12 22:50 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Dave Chinner, linux-aio, linux-fsdevel,
	Linux Kernel Mailing List, Linux API, linux-mm, Alexander Viro,
	Andrew Morton

On Mon, Jan 11, 2016 at 08:48:23PM -0800, Linus Torvalds wrote:
> On Mon, Jan 11, 2016 at 8:03 PM, Linus Torvalds
> <torvalds@linux-foundation.org> wrote:
> >
> > So my argument is really that I think it would be better to at least
> > look into maybe creating something less crapulent, and striving to
> > make it easy to make the old legacy interfaces be just wrappers around
> > a more capable model.
> 
> Hmm. Thinking more about this makes me worry about all the system call
> versioning and extra work done by libc.

That is one of my worries, and one of the reasons an async getdents64() 
or readdir() operation isn't in this batch -- there are a ton of ABI 
issues glibc handles on some platforms.

> At least glibc has traditionally decided to munge and extend on kernel
> system call interfaces, to the point where even fairly core data
> structures (like "struct stat") may not always look the same to the
> kernel as they do to user space.
> 
> So with that worry, I have to admit that maybe a limited interface -
> rather than allowing arbitrary generic async system calls - might have
> advantages. Less room for mismatches.
> 
> I'll have to think about this some more.
> 
>                   Linus

I think some cleanups can be made to how and where the AIO operations
are implemented.  A first stab is below (not very well tested yet;
still have more work to do) that uses an array to dispatch AIO submits.
By using function pointers to dispatch the operations fairly early in
the process, the code that actually does the required verifications is
less spread out and much easier to follow than the giant switch
statement.

Another possible improvement might be to move things like aio_fsync()
into sync.c with all the other relevant sync code.  That would make much
more sense and make it much more obvious which subsystem maintainer a
given piece of functionality really belongs to.  If that sounds like an
improvement, I can put some effort into that as well.

		-ben

 aio.c |  242 ++++++++++++++++++++++++++++++++----------------------------------
 1 file changed, 118 insertions(+), 124 deletions(-)

diff --git a/fs/aio.c b/fs/aio.c
index f776dff..0c06e3b 100644
--- a/fs/aio.c
+++ b/fs/aio.c
@@ -177,6 +177,12 @@ typedef long (*aio_thread_work_fn_t)(struct aio_kiocb *iocb);
  */
 #define KIOCB_CANCELLED		((void *) (~0ULL))
 
+#define AIO_THREAD_NEED_TASK	0x0001	/* Need aio_kiocb->ki_submit_task */
+#define AIO_THREAD_NEED_FS	0x0002	/* Need aio_kiocb->ki_fs */
+#define AIO_THREAD_NEED_FILES	0x0004	/* Need aio_kiocb->ki_files */
+#define AIO_THREAD_NEED_CRED	0x0008	/* Need aio_kiocb->ki_cred */
+#define AIO_THREAD_NEED_MM	0x0010	/* Need the mm context */
+
 struct aio_kiocb {
 	struct kiocb		common;
 
@@ -205,6 +211,7 @@ struct aio_kiocb {
 	struct task_struct	*ki_cancel_task;
 	unsigned long		ki_data;
 	unsigned long		ki_rlimit_fsize;
+	unsigned		ki_thread_flags;	/* AIO_THREAD_NEED... */
 	aio_thread_work_fn_t	ki_work_fn;
 	struct work_struct	ki_work;
 	struct fs_struct	*ki_fs;
@@ -231,16 +238,8 @@ static const struct file_operations aio_ring_fops;
 static const struct address_space_operations aio_ctx_aops;
 
 static void aio_complete(struct kiocb *kiocb, long res, long res2);
-ssize_t aio_fsync(struct kiocb *iocb, int datasync);
-long aio_poll(struct aio_kiocb *iocb);
 
 typedef long (*do_foo_at_t)(int fd, const char *filename, int flags, int mode);
-long aio_do_openat(int fd, const char *filename, int flags, int mode);
-long aio_do_unlinkat(int fd, const char *filename, int flags, int mode);
-long aio_foo_at(struct aio_kiocb *req, do_foo_at_t do_foo_at);
-
-long aio_readahead(struct aio_kiocb *iocb, unsigned long len);
-long aio_renameat(struct aio_kiocb *iocb, struct iocb *user_iocb);
 
 static __always_inline bool aio_may_use_threads(void)
 {
@@ -1533,9 +1532,13 @@ static void aio_thread_fn(struct work_struct *work)
 	old_cancel = cmpxchg(&iocb->ki_cancel,
 			     aio_thread_queue_iocb_cancel_early,
 			     aio_thread_queue_iocb_cancel);
-	if (old_cancel != KIOCB_CANCELLED)
+	if (old_cancel != KIOCB_CANCELLED) {
+		if (iocb->ki_thread_flags & AIO_THREAD_NEED_MM)
+			use_mm(iocb->ki_ctx->mm);
 		ret = iocb->ki_work_fn(iocb);
-	else
+		if (iocb->ki_thread_flags & AIO_THREAD_NEED_MM)
+			unuse_mm(iocb->ki_ctx->mm);
+	} else
 		ret = -EINTR;
 
 	current->kiocb = NULL;
@@ -1566,11 +1569,6 @@ static void aio_thread_fn(struct work_struct *work)
 		flush_signals(current);
 }
 
-#define AIO_THREAD_NEED_TASK	0x0001	/* Need aio_kiocb->ki_submit_task */
-#define AIO_THREAD_NEED_FS	0x0002	/* Need aio_kiocb->ki_fs */
-#define AIO_THREAD_NEED_FILES	0x0004	/* Need aio_kiocb->ki_files */
-#define AIO_THREAD_NEED_CRED	0x0008	/* Need aio_kiocb->ki_cred */
-
 /* aio_thread_queue_iocb
  *	Queues an aio_kiocb for dispatch to a worker thread.  Prepares the
  *	aio_kiocb for cancellation.  The caller must provide a function to
@@ -1581,7 +1579,10 @@ static ssize_t aio_thread_queue_iocb(struct aio_kiocb *iocb,
 				     aio_thread_work_fn_t work_fn,
 				     unsigned flags)
 {
+	if (!aio_may_use_threads())
+		return -EINVAL;
 	INIT_WORK(&iocb->ki_work, aio_thread_fn);
+	iocb->ki_thread_flags = flags;
 	iocb->ki_work_fn = work_fn;
 	if (flags & AIO_THREAD_NEED_TASK) {
 		iocb->ki_submit_task = current;
@@ -1618,7 +1619,6 @@ static long aio_thread_op_read_iter(struct aio_kiocb *iocb)
 	struct file *filp;
 	long ret;
 
-	use_mm(iocb->ki_ctx->mm);
 	filp = iocb->common.ki_filp;
 
 	if (filp->f_op->read_iter) {
@@ -1633,7 +1633,6 @@ static long aio_thread_op_read_iter(struct aio_kiocb *iocb)
 					   filp->f_op->read);
 	else
 		ret = -EINVAL;
-	unuse_mm(iocb->ki_ctx->mm);
 	return ret;
 }
 
@@ -1656,7 +1655,7 @@ ssize_t generic_async_read_iter(struct kiocb *iocb, struct iov_iter *iter)
 		return -EINVAL;
 
 	return aio_thread_queue_iocb(req, aio_thread_op_read_iter,
-				     AIO_THREAD_NEED_TASK);
+				     AIO_THREAD_NEED_TASK | AIO_THREAD_NEED_MM);
 }
 EXPORT_SYMBOL(generic_async_read_iter);
 
@@ -1666,7 +1665,6 @@ static long aio_thread_op_write_iter(struct aio_kiocb *iocb)
 	struct file *filp;
 	long ret;
 
-	use_mm(iocb->ki_ctx->mm);
 	filp = iocb->common.ki_filp;
 	saved_rlim_fsize = rlimit(RLIMIT_FSIZE);
 	current->signal->rlim[RLIMIT_FSIZE].rlim_cur = iocb->ki_rlimit_fsize;
@@ -1684,7 +1682,6 @@ static long aio_thread_op_write_iter(struct aio_kiocb *iocb)
 	else
 		ret = -EINVAL;
 	current->signal->rlim[RLIMIT_FSIZE].rlim_cur = saved_rlim_fsize;
-	unuse_mm(iocb->ki_ctx->mm);
 	return ret;
 }
 
@@ -1708,28 +1705,13 @@ ssize_t generic_async_write_iter(struct kiocb *iocb, struct iov_iter *iter)
 	req->ki_rlimit_fsize = rlimit(RLIMIT_FSIZE);
 
 	return aio_thread_queue_iocb(req, aio_thread_op_write_iter,
-				     AIO_THREAD_NEED_TASK);
+				     AIO_THREAD_NEED_TASK | AIO_THREAD_NEED_MM);
 }
 EXPORT_SYMBOL(generic_async_write_iter);
 
 static long aio_thread_op_fsync(struct aio_kiocb *iocb)
 {
-	return vfs_fsync(iocb->common.ki_filp, 0);
-}
-
-static long aio_thread_op_fdatasync(struct aio_kiocb *iocb)
-{
-	return vfs_fsync(iocb->common.ki_filp, 1);
-}
-
-ssize_t aio_fsync(struct kiocb *iocb, int datasync)
-{
-	struct aio_kiocb *req;
-
-	req = container_of(iocb, struct aio_kiocb, common);
-
-	return aio_thread_queue_iocb(req, datasync ? aio_thread_op_fdatasync
-						   : aio_thread_op_fsync, 0);
+	return vfs_fsync(iocb->common.ki_filp, iocb->ki_data);
 }
 
 static long aio_thread_op_poll(struct aio_kiocb *iocb)
@@ -1766,17 +1748,22 @@ static long aio_thread_op_poll(struct aio_kiocb *iocb)
 	return ret;
 }
 
-long aio_poll(struct aio_kiocb *req)
+static long aio_poll(struct aio_kiocb *req, struct iocb *user_iocb, bool compat)
 {
+	if (!req->common.ki_filp->f_op->poll)
+		return -EINVAL;
+	if ((unsigned short)user_iocb->aio_buf != user_iocb->aio_buf)
+		return -EINVAL;
+	req->ki_data = user_iocb->aio_buf;
 	return aio_thread_queue_iocb(req, aio_thread_op_poll, 0);
 }
 
-long aio_do_openat(int fd, const char *filename, int flags, int mode)
+static long aio_do_openat(int fd, const char *filename, int flags, int mode)
 {
 	return do_sys_open(fd, filename, flags, mode);
 }
 
-long aio_do_unlinkat(int fd, const char *filename, int flags, int mode)
+static long aio_do_unlinkat(int fd, const char *filename, int flags, int mode)
 {
 	if (flags || mode)
 		return -EINVAL;
@@ -1789,7 +1776,6 @@ static long aio_thread_op_foo_at(struct aio_kiocb *req)
 	long ret;
 	u32 fd;
 
-	use_mm(req->ki_ctx->mm);
 	if (unlikely(__get_user(fd, &req->ki_user_iocb->aio_fildes)))
 		ret = -EFAULT;
 	else if (unlikely(__get_user(buf, &req->ki_user_iocb->aio_buf)))
@@ -1804,15 +1790,25 @@ static long aio_thread_op_foo_at(struct aio_kiocb *req)
 				(int)offset,
 				(unsigned short)(offset >> 32));
 	}
-	unuse_mm(req->ki_ctx->mm);
 	return ret;
 }
 
-long aio_foo_at(struct aio_kiocb *req, do_foo_at_t do_foo_at)
+static long aio_openat(struct aio_kiocb *req, struct iocb *uiocb, bool compat)
 {
-	req->ki_data = (unsigned long)(void *)do_foo_at;
+	req->ki_data = (unsigned long)(void *)aio_do_openat;
 	return aio_thread_queue_iocb(req, aio_thread_op_foo_at,
 				     AIO_THREAD_NEED_TASK |
+				     AIO_THREAD_NEED_MM |
+				     AIO_THREAD_NEED_FILES |
+				     AIO_THREAD_NEED_CRED);
+}
+
+static long aio_unlink(struct aio_kiocb *req, struct iocb *uiocb, bool compat)
+{
+	req->ki_data = (unsigned long)(void *)aio_do_unlinkat;
+	return aio_thread_queue_iocb(req, aio_thread_op_foo_at,
+				     AIO_THREAD_NEED_TASK |
+				     AIO_THREAD_NEED_MM |
 				     AIO_THREAD_NEED_FILES |
 				     AIO_THREAD_NEED_CRED);
 }
@@ -1898,17 +1894,23 @@ static long aio_thread_op_readahead(struct aio_kiocb *iocb)
 	return 0;
 }
 
-long aio_readahead(struct aio_kiocb *iocb, unsigned long len)
+static long aio_ra(struct aio_kiocb *iocb, struct iocb *uiocb, bool compat)
 {
 	struct address_space *mapping = iocb->common.ki_filp->f_mapping;
 	pgoff_t index, end;
 	loff_t epos, isize;
 	int do_io = 0;
+	size_t len;
 
+	if (!aio_may_use_threads())
+		return -EINVAL;
+	if (uiocb->aio_buf)
+		return -EINVAL;
 	if (!mapping || !mapping->a_ops)
 		return -EBADF;
 	if (!mapping->a_ops->readpage && !mapping->a_ops->readpages)
 		return -EBADF;
+	len = uiocb->aio_nbytes;
 	if (!len)
 		return 0;
 
@@ -1958,7 +1960,6 @@ static long aio_thread_op_renameat(struct aio_kiocb *iocb)
 	unsigned flags;
 	long ret;
 
-	use_mm(aio_get_mm(&iocb->common));
 	if (unlikely(copy_from_user(&info, user_info, sizeof(info)))) {
 		ret = -EFAULT;
 		goto done;
@@ -1979,39 +1980,47 @@ static long aio_thread_op_renameat(struct aio_kiocb *iocb)
 	else
 		ret = sys_renameat2(olddir, old, newdir, new, flags);
 done:
-	unuse_mm(aio_get_mm(&iocb->common));
 	return ret;
 }
 
-long aio_renameat(struct aio_kiocb *iocb, struct iocb *user_iocb)
+static long aio_rename(struct aio_kiocb *iocb, struct iocb *user_iocb, bool c)
 {
-	const void * __user user_info;
-
 	if (user_iocb->aio_nbytes != sizeof(struct renameat_info))
 		return -EINVAL;
 	if (user_iocb->aio_offset)
 		return -EINVAL;
 
-	user_info = (const void * __user)(long)user_iocb->aio_buf;
-	if (unlikely(!access_ok(VERIFY_READ, user_info,
-				sizeof(struct renameat_info))))
-		return -EFAULT;
-
-	iocb->common.private = (void *)user_info;
+	iocb->common.private = (void *)(long)user_iocb->aio_buf;
 	return aio_thread_queue_iocb(iocb, aio_thread_op_renameat,
 				     AIO_THREAD_NEED_TASK |
+				     AIO_THREAD_NEED_MM |
 				     AIO_THREAD_NEED_FS |
 				     AIO_THREAD_NEED_FILES |
 				     AIO_THREAD_NEED_CRED);
 }
 #endif /* IS_ENABLED(CONFIG_AIO_THREAD) */
 
+long aio_fsync(struct aio_kiocb *req, struct iocb *user_iocb, bool compat)
+{
+	bool datasync = (user_iocb->aio_lio_opcode == IOCB_CMD_FDSYNC);
+	struct file *file = req->common.ki_filp;
+
+	if (file->f_op->aio_fsync)
+		return file->f_op->aio_fsync(&req->common, datasync);
+#if IS_ENABLED(CONFIG_AIO_THREAD)
+	if (file->f_op->fsync) {
+		req->ki_data = datasync;
+		return aio_thread_queue_iocb(req, aio_thread_op_fsync, 0);
+	}
+#endif
+	return -EINVAL;
+}
+
 /*
- * aio_run_iocb:
- *	Performs the initial checks and io submission.
+ * aio_rw:
+ *	Implements vectored and non-vectored reads and writes.
  */
-static ssize_t aio_run_iocb(struct aio_kiocb *req, struct iocb *user_iocb,
-			    bool compat)
+static long aio_rw(struct aio_kiocb *req, struct iocb *user_iocb, bool compat)
 {
 	struct file *file = req->common.ki_filp;
 	ssize_t ret = -EINVAL;
@@ -2085,70 +2094,42 @@ rw_common:
 			file_end_write(file);
 		break;
 
-	case IOCB_CMD_FDSYNC:
-		if (file->f_op->aio_fsync)
-			ret = file->f_op->aio_fsync(&req->common, 1);
-		else if (file->f_op->fsync && (aio_may_use_threads()))
-			ret = aio_fsync(&req->common, 1);
-		break;
-
-	case IOCB_CMD_FSYNC:
-		if (file->f_op->aio_fsync)
-			ret = file->f_op->aio_fsync(&req->common, 0);
-		else if (file->f_op->fsync && (aio_may_use_threads()))
-			ret = aio_fsync(&req->common, 0);
-		break;
-
-	case IOCB_CMD_POLL:
-		if (aio_may_use_threads())
-			ret = aio_poll(req);
-		break;
-
-	case IOCB_CMD_OPENAT:
-		if (aio_may_use_threads())
-			ret = aio_foo_at(req, aio_do_openat);
-		break;
-
-	case IOCB_CMD_UNLINKAT:
-		if (aio_may_use_threads())
-			ret = aio_foo_at(req, aio_do_unlinkat);
-		break;
-
-	case IOCB_CMD_READAHEAD:
-		if (user_iocb->aio_buf)
-			return -EINVAL;
-		if (aio_may_use_threads())
-			ret = aio_readahead(req, user_iocb->aio_nbytes);
-		break;
-
-	case IOCB_CMD_RENAMEAT:
-		if (aio_may_use_threads())
-			ret = aio_renameat(req, user_iocb);
-		break;
-
 	default:
 		pr_debug("EINVAL: no operation provided\n");
-		return -EINVAL;
 	}
+	return ret;
+}
 
-	if (ret != -EIOCBQUEUED) {
-		/*
-		 * There's no easy way to restart the syscall since other AIO's
-		 * may be already running. Just fail this IO with EINTR.
-		 */
-		if (unlikely(ret == -ERESTARTSYS || ret == -ERESTARTNOINTR ||
-			     ret == -ERESTARTNOHAND ||
-			     ret == -ERESTART_RESTARTBLOCK))
-			ret = -EINTR;
-		aio_complete(&req->common, ret, 0);
-	}
+typedef long (*aio_submit_fn_t)(struct aio_kiocb *req, struct iocb *iocb,
+				bool compat);
 
-	return 0;
-}
+#define NEED_FD			0x0001
+
+struct submit_info {
+	aio_submit_fn_t		fn;
+	unsigned long		flags;
+};
+
+static const struct submit_info aio_submit_info[] = {
+	[IOCB_CMD_PREAD]	= { aio_rw,	NEED_FD },
+	[IOCB_CMD_PWRITE]	= { aio_rw,	NEED_FD },
+	[IOCB_CMD_PREADV]	= { aio_rw,	NEED_FD },
+	[IOCB_CMD_PWRITEV]	= { aio_rw,	NEED_FD },
+	[IOCB_CMD_FSYNC]	= { aio_fsync,	NEED_FD },
+	[IOCB_CMD_FDSYNC]	= { aio_fsync,	NEED_FD },
+#if IS_ENABLED(CONFIG_AIO_THREAD)
+	[IOCB_CMD_POLL]		= { aio_poll,	NEED_FD },
+	[IOCB_CMD_OPENAT]	= { aio_openat,	0 },
+	[IOCB_CMD_UNLINKAT]	= { aio_unlink,	0 },
+	[IOCB_CMD_READAHEAD]	= { aio_ra,	NEED_FD },
+	[IOCB_CMD_RENAMEAT]	= { aio_rename,	0 },
+#endif
+};
 
 static int io_submit_one(struct kioctx *ctx, struct iocb __user *user_iocb,
 			 struct iocb *iocb, bool compat)
 {
+	const struct submit_info *submit_info;
 	struct aio_kiocb *req;
 	ssize_t ret;
 
@@ -2168,23 +2149,26 @@ static int io_submit_one(struct kioctx *ctx, struct iocb __user *user_iocb,
 		return -EINVAL;
 	}
 
+	if (unlikely(iocb->aio_lio_opcode >= ARRAY_SIZE(aio_submit_info)))
+		return -EINVAL;
+	submit_info = &aio_submit_info[iocb->aio_lio_opcode];
+	if (unlikely(!submit_info->fn))
+		return -EINVAL;
+
 	req = aio_get_req(ctx);
 	if (unlikely(!req))
 		return -EAGAIN;
 
-	if (iocb->aio_lio_opcode == IOCB_CMD_OPENAT)
-		req->common.ki_filp = NULL;
-	else {
+	if (submit_info->flags & NEED_FD) {
 		req->common.ki_filp = fget(iocb->aio_fildes);
 		if (unlikely(!req->common.ki_filp)) {
 			ret = -EBADF;
 			goto out_put_req;
 		}
+		req->common.ki_flags = iocb_flags(req->common.ki_filp);
 	}
 	req->common.ki_pos = iocb->aio_offset;
 	req->common.ki_complete = aio_complete;
-	if (req->common.ki_filp)
-		req->common.ki_flags = iocb_flags(req->common.ki_filp);
 
 	if (iocb->aio_flags & IOCB_FLAG_RESFD) {
 		/*
@@ -2212,10 +2196,20 @@ static int io_submit_one(struct kioctx *ctx, struct iocb __user *user_iocb,
 	req->ki_user_iocb = user_iocb;
 	req->ki_user_data = iocb->aio_data;
 
-	ret = aio_run_iocb(req, iocb, compat);
-	if (ret)
-		goto out_put_req;
-
+	ret = submit_info->fn(req, iocb, compat);
+	if (ret != -EIOCBQUEUED) {
+		/*
+		 * There's no easy way to restart the syscall since other AIO's
+		 * may be already running. Just fail this IO with EINTR.
+		 */
+		if (unlikely(ret == -ERESTARTSYS || ret == -ERESTARTNOINTR ||
+			     ret == -ERESTARTNOHAND ||
+			     ret == -ERESTART_RESTARTBLOCK))
+			ret = -EINTR;
+		else if (IS_ERR_VALUE(ret))
+			goto out_put_req;
+		aio_complete(&req->common, ret, 0);
+	}
 	return 0;
 out_put_req:
 	put_reqs_available(ctx, 1);
-- 
"Thought is the essence of where you are now."

^ permalink raw reply related	[flat|nested] 133+ messages in thread

* Re: [PATCH 07/13] aio: enabled thread based async fsync
@ 2016-01-12 22:50                   ` Benjamin LaHaise
  0 siblings, 0 replies; 133+ messages in thread
From: Benjamin LaHaise @ 2016-01-12 22:50 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Dave Chinner, linux-aio, linux-fsdevel,
	Linux Kernel Mailing List, Linux API, linux-mm, Alexander Viro,
	Andrew Morton

On Mon, Jan 11, 2016 at 08:48:23PM -0800, Linus Torvalds wrote:
> On Mon, Jan 11, 2016 at 8:03 PM, Linus Torvalds
> <torvalds@linux-foundation.org> wrote:
> >
> > So my argument is really that I think it would be better to at least
> > look into maybe creating something less crapulent, and striving to
> > make it easy to make the old legacy interfaces be just wrappers around
> > a more capable model.
> 
> Hmm. Thinking more about this makes me worry about all the system call
> versioning and extra work done by libc.

That is one of my worries, and one of the reasons an async getdents64() 
or readdir() operation isn't in this batch -- there are a ton of ABI 
issues glibc handles on some platforms.

> At least glibc has traditionally decided to munge and extend on kernel
> system call interfaces, to the point where even fairly core data
> structures (like "struct stat") may not always look the same to the
> kernel as they do to user space.
> 
> So with that worry, I have to admit that maybe a limited interface -
> rather than allowing arbitrary generic async system calls - might have
> advantages. Less room for mismatches.
> 
> I'll have to think about this some more.
> 
>                   Linus

I think some cleanups can be made to how and where the AIO operations
are implemented.  A first stab is below (not very tested as of yet;
there is still more work to do) that uses an array to dispatch AIO
submissions.  By using function pointers to dispatch the operations
fairly early in the process, the code that actually does the required
verifications is less spread out and much easier to follow than the
giant switch statement it replaces.
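
The table-driven dispatch pattern the patch introduces can be sketched in
plain C.  All names below (my_req, submit_one, ...) are invented for
illustration; the kernel code uses aio_submit_info[] and io_submit_one()
instead:

```c
/*
 * Minimal sketch of table-driven submit dispatch, assuming invented
 * names; not the kernel's actual types or functions.
 */
#include <assert.h>
#include <stddef.h>

enum my_opcode { OP_READ, OP_WRITE, OP_FSYNC, OP_MAX };

struct my_req {
	int  opcode;
	long arg;
};

typedef long (*submit_fn_t)(struct my_req *req);

static long do_rw(struct my_req *req)    { return req->arg * 2; }
static long do_fsync(struct my_req *req) { (void)req; return 0; }

#define NEED_FD	0x0001	/* analogous to the patch's NEED_FD flag */

struct submit_entry {
	submit_fn_t	fn;
	unsigned long	flags;
};

/* Designated initializers: opcodes without an entry stay { NULL, 0 }. */
static const struct submit_entry submit_table[OP_MAX] = {
	[OP_READ]  = { do_rw,    NEED_FD },
	[OP_WRITE] = { do_rw,    NEED_FD },
	[OP_FSYNC] = { do_fsync, NEED_FD },
};

static long submit_one(struct my_req *req)
{
	const struct submit_entry *info;

	/* Range check first, then reject holes in the table; both checks
	 * happen before any per-request state would be allocated. */
	if (req->opcode < 0 || req->opcode >= OP_MAX)
		return -1;
	info = &submit_table[req->opcode];
	if (!info->fn)
		return -1;
	return info->fn(req);
}
```

The point of the lookup is that unknown or unimplemented opcodes fall out
of range or hit a NULL slot and are rejected up front, before a request
is set up, rather than deep inside a switch statement.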

Another possible improvement might be to move things like aio_fsync()
into sync.c with all the other relevant sync code.  That would make much
more sense and make it much more obvious which subsystem maintainers a
given piece of functionality really belongs to.  If that sounds like an
improvement, I can put some effort into that as well.

		-ben

 aio.c |  242 ++++++++++++++++++++++++++++++++----------------------------------
 1 file changed, 118 insertions(+), 124 deletions(-)

diff --git a/fs/aio.c b/fs/aio.c
index f776dff..0c06e3b 100644
--- a/fs/aio.c
+++ b/fs/aio.c
@@ -177,6 +177,12 @@ typedef long (*aio_thread_work_fn_t)(struct aio_kiocb *iocb);
  */
 #define KIOCB_CANCELLED		((void *) (~0ULL))
 
+#define AIO_THREAD_NEED_TASK	0x0001	/* Need aio_kiocb->ki_submit_task */
+#define AIO_THREAD_NEED_FS	0x0002	/* Need aio_kiocb->ki_fs */
+#define AIO_THREAD_NEED_FILES	0x0004	/* Need aio_kiocb->ki_files */
+#define AIO_THREAD_NEED_CRED	0x0008	/* Need aio_kiocb->ki_cred */
+#define AIO_THREAD_NEED_MM	0x0010	/* Need the mm context */
+
 struct aio_kiocb {
 	struct kiocb		common;
 
@@ -205,6 +211,7 @@ struct aio_kiocb {
 	struct task_struct	*ki_cancel_task;
 	unsigned long		ki_data;
 	unsigned long		ki_rlimit_fsize;
+	unsigned		ki_thread_flags;	/* AIO_THREAD_NEED... */
 	aio_thread_work_fn_t	ki_work_fn;
 	struct work_struct	ki_work;
 	struct fs_struct	*ki_fs;
@@ -231,16 +238,8 @@ static const struct file_operations aio_ring_fops;
 static const struct address_space_operations aio_ctx_aops;
 
 static void aio_complete(struct kiocb *kiocb, long res, long res2);
-ssize_t aio_fsync(struct kiocb *iocb, int datasync);
-long aio_poll(struct aio_kiocb *iocb);
 
 typedef long (*do_foo_at_t)(int fd, const char *filename, int flags, int mode);
-long aio_do_openat(int fd, const char *filename, int flags, int mode);
-long aio_do_unlinkat(int fd, const char *filename, int flags, int mode);
-long aio_foo_at(struct aio_kiocb *req, do_foo_at_t do_foo_at);
-
-long aio_readahead(struct aio_kiocb *iocb, unsigned long len);
-long aio_renameat(struct aio_kiocb *iocb, struct iocb *user_iocb);
 
 static __always_inline bool aio_may_use_threads(void)
 {
@@ -1533,9 +1532,13 @@ static void aio_thread_fn(struct work_struct *work)
 	old_cancel = cmpxchg(&iocb->ki_cancel,
 			     aio_thread_queue_iocb_cancel_early,
 			     aio_thread_queue_iocb_cancel);
-	if (old_cancel != KIOCB_CANCELLED)
+	if (old_cancel != KIOCB_CANCELLED) {
+		if (iocb->ki_thread_flags & AIO_THREAD_NEED_MM)
+			use_mm(iocb->ki_ctx->mm);
 		ret = iocb->ki_work_fn(iocb);
-	else
+		if (iocb->ki_thread_flags & AIO_THREAD_NEED_MM)
+			unuse_mm(iocb->ki_ctx->mm);
+	} else
 		ret = -EINTR;
 
 	current->kiocb = NULL;
@@ -1566,11 +1569,6 @@ static void aio_thread_fn(struct work_struct *work)
 		flush_signals(current);
 }
 
-#define AIO_THREAD_NEED_TASK	0x0001	/* Need aio_kiocb->ki_submit_task */
-#define AIO_THREAD_NEED_FS	0x0002	/* Need aio_kiocb->ki_fs */
-#define AIO_THREAD_NEED_FILES	0x0004	/* Need aio_kiocb->ki_files */
-#define AIO_THREAD_NEED_CRED	0x0008	/* Need aio_kiocb->ki_cred */
-
 /* aio_thread_queue_iocb
  *	Queues an aio_kiocb for dispatch to a worker thread.  Prepares the
  *	aio_kiocb for cancellation.  The caller must provide a function to
@@ -1581,7 +1579,10 @@ static ssize_t aio_thread_queue_iocb(struct aio_kiocb *iocb,
 				     aio_thread_work_fn_t work_fn,
 				     unsigned flags)
 {
+	if (!aio_may_use_threads())
+		return -EINVAL;
 	INIT_WORK(&iocb->ki_work, aio_thread_fn);
+	iocb->ki_thread_flags = flags;
 	iocb->ki_work_fn = work_fn;
 	if (flags & AIO_THREAD_NEED_TASK) {
 		iocb->ki_submit_task = current;
@@ -1618,7 +1619,6 @@ static long aio_thread_op_read_iter(struct aio_kiocb *iocb)
 	struct file *filp;
 	long ret;
 
-	use_mm(iocb->ki_ctx->mm);
 	filp = iocb->common.ki_filp;
 
 	if (filp->f_op->read_iter) {
@@ -1633,7 +1633,6 @@ static long aio_thread_op_read_iter(struct aio_kiocb *iocb)
 					   filp->f_op->read);
 	else
 		ret = -EINVAL;
-	unuse_mm(iocb->ki_ctx->mm);
 	return ret;
 }
 
@@ -1656,7 +1655,7 @@ ssize_t generic_async_read_iter(struct kiocb *iocb, struct iov_iter *iter)
 		return -EINVAL;
 
 	return aio_thread_queue_iocb(req, aio_thread_op_read_iter,
-				     AIO_THREAD_NEED_TASK);
+				     AIO_THREAD_NEED_TASK | AIO_THREAD_NEED_MM);
 }
 EXPORT_SYMBOL(generic_async_read_iter);
 
@@ -1666,7 +1665,6 @@ static long aio_thread_op_write_iter(struct aio_kiocb *iocb)
 	struct file *filp;
 	long ret;
 
-	use_mm(iocb->ki_ctx->mm);
 	filp = iocb->common.ki_filp;
 	saved_rlim_fsize = rlimit(RLIMIT_FSIZE);
 	current->signal->rlim[RLIMIT_FSIZE].rlim_cur = iocb->ki_rlimit_fsize;
@@ -1684,7 +1682,6 @@ static long aio_thread_op_write_iter(struct aio_kiocb *iocb)
 	else
 		ret = -EINVAL;
 	current->signal->rlim[RLIMIT_FSIZE].rlim_cur = saved_rlim_fsize;
-	unuse_mm(iocb->ki_ctx->mm);
 	return ret;
 }
 
@@ -1708,28 +1705,13 @@ ssize_t generic_async_write_iter(struct kiocb *iocb, struct iov_iter *iter)
 	req->ki_rlimit_fsize = rlimit(RLIMIT_FSIZE);
 
 	return aio_thread_queue_iocb(req, aio_thread_op_write_iter,
-				     AIO_THREAD_NEED_TASK);
+				     AIO_THREAD_NEED_TASK | AIO_THREAD_NEED_MM);
 }
 EXPORT_SYMBOL(generic_async_write_iter);
 
 static long aio_thread_op_fsync(struct aio_kiocb *iocb)
 {
-	return vfs_fsync(iocb->common.ki_filp, 0);
-}
-
-static long aio_thread_op_fdatasync(struct aio_kiocb *iocb)
-{
-	return vfs_fsync(iocb->common.ki_filp, 1);
-}
-
-ssize_t aio_fsync(struct kiocb *iocb, int datasync)
-{
-	struct aio_kiocb *req;
-
-	req = container_of(iocb, struct aio_kiocb, common);
-
-	return aio_thread_queue_iocb(req, datasync ? aio_thread_op_fdatasync
-						   : aio_thread_op_fsync, 0);
+	return vfs_fsync(iocb->common.ki_filp, iocb->ki_data);
 }
 
 static long aio_thread_op_poll(struct aio_kiocb *iocb)
@@ -1766,17 +1748,22 @@ static long aio_thread_op_poll(struct aio_kiocb *iocb)
 	return ret;
 }
 
-long aio_poll(struct aio_kiocb *req)
+static long aio_poll(struct aio_kiocb *req, struct iocb *user_iocb, bool compat)
 {
+	if (!req->common.ki_filp->f_op->poll)
+		return -EINVAL;
+	if ((unsigned short)user_iocb->aio_buf != user_iocb->aio_buf)
+		return -EINVAL;
+	req->ki_data = user_iocb->aio_buf;
 	return aio_thread_queue_iocb(req, aio_thread_op_poll, 0);
 }
 
-long aio_do_openat(int fd, const char *filename, int flags, int mode)
+static long aio_do_openat(int fd, const char *filename, int flags, int mode)
 {
 	return do_sys_open(fd, filename, flags, mode);
 }
 
-long aio_do_unlinkat(int fd, const char *filename, int flags, int mode)
+static long aio_do_unlinkat(int fd, const char *filename, int flags, int mode)
 {
 	if (flags || mode)
 		return -EINVAL;
@@ -1789,7 +1776,6 @@ static long aio_thread_op_foo_at(struct aio_kiocb *req)
 	long ret;
 	u32 fd;
 
-	use_mm(req->ki_ctx->mm);
 	if (unlikely(__get_user(fd, &req->ki_user_iocb->aio_fildes)))
 		ret = -EFAULT;
 	else if (unlikely(__get_user(buf, &req->ki_user_iocb->aio_buf)))
@@ -1804,15 +1790,25 @@ static long aio_thread_op_foo_at(struct aio_kiocb *req)
 				(int)offset,
 				(unsigned short)(offset >> 32));
 	}
-	unuse_mm(req->ki_ctx->mm);
 	return ret;
 }
 
-long aio_foo_at(struct aio_kiocb *req, do_foo_at_t do_foo_at)
+static long aio_openat(struct aio_kiocb *req, struct iocb *uiocb, bool compat)
 {
-	req->ki_data = (unsigned long)(void *)do_foo_at;
+	req->ki_data = (unsigned long)(void *)aio_do_openat;
 	return aio_thread_queue_iocb(req, aio_thread_op_foo_at,
 				     AIO_THREAD_NEED_TASK |
+				     AIO_THREAD_NEED_MM |
+				     AIO_THREAD_NEED_FILES |
+				     AIO_THREAD_NEED_CRED);
+}
+
+static long aio_unlink(struct aio_kiocb *req, struct iocb *uiocb, bool compat)
+{
+	req->ki_data = (unsigned long)(void *)aio_do_unlinkat;
+	return aio_thread_queue_iocb(req, aio_thread_op_foo_at,
+				     AIO_THREAD_NEED_TASK |
+				     AIO_THREAD_NEED_MM |
 				     AIO_THREAD_NEED_FILES |
 				     AIO_THREAD_NEED_CRED);
 }
@@ -1898,17 +1894,23 @@ static long aio_thread_op_readahead(struct aio_kiocb *iocb)
 	return 0;
 }
 
-long aio_readahead(struct aio_kiocb *iocb, unsigned long len)
+static long aio_ra(struct aio_kiocb *iocb, struct iocb *uiocb, bool compat)
 {
 	struct address_space *mapping = iocb->common.ki_filp->f_mapping;
 	pgoff_t index, end;
 	loff_t epos, isize;
 	int do_io = 0;
+	size_t len;
 
+	if (!aio_may_use_threads())
+		return -EINVAL;
+	if (uiocb->aio_buf)
+		return -EINVAL;
 	if (!mapping || !mapping->a_ops)
 		return -EBADF;
 	if (!mapping->a_ops->readpage && !mapping->a_ops->readpages)
 		return -EBADF;
+	len = uiocb->aio_nbytes;
 	if (!len)
 		return 0;
 
@@ -1958,7 +1960,6 @@ static long aio_thread_op_renameat(struct aio_kiocb *iocb)
 	unsigned flags;
 	long ret;
 
-	use_mm(aio_get_mm(&iocb->common));
 	if (unlikely(copy_from_user(&info, user_info, sizeof(info)))) {
 		ret = -EFAULT;
 		goto done;
@@ -1979,39 +1980,47 @@ static long aio_thread_op_renameat(struct aio_kiocb *iocb)
 	else
 		ret = sys_renameat2(olddir, old, newdir, new, flags);
 done:
-	unuse_mm(aio_get_mm(&iocb->common));
 	return ret;
 }
 
-long aio_renameat(struct aio_kiocb *iocb, struct iocb *user_iocb)
+static long aio_rename(struct aio_kiocb *iocb, struct iocb *user_iocb, bool c)
 {
-	const void * __user user_info;
-
 	if (user_iocb->aio_nbytes != sizeof(struct renameat_info))
 		return -EINVAL;
 	if (user_iocb->aio_offset)
 		return -EINVAL;
 
-	user_info = (const void * __user)(long)user_iocb->aio_buf;
-	if (unlikely(!access_ok(VERIFY_READ, user_info,
-				sizeof(struct renameat_info))))
-		return -EFAULT;
-
-	iocb->common.private = (void *)user_info;
+	iocb->common.private = (void *)(long)user_iocb->aio_buf;
 	return aio_thread_queue_iocb(iocb, aio_thread_op_renameat,
 				     AIO_THREAD_NEED_TASK |
+				     AIO_THREAD_NEED_MM |
 				     AIO_THREAD_NEED_FS |
 				     AIO_THREAD_NEED_FILES |
 				     AIO_THREAD_NEED_CRED);
 }
 #endif /* IS_ENABLED(CONFIG_AIO_THREAD) */
 
+long aio_fsync(struct aio_kiocb *req, struct iocb *user_iocb, bool compat)
+{
+	bool datasync = (user_iocb->aio_lio_opcode == IOCB_CMD_FDSYNC);
+	struct file *file = req->common.ki_filp;
+
+	if (file->f_op->aio_fsync)
+		return file->f_op->aio_fsync(&req->common, datasync);
+#if IS_ENABLED(CONFIG_AIO_THREAD)
+	if (file->f_op->fsync) {
+		req->ki_data = datasync;
+		return aio_thread_queue_iocb(req, aio_thread_op_fsync, 0);
+	}
+#endif
+	return -EINVAL;
+}
+
 /*
- * aio_run_iocb:
- *	Performs the initial checks and io submission.
+ * aio_rw:
+ *	Implements vectored and non-vectored reads and writes.
  */
-static ssize_t aio_run_iocb(struct aio_kiocb *req, struct iocb *user_iocb,
-			    bool compat)
+static long aio_rw(struct aio_kiocb *req, struct iocb *user_iocb, bool compat)
 {
 	struct file *file = req->common.ki_filp;
 	ssize_t ret = -EINVAL;
@@ -2085,70 +2094,42 @@ rw_common:
 			file_end_write(file);
 		break;
 
-	case IOCB_CMD_FDSYNC:
-		if (file->f_op->aio_fsync)
-			ret = file->f_op->aio_fsync(&req->common, 1);
-		else if (file->f_op->fsync && (aio_may_use_threads()))
-			ret = aio_fsync(&req->common, 1);
-		break;
-
-	case IOCB_CMD_FSYNC:
-		if (file->f_op->aio_fsync)
-			ret = file->f_op->aio_fsync(&req->common, 0);
-		else if (file->f_op->fsync && (aio_may_use_threads()))
-			ret = aio_fsync(&req->common, 0);
-		break;
-
-	case IOCB_CMD_POLL:
-		if (aio_may_use_threads())
-			ret = aio_poll(req);
-		break;
-
-	case IOCB_CMD_OPENAT:
-		if (aio_may_use_threads())
-			ret = aio_foo_at(req, aio_do_openat);
-		break;
-
-	case IOCB_CMD_UNLINKAT:
-		if (aio_may_use_threads())
-			ret = aio_foo_at(req, aio_do_unlinkat);
-		break;
-
-	case IOCB_CMD_READAHEAD:
-		if (user_iocb->aio_buf)
-			return -EINVAL;
-		if (aio_may_use_threads())
-			ret = aio_readahead(req, user_iocb->aio_nbytes);
-		break;
-
-	case IOCB_CMD_RENAMEAT:
-		if (aio_may_use_threads())
-			ret = aio_renameat(req, user_iocb);
-		break;
-
 	default:
 		pr_debug("EINVAL: no operation provided\n");
-		return -EINVAL;
 	}
+	return ret;
+}
 
-	if (ret != -EIOCBQUEUED) {
-		/*
-		 * There's no easy way to restart the syscall since other AIO's
-		 * may be already running. Just fail this IO with EINTR.
-		 */
-		if (unlikely(ret == -ERESTARTSYS || ret == -ERESTARTNOINTR ||
-			     ret == -ERESTARTNOHAND ||
-			     ret == -ERESTART_RESTARTBLOCK))
-			ret = -EINTR;
-		aio_complete(&req->common, ret, 0);
-	}
+typedef long (*aio_submit_fn_t)(struct aio_kiocb *req, struct iocb *iocb,
+				bool compat);
 
-	return 0;
-}
+#define NEED_FD			0x0001
+
+struct submit_info {
+	aio_submit_fn_t		fn;
+	unsigned long		flags;
+};
+
+static const struct submit_info aio_submit_info[] = {
+	[IOCB_CMD_PREAD]	= { aio_rw,	NEED_FD },
+	[IOCB_CMD_PWRITE]	= { aio_rw,	NEED_FD },
+	[IOCB_CMD_PREADV]	= { aio_rw,	NEED_FD },
+	[IOCB_CMD_PWRITEV]	= { aio_rw,	NEED_FD },
+	[IOCB_CMD_FSYNC]	= { aio_fsync,	NEED_FD },
+	[IOCB_CMD_FDSYNC]	= { aio_fsync,	NEED_FD },
+#if IS_ENABLED(CONFIG_AIO_THREAD)
+	[IOCB_CMD_POLL]		= { aio_poll,	NEED_FD },
+	[IOCB_CMD_OPENAT]	= { aio_openat,	0 },
+	[IOCB_CMD_UNLINKAT]	= { aio_unlink,	0 },
+	[IOCB_CMD_READAHEAD]	= { aio_ra,	NEED_FD },
+	[IOCB_CMD_RENAMEAT]	= { aio_rename,	0 },
+#endif
+};
 
 static int io_submit_one(struct kioctx *ctx, struct iocb __user *user_iocb,
 			 struct iocb *iocb, bool compat)
 {
+	const struct submit_info *submit_info;
 	struct aio_kiocb *req;
 	ssize_t ret;
 
@@ -2168,23 +2149,26 @@ static int io_submit_one(struct kioctx *ctx, struct iocb __user *user_iocb,
 		return -EINVAL;
 	}
 
+	if (unlikely(iocb->aio_lio_opcode >= ARRAY_SIZE(aio_submit_info)))
+		return -EINVAL;
+	submit_info = &aio_submit_info[iocb->aio_lio_opcode];
+	if (unlikely(!submit_info->fn))
+		return -EINVAL;
+
 	req = aio_get_req(ctx);
 	if (unlikely(!req))
 		return -EAGAIN;
 
-	if (iocb->aio_lio_opcode == IOCB_CMD_OPENAT)
-		req->common.ki_filp = NULL;
-	else {
+	if (submit_info->flags & NEED_FD) {
 		req->common.ki_filp = fget(iocb->aio_fildes);
 		if (unlikely(!req->common.ki_filp)) {
 			ret = -EBADF;
 			goto out_put_req;
 		}
+		req->common.ki_flags = iocb_flags(req->common.ki_filp);
 	}
 	req->common.ki_pos = iocb->aio_offset;
 	req->common.ki_complete = aio_complete;
-	if (req->common.ki_filp)
-		req->common.ki_flags = iocb_flags(req->common.ki_filp);
 
 	if (iocb->aio_flags & IOCB_FLAG_RESFD) {
 		/*
@@ -2212,10 +2196,20 @@ static int io_submit_one(struct kioctx *ctx, struct iocb __user *user_iocb,
 	req->ki_user_iocb = user_iocb;
 	req->ki_user_data = iocb->aio_data;
 
-	ret = aio_run_iocb(req, iocb, compat);
-	if (ret)
-		goto out_put_req;
-
+	ret = submit_info->fn(req, iocb, compat);
+	if (ret != -EIOCBQUEUED) {
+		/*
+		 * There's no easy way to restart the syscall since other AIO's
+		 * may be already running. Just fail this IO with EINTR.
+		 */
+		if (unlikely(ret == -ERESTARTSYS || ret == -ERESTARTNOINTR ||
+			     ret == -ERESTARTNOHAND ||
+			     ret == -ERESTART_RESTARTBLOCK))
+			ret = -EINTR;
+		else if (IS_ERR_VALUE(ret))
+			goto out_put_req;
+		aio_complete(&req->common, ret, 0);
+	}
 	return 0;
 out_put_req:
 	put_reqs_available(ctx, 1);
-- 
"Thought is the essence of where you are now."


^ permalink raw reply related	[flat|nested] 133+ messages in thread

-}
-
-ssize_t aio_fsync(struct kiocb *iocb, int datasync)
-{
-	struct aio_kiocb *req;
-
-	req = container_of(iocb, struct aio_kiocb, common);
-
-	return aio_thread_queue_iocb(req, datasync ? aio_thread_op_fdatasync
-						   : aio_thread_op_fsync, 0);
+	return vfs_fsync(iocb->common.ki_filp, iocb->ki_data);
 }
 
 static long aio_thread_op_poll(struct aio_kiocb *iocb)
@@ -1766,17 +1748,22 @@ static long aio_thread_op_poll(struct aio_kiocb *iocb)
 	return ret;
 }
 
-long aio_poll(struct aio_kiocb *req)
+static long aio_poll(struct aio_kiocb *req, struct iocb *user_iocb, bool compat)
 {
+	if (!req->common.ki_filp->f_op->poll)
+		return -EINVAL;
+	if ((unsigned short)user_iocb->aio_buf != user_iocb->aio_buf)
+		return -EINVAL;
+	req->ki_data = user_iocb->aio_buf;
 	return aio_thread_queue_iocb(req, aio_thread_op_poll, 0);
 }
 
-long aio_do_openat(int fd, const char *filename, int flags, int mode)
+static long aio_do_openat(int fd, const char *filename, int flags, int mode)
 {
 	return do_sys_open(fd, filename, flags, mode);
 }
 
-long aio_do_unlinkat(int fd, const char *filename, int flags, int mode)
+static long aio_do_unlinkat(int fd, const char *filename, int flags, int mode)
 {
 	if (flags || mode)
 		return -EINVAL;
@@ -1789,7 +1776,6 @@ static long aio_thread_op_foo_at(struct aio_kiocb *req)
 	long ret;
 	u32 fd;
 
-	use_mm(req->ki_ctx->mm);
 	if (unlikely(__get_user(fd, &req->ki_user_iocb->aio_fildes)))
 		ret = -EFAULT;
 	else if (unlikely(__get_user(buf, &req->ki_user_iocb->aio_buf)))
@@ -1804,15 +1790,25 @@ static long aio_thread_op_foo_at(struct aio_kiocb *req)
 				(int)offset,
 				(unsigned short)(offset >> 32));
 	}
-	unuse_mm(req->ki_ctx->mm);
 	return ret;
 }
 
-long aio_foo_at(struct aio_kiocb *req, do_foo_at_t do_foo_at)
+static long aio_openat(struct aio_kiocb *req, struct iocb *uiocb, bool compat)
 {
-	req->ki_data = (unsigned long)(void *)do_foo_at;
+	req->ki_data = (unsigned long)(void *)aio_do_openat;
 	return aio_thread_queue_iocb(req, aio_thread_op_foo_at,
 				     AIO_THREAD_NEED_TASK |
+				     AIO_THREAD_NEED_MM |
+				     AIO_THREAD_NEED_FILES |
+				     AIO_THREAD_NEED_CRED);
+}
+
+static long aio_unlink(struct aio_kiocb *req, struct iocb *uiocb, bool compat)
+{
+	req->ki_data = (unsigned long)(void *)aio_do_unlinkat;
+	return aio_thread_queue_iocb(req, aio_thread_op_foo_at,
+				     AIO_THREAD_NEED_TASK |
+				     AIO_THREAD_NEED_MM |
 				     AIO_THREAD_NEED_FILES |
 				     AIO_THREAD_NEED_CRED);
 }
@@ -1898,17 +1894,23 @@ static long aio_thread_op_readahead(struct aio_kiocb *iocb)
 	return 0;
 }
 
-long aio_readahead(struct aio_kiocb *iocb, unsigned long len)
+static long aio_ra(struct aio_kiocb *iocb, struct iocb *uiocb, bool compat)
 {
 	struct address_space *mapping = iocb->common.ki_filp->f_mapping;
 	pgoff_t index, end;
 	loff_t epos, isize;
 	int do_io = 0;
+	size_t len;
 
+	if (!aio_may_use_threads())
+		return -EINVAL;
+	if (uiocb->aio_buf)
+		return -EINVAL;
 	if (!mapping || !mapping->a_ops)
 		return -EBADF;
 	if (!mapping->a_ops->readpage && !mapping->a_ops->readpages)
 		return -EBADF;
+	len = uiocb->aio_nbytes;
 	if (!len)
 		return 0;
 
@@ -1958,7 +1960,6 @@ static long aio_thread_op_renameat(struct aio_kiocb *iocb)
 	unsigned flags;
 	long ret;
 
-	use_mm(aio_get_mm(&iocb->common));
 	if (unlikely(copy_from_user(&info, user_info, sizeof(info)))) {
 		ret = -EFAULT;
 		goto done;
@@ -1979,39 +1980,47 @@ static long aio_thread_op_renameat(struct aio_kiocb *iocb)
 	else
 		ret = sys_renameat2(olddir, old, newdir, new, flags);
 done:
-	unuse_mm(aio_get_mm(&iocb->common));
 	return ret;
 }
 
-long aio_renameat(struct aio_kiocb *iocb, struct iocb *user_iocb)
+static long aio_rename(struct aio_kiocb *iocb, struct iocb *user_iocb, bool c)
 {
-	const void * __user user_info;
-
 	if (user_iocb->aio_nbytes != sizeof(struct renameat_info))
 		return -EINVAL;
 	if (user_iocb->aio_offset)
 		return -EINVAL;
 
-	user_info = (const void * __user)(long)user_iocb->aio_buf;
-	if (unlikely(!access_ok(VERIFY_READ, user_info,
-				sizeof(struct renameat_info))))
-		return -EFAULT;
-
-	iocb->common.private = (void *)user_info;
+	iocb->common.private = (void *)(long)user_iocb->aio_buf;
 	return aio_thread_queue_iocb(iocb, aio_thread_op_renameat,
 				     AIO_THREAD_NEED_TASK |
+				     AIO_THREAD_NEED_MM |
 				     AIO_THREAD_NEED_FS |
 				     AIO_THREAD_NEED_FILES |
 				     AIO_THREAD_NEED_CRED);
 }
 #endif /* IS_ENABLED(CONFIG_AIO_THREAD) */
 
+long aio_fsync(struct aio_kiocb *req, struct iocb *user_iocb, bool compat)
+{
+	bool datasync = (user_iocb->aio_lio_opcode == IOCB_CMD_FDSYNC);
+	struct file *file = req->common.ki_filp;
+
+	if (file->f_op->aio_fsync)
+		return file->f_op->aio_fsync(&req->common, datasync);
+#if IS_ENABLED(CONFIG_AIO_THREAD)
+	if (file->f_op->fsync) {
+		req->ki_data = datasync;
+		return aio_thread_queue_iocb(req, aio_thread_op_fsync, 0);
+	}
+#endif
+	return -EINVAL;
+}
+
 /*
- * aio_run_iocb:
- *	Performs the initial checks and io submission.
+ * aio_rw:
+ *	Implements read/write vectored and non-vectored
  */
-static ssize_t aio_run_iocb(struct aio_kiocb *req, struct iocb *user_iocb,
-			    bool compat)
+static long aio_rw(struct aio_kiocb *req, struct iocb *user_iocb, bool compat)
 {
 	struct file *file = req->common.ki_filp;
 	ssize_t ret = -EINVAL;
@@ -2085,70 +2094,42 @@ rw_common:
 			file_end_write(file);
 		break;
 
-	case IOCB_CMD_FDSYNC:
-		if (file->f_op->aio_fsync)
-			ret = file->f_op->aio_fsync(&req->common, 1);
-		else if (file->f_op->fsync && (aio_may_use_threads()))
-			ret = aio_fsync(&req->common, 1);
-		break;
-
-	case IOCB_CMD_FSYNC:
-		if (file->f_op->aio_fsync)
-			ret = file->f_op->aio_fsync(&req->common, 0);
-		else if (file->f_op->fsync && (aio_may_use_threads()))
-			ret = aio_fsync(&req->common, 0);
-		break;
-
-	case IOCB_CMD_POLL:
-		if (aio_may_use_threads())
-			ret = aio_poll(req);
-		break;
-
-	case IOCB_CMD_OPENAT:
-		if (aio_may_use_threads())
-			ret = aio_foo_at(req, aio_do_openat);
-		break;
-
-	case IOCB_CMD_UNLINKAT:
-		if (aio_may_use_threads())
-			ret = aio_foo_at(req, aio_do_unlinkat);
-		break;
-
-	case IOCB_CMD_READAHEAD:
-		if (user_iocb->aio_buf)
-			return -EINVAL;
-		if (aio_may_use_threads())
-			ret = aio_readahead(req, user_iocb->aio_nbytes);
-		break;
-
-	case IOCB_CMD_RENAMEAT:
-		if (aio_may_use_threads())
-			ret = aio_renameat(req, user_iocb);
-		break;
-
 	default:
 		pr_debug("EINVAL: no operation provided\n");
-		return -EINVAL;
 	}
+	return ret;
+}
 
-	if (ret != -EIOCBQUEUED) {
-		/*
-		 * There's no easy way to restart the syscall since other AIO's
-		 * may be already running. Just fail this IO with EINTR.
-		 */
-		if (unlikely(ret == -ERESTARTSYS || ret == -ERESTARTNOINTR ||
-			     ret == -ERESTARTNOHAND ||
-			     ret == -ERESTART_RESTARTBLOCK))
-			ret = -EINTR;
-		aio_complete(&req->common, ret, 0);
-	}
+typedef long (*aio_submit_fn_t)(struct aio_kiocb *req, struct iocb *iocb,
+				bool compat);
 
-	return 0;
-}
+#define NEED_FD			0x0001
+
+struct submit_info {
+	aio_submit_fn_t		fn;
+	unsigned long		flags;
+};
+
+static const struct submit_info aio_submit_info[] = {
+	[IOCB_CMD_PREAD]	= { aio_rw,	NEED_FD },
+	[IOCB_CMD_PWRITE]	= { aio_rw,	NEED_FD },
+	[IOCB_CMD_PREADV]	= { aio_rw,	NEED_FD },
+	[IOCB_CMD_PWRITEV]	= { aio_rw,	NEED_FD },
+	[IOCB_CMD_FSYNC]	= { aio_fsync,	NEED_FD },
+	[IOCB_CMD_FDSYNC]	= { aio_fsync,	NEED_FD },
+#if IS_ENABLED(CONFIG_AIO_THREAD)
+	[IOCB_CMD_POLL]		= { aio_poll,	NEED_FD },
+	[IOCB_CMD_OPENAT]	= { aio_openat,	0 },
+	[IOCB_CMD_UNLINKAT]	= { aio_unlink,	0 },
+	[IOCB_CMD_READAHEAD]	= { aio_ra,	NEED_FD },
+	[IOCB_CMD_RENAMEAT]	= { aio_rename,	0 },
+#endif
+};
 
 static int io_submit_one(struct kioctx *ctx, struct iocb __user *user_iocb,
 			 struct iocb *iocb, bool compat)
 {
+	const struct submit_info *submit_info;
 	struct aio_kiocb *req;
 	ssize_t ret;
 
@@ -2168,23 +2149,26 @@ static int io_submit_one(struct kioctx *ctx, struct iocb __user *user_iocb,
 		return -EINVAL;
 	}
 
+	if (unlikely(iocb->aio_lio_opcode >= ARRAY_SIZE(aio_submit_info)))
+		return -EINVAL;
+	submit_info = &aio_submit_info[iocb->aio_lio_opcode];
+	if (unlikely(!submit_info->fn))
+		return -EINVAL;
+
 	req = aio_get_req(ctx);
 	if (unlikely(!req))
 		return -EAGAIN;
 
-	if (iocb->aio_lio_opcode == IOCB_CMD_OPENAT)
-		req->common.ki_filp = NULL;
-	else {
+	if (submit_info->flags & NEED_FD) {
 		req->common.ki_filp = fget(iocb->aio_fildes);
 		if (unlikely(!req->common.ki_filp)) {
 			ret = -EBADF;
 			goto out_put_req;
 		}
+		req->common.ki_flags = iocb_flags(req->common.ki_filp);
 	}
 	req->common.ki_pos = iocb->aio_offset;
 	req->common.ki_complete = aio_complete;
-	if (req->common.ki_filp)
-		req->common.ki_flags = iocb_flags(req->common.ki_filp);
 
 	if (iocb->aio_flags & IOCB_FLAG_RESFD) {
 		/*
@@ -2212,10 +2196,20 @@ static int io_submit_one(struct kioctx *ctx, struct iocb __user *user_iocb,
 	req->ki_user_iocb = user_iocb;
 	req->ki_user_data = iocb->aio_data;
 
-	ret = aio_run_iocb(req, iocb, compat);
-	if (ret)
-		goto out_put_req;
-
+	ret = submit_info->fn(req, iocb, compat);
+	if (ret != -EIOCBQUEUED) {
+		/*
+		 * There's no easy way to restart the syscall since other AIO's
+		 * may be already running. Just fail this IO with EINTR.
+		 */
+		if (unlikely(ret == -ERESTARTSYS || ret == -ERESTARTNOINTR ||
+			     ret == -ERESTARTNOHAND ||
+			     ret == -ERESTART_RESTARTBLOCK))
+			ret = -EINTR;
+		else if (IS_ERR_VALUE(ret))
+			goto out_put_req;
+		aio_complete(&req->common, ret, 0);
+	}
 	return 0;
 out_put_req:
 	put_reqs_available(ctx, 1);
-- 
"Thought is the essence of where you are now."

--
To unsubscribe, send a message with 'unsubscribe linux-aio' in
the body to majordomo@kvack.org.  For more info on Linux AIO,
see: http://www.kvack.org/aio/
Don't email: aart@kvack.org

^ permalink raw reply related	[flat|nested] 133+ messages in thread

* Re: [PATCH 07/13] aio: enabled thread based async fsync
  2016-01-12  4:03               ` Linus Torvalds
  (?)
@ 2016-01-12 22:59                 ` Andy Lutomirski
  -1 siblings, 0 replies; 133+ messages in thread
From: Andy Lutomirski @ 2016-01-12 22:59 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Dave Chinner, Al Viro, Andrew Morton, linux-fsdevel, Linux API,
	Benjamin LaHaise, Linux Kernel Mailing List, linux-aio, linux-mm

On Jan 11, 2016 8:04 PM, "Linus Torvalds" <torvalds@linux-foundation.org> wrote:
>
> On Mon, Jan 11, 2016 at 7:37 PM, Dave Chinner <david@fromorbit.com> wrote:
> >
> > Yes, I heard you the first time, but you haven't acknowledged that
> > the aio fsync interface is indeed different because it already
> > exists. What's the problem with implementing an AIO call that we've
> > advertised as supported for many years now that people are asking us
> > to implement it?
>
> Oh, I don't disagree with that. I think it should be exposed, my point
> was that that too was not enough.
>
> I don't see why you argue. You said "that's not enough". And I just
> said that your expansion wasn't sufficient either, and that I think we
> should strive to expand things even more.
>
> And preferably not in some ad-hoc manner. Expand it to *everything* we can do.
>
> > As for a generic async syscall interface, why not just add
> > IOCB_CMD_SYSCALL that encodes the syscall number and parameters
> > into the iovec structure and let the existing aio subsystem handle
> > demultiplexing it and handing them off to threads/workqueues/etc?
>
> That would likely be the simplest approach, yes.
>
> There's a few arguments against it, though:
>
>  - doing the indirect system call thing does end up being
> architecture-specific, so now you do need the AIO code to call into
> some arch wrapper.

How many arches *can* do it?  As of 4.4, x86_32 can, but x86_64 can't
yet.  We'd also need a whitelist of acceptable indirect syscalls (e.g.
exit is bad).  And we have to worry about things that depend on the mm
or creds.

It would be extra nice if we could avoid switch_mm for things that
don't need it (fsync) and only do it for things like read that do.

--Andy

^ permalink raw reply	[flat|nested] 133+ messages in thread

* Re: [PATCH 07/13] aio: enabled thread based async fsync
  2016-01-12  1:20       ` Linus Torvalds
  (?)
@ 2016-01-14  9:19         ` Paolo Bonzini
  -1 siblings, 0 replies; 133+ messages in thread
From: Paolo Bonzini @ 2016-01-14  9:19 UTC (permalink / raw)
  To: Linus Torvalds, Dave Chinner
  Cc: Benjamin LaHaise, linux-aio, linux-fsdevel,
	Linux Kernel Mailing List, Linux API, linux-mm, Alexander Viro,
	Andrew Morton



On 12/01/2016 02:20, Linus Torvalds wrote:
> On Mon, Jan 11, 2016 at 5:11 PM, Dave Chinner <david@fromorbit.com> wrote:
>>
>> Insufficient. Needs the range to be passed through and call
>> vfs_fsync_range(), as I implemented here:
> 
> And I think that's insufficient *also*.
> 
> What you actually want is "sync_file_range()", with the full set of arguments.
> 
> Yes, really. Sometimes you want to start the writeback, sometimes you
> want to wait for it. Sometimes you want both.
> 
> For example, if you are doing your own manual write-behind logic, it
> is not sufficient for "wait for data". What you want is "start IO on
> new data" followed by "wait for old data to have been written out".
> 
> I think this only strengthens my "stop with the idiotic
> special-case-AIO magic already" argument.  If we want something more
> generic than the usual aio, then we should go all in. Not "let's make
> more limited special cases".

The question is, do we really want something more generic than the usual
AIO?

Virt is one of the 10 (that's a binary number) users of AIO, and we
don't even use it by default because in most cases it's really a wash.

Let's compare AIO with a simple userspace thread pool.

AIO has the ability to submit and retrieve the results of multiple
operations at once.  Thread pools do not have the ability to submit
multiple operations at a time (you could play games with FUTEX_WAKE, but
then all the threads in the pool would have cacheline bounces on the futex).

The syscall overhead on the critical path is comparable.  For AIO it's
io_submit+io_getevents, for a thread pool it's FUTEX_WAKE plus invoking
the actual syscall.  Again, the only difference for AIO is batching.

Unless userspace is submitting tens of thousands of operations per
second, which is pretty much the case only for read/write, there's no
real benefit in asynchronous system calls over a userspace thread pool.
 That applies to openat, unlinkat, fadvise (for readahead).  It also
applies to msync and fsync, etc. because if your workload is doing tons
of those you'd better buy yourself a disk with a battery-backed cache,
or a UPS, and remove the msync/fsync altogether.

So I'm really happy if we can move the thread creation overhead for such
a thread pool to the kernel.  It keeps the benefits of batching, it uses
the optimized kernel workqueues, it doesn't incur the cost of pthreads,
it makes it easy to remove the cases where AIO is blocking, it makes it
easy to add support for !O_DIRECT.  But everything else seems overkill.

Paolo

^ permalink raw reply	[flat|nested] 133+ messages in thread

* Re: [PATCH 07/13] aio: enabled thread based async fsync
  2016-01-12  4:48                 ` Linus Torvalds
  (?)
@ 2016-01-15 20:21                   ` Benjamin LaHaise
  -1 siblings, 0 replies; 133+ messages in thread
From: Benjamin LaHaise @ 2016-01-15 20:21 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Dave Chinner, linux-aio, linux-fsdevel,
	Linux Kernel Mailing List, Linux API, linux-mm, Alexander Viro,
	Andrew Morton

On Mon, Jan 11, 2016 at 08:48:23PM -0800, Linus Torvalds wrote:
> On Mon, Jan 11, 2016 at 8:03 PM, Linus Torvalds
> <torvalds@linux-foundation.org> wrote:
> >
> > So my argument is really that I think it would be better to at least
> > look into maybe creating something less crapulent, and striving to
> > make it easy to make the old legacy interfaces be just wrappers around
> > a more capable model.
> 
> Hmm. Thinking more about this makes me worry about all the system call
> versioning and extra work done by libc.
> 
> At least glibc has traditionally decided to munge and extend on kernel
> system call interfaces, to the point where even fairly core data
> structures (like "struct stat") may not always look the same to the
> kernel as they do to user space.
> 
> So with that worry, I have to admit that maybe a limited interface -
> rather than allowing arbitrary generic async system calls - might have
> advantages. Less room for mismatches.
> 
> I'll have to think about this some more.

Any further thoughts on this after a few days worth of pondering?

		-ben

>                   Linus

-- 
"Thought is the essence of where you are now."

^ permalink raw reply	[flat|nested] 133+ messages in thread

* Re: [PATCH 07/13] aio: enabled thread based async fsync
  2016-01-15 20:21                   ` Benjamin LaHaise
@ 2016-01-20  3:59                     ` Linus Torvalds
  -1 siblings, 0 replies; 133+ messages in thread
From: Linus Torvalds @ 2016-01-20  3:59 UTC (permalink / raw)
  To: Benjamin LaHaise
  Cc: Dave Chinner, linux-aio, linux-fsdevel,
	Linux Kernel Mailing List, Linux API, linux-mm, Alexander Viro,
	Andrew Morton

On Fri, Jan 15, 2016 at 12:21 PM, Benjamin LaHaise <bcrl@kvack.org> wrote:
>>
>> I'll have to think about this some more.
>
> Any further thoughts on this after a few days worth of pondering?

Sorry about the delay, with the merge window and me being sick for a
couple of days I didn't get around to this.

After thinking it over some more, I guess I'm ok with your approach.
The table-driven patch makes me a bit happier, and I guess not very
many people end up ever wanting to do async system calls anyway.

Are there other users outside of Solace? It would be good to get comments..

            Linus

^ permalink raw reply	[flat|nested] 133+ messages in thread

* Re: [PATCH 07/13] aio: enabled thread based async fsync
  2016-01-20  3:59                     ` Linus Torvalds
@ 2016-01-20  5:02                       ` Theodore Ts'o
  -1 siblings, 0 replies; 133+ messages in thread
From: Theodore Ts'o @ 2016-01-20  5:02 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Benjamin LaHaise, Dave Chinner, linux-aio, linux-fsdevel,
	Linux Kernel Mailing List, Linux API, linux-mm, Alexander Viro,
	Andrew Morton

On Tue, Jan 19, 2016 at 07:59:35PM -0800, Linus Torvalds wrote:
> 
> After thinking it over some more, I guess I'm ok with your approach.
> The table-driven patch makes me a bit happier, and I guess not very
> many people end up ever wanting to do async system calls anyway.
> 
> Are there other users outside of Solace? It would be good to get comments..

For async I/O?  We're using it inside Google, for networking and for
storage I/O's.  We don't need async fsync/fdatasync, but we do need
very fast, low overhead I/O's.  To that end, we have some patches to
batch block layer completion handling, which Kent tried upstreaming a
few years back but which everyone thought was too ugly to live.

(It *was* ugly, but we had access to some very fast storage devices
where it really mattered.  With upcoming NVMe devices, that sort of
hardware should be available to more folks, so it's something that
I've been meaning to revisit from an upstreaming perspective,
especially if I can get my hands on some publicly available hardware
for benchmarking purposes to demonstrate why it's useful, even if it
is ugly.)

The other thing which we have which is a bit more experimental is that
we've plumbed through the aio priority bits to the block layer, as
well as aio_cancel.  The idea for the latter is if you are are
interested in low latency access to a clustered file system, where
sometimes a read request can get stuck behind other I/O requests if a
server has a long queue of requests to service.  So the client for
which low latency is very important fires off the request to more than
one server, and as soon as it gets an answer it sends a "never mind"
message to the other server(s).

The code to do aio_cancellation in the block layer is fairly well
tested, and was in Kent's git trees, but never got formally pushed
upstream.  The code to push the cancellation request all the way to
the HDD (for those hard disks / storage devices that support I/O
cancellation) is even more experimental, and needs a lot of cleanup
before it could be sent for review (it was done by someone who isn't
used to upstream coding standards).

The reason we haven't tried to push more of these changes
upstream has been lack of resources, and the fact that the AIO code
*is* ugly, which means extensions tend to make the code at the very
least, more complex.  Especially since some of the folks working on
it, such as Kent, were really worried about performance at all costs,
and Kernighan's "it's twice as hard to debug code as to write it"
comment really applies here.  And since very few people outside of
Google seem to use AIO, and even fewer seem eager to review or work on
AIO, and our team is quite small for the work we need to do, it just
hasn't risen to the top of the priority list.

Still, it's fair to say that if you are using Google Hangouts, or
Google Mail, or Google Docs, AIO is most definitely getting used to
process your queries.

As far as comments, aside from the "we really care about performance",
and "the code is scary complex and barely on the edge of being
maintainable", the other comment I'd make is libaio is pretty awful,
and so as a result a number (most?) of our AIO users have elected to
use the raw system call interfaces and are *not* using the libaio
abstractions --- which, as near as I can tell, don't really buy you
much anyway.  (Do we really need to keep code that provides backwards
compatibility with kernels over 10+ years old at this point?)

Cheers,

					- Ted

^ permalink raw reply	[flat|nested] 133+ messages in thread

* Re: [PATCH 07/13] aio: enabled thread based async fsync
  2016-01-20  3:59                     ` Linus Torvalds
@ 2016-01-20 19:59                       ` Dave Chinner
  -1 siblings, 0 replies; 133+ messages in thread
From: Dave Chinner @ 2016-01-20 19:59 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Benjamin LaHaise, linux-aio, linux-fsdevel,
	Linux Kernel Mailing List, Linux API, linux-mm, Alexander Viro,
	Andrew Morton

On Tue, Jan 19, 2016 at 07:59:35PM -0800, Linus Torvalds wrote:
> On Fri, Jan 15, 2016 at 12:21 PM, Benjamin LaHaise <bcrl@kvack.org> wrote:
> >>
> >> I'll have to think about this some more.
> >
> > Any further thoughts on this after a few days worth of pondering?
> 
> Sorry about the delay, with the merge window and me being sick for a
> couple of days I didn't get around to this.
> 
> After thinking it over some more, I guess I'm ok with your approach.
> The table-driven patch makes me a bit happier, and I guess not very
> many people end up ever wanting to do async system calls anyway.
> 
> Are there other users outside of Solace? It would be good to get comments..

I know of quite a few storage/db products that use AIO. The most
recent high-profile project that has been reporting issues with AIO
on XFS is http://www.scylladb.com/. That project is architected
around non-blocking AIO for scalability reasons...

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 133+ messages in thread

* Re: [PATCH 07/13] aio: enabled thread based async fsync
  2016-01-20 19:59                       ` Dave Chinner
@ 2016-01-20 20:29                         ` Linus Torvalds
  -1 siblings, 0 replies; 133+ messages in thread
From: Linus Torvalds @ 2016-01-20 20:29 UTC (permalink / raw)
  To: Dave Chinner
  Cc: Benjamin LaHaise, linux-aio, linux-fsdevel,
	Linux Kernel Mailing List, Linux API, linux-mm, Alexander Viro,
	Andrew Morton

On Wed, Jan 20, 2016 at 11:59 AM, Dave Chinner <david@fromorbit.com> wrote:
>>
>> Are there other users outside of Solace? It would be good to get comments..
>
> I know of quite a few storage/db products that use AIO. The most
> recent high profile project that have been reporting issues with AIO
> on XFS is http://www.scylladb.com/. That project is architected
> around non-blocking AIO for scalability reasons...

I was more wondering about the new interfaces, making sure that the
feature set actually matches what people want to do..

That said, I also agree that it would be interesting to hear what the
performance impact is for existing performance-sensitive users. Could
we make that "aio_may_use_threads()" case be unconditional, making
things simpler?

          Linus

^ permalink raw reply	[flat|nested] 133+ messages in thread

* Re: [PATCH 07/13] aio: enabled thread based async fsync
  2016-01-20 20:29                         ` Linus Torvalds
@ 2016-01-20 20:44                           ` Benjamin LaHaise
  -1 siblings, 0 replies; 133+ messages in thread
From: Benjamin LaHaise @ 2016-01-20 20:44 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Dave Chinner, linux-aio, linux-fsdevel,
	Linux Kernel Mailing List, Linux API, linux-mm, Alexander Viro,
	Andrew Morton

On Wed, Jan 20, 2016 at 12:29:32PM -0800, Linus Torvalds wrote:
> On Wed, Jan 20, 2016 at 11:59 AM, Dave Chinner <david@fromorbit.com> wrote:
> >>
> >> Are there other users outside of Solace? It would be good to get comments..
> >
> > I know of quite a few storage/db products that use AIO. The most
> > recent high profile project that have been reporting issues with AIO
> > on XFS is http://www.scylladb.com/. That project is architected
> > around non-blocking AIO for scalability reasons...
> 
> I was more wondering about the new interfaces, making sure that the
> feature set actually matches what people want to do..

I suspect this will be an ongoing learning exercise as people start to use 
the new functionality and find gaps in terms of what is needed.  Certainly 
there is a bunch of stuff we need to add to cover the cases where disk i/o 
is required.  getdents() is one example, but the ABI issues we have with it 
are somewhat more complicated given the history associated with that 
interface.

> That said, I also agree that it would be interesting to hear what the
> performance impact is for existing performance-sensitive users. Could
> we make that "aio_may_use_threads()" case be unconditional, making
> things simpler?

Making it unconditional is a goal, but some work is required before that 
can be the case.  The O_DIRECT issue is one such matter -- it requires some 
changes to the filesystems to ensure that they adhere to the non-blocking 
nature of the new interface (ie taking i_mutex is a Bad Thing that users 
really do not want to be exposed to; if taking it blocks, the code should 
punt to a helper thread).  Additional auditing of some of the read/write 
implementations is also required, which will likely need some minor changes 
in things like sysfs and other weird functionality we have.  Having the 
flag reflects that while the functionality is useful, not all of the bugs 
have been worked out yet.

What's the desired approach to merge these changes?  Does it make sense 
to merge what is ready now and prepare the next round of changes for 4.6?  
Or is it more important to grow things to a more complete state before 
merging?

Regards,

		-ben

>           Linus

-- 
"Thought is the essence of where you are now."

^ permalink raw reply	[flat|nested] 133+ messages in thread

* Re: [PATCH 07/13] aio: enabled thread based async fsync
  2016-01-20 20:44                           ` Benjamin LaHaise
@ 2016-01-20 21:45                             ` Dave Chinner
  -1 siblings, 0 replies; 133+ messages in thread
From: Dave Chinner @ 2016-01-20 21:45 UTC (permalink / raw)
  To: Benjamin LaHaise
  Cc: Linus Torvalds, linux-aio, linux-fsdevel,
	Linux Kernel Mailing List, Linux API, linux-mm, Alexander Viro,
	Andrew Morton

On Wed, Jan 20, 2016 at 03:44:49PM -0500, Benjamin LaHaise wrote:
> On Wed, Jan 20, 2016 at 12:29:32PM -0800, Linus Torvalds wrote:
> > On Wed, Jan 20, 2016 at 11:59 AM, Dave Chinner <david@fromorbit.com> wrote:
> > >>
> > >> Are there other users outside of Solace? It would be good to get comments..
> > >
> > > I know of quite a few storage/db products that use AIO. The most
> > > recent high profile project that have been reporting issues with AIO
> > > on XFS is http://www.scylladb.com/. That project is architected
> > > around non-blocking AIO for scalability reasons...
> > 
> > I was more wondering about the new interfaces, making sure that the
> > feature set actually matches what people want to do..
> 
> I suspect this will be an ongoing learning exercise as people start to use 
> the new functionality and find gaps in terms of what is needed.  Certainly 
> there is a bunch of stuff we need to add to cover the cases where disk i/o 
> is required.  getdents() is one example, but the ABI issues we have with it 
> are somewhat more complicated given the history associated with that 
> interface.
> 
> > That said, I also agree that it would be interesting to hear what the
> > performance impact is for existing performance-sensitive users. Could
> > we make that "aio_may_use_threads()" case be unconditional, making
> > things simpler?
> 
> Making it unconditional is a goal, but some work is required before that 
> can be the case.  The O_DIRECT issue is one such matter -- it requires some 
> changes to the filesystems to ensure that they adhere to the non-blocking 
> nature of the new interface (ie taking i_mutex is a Bad Thing that users 
> really do not want to be exposed to; if taking it blocks, the code should 
> punt to a helper thread).

Filesystems *must take locks* in the IO path. We have to serialise
against truncate and other operations at some point in the IO path
(e.g. block mapping vs concurrent allocation and/or removal), and
that can only be done sanely with sleeping locks.  There is no way
of knowing in advance if we are going to block, and so either we
always use threads for IO submission or we accept that occasionally
the AIO submission will block.

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 133+ messages in thread

* Re: [PATCH 07/13] aio: enabled thread based async fsync
  2016-01-20 21:45                             ` Dave Chinner
  (?)
@ 2016-01-20 21:56                               ` Benjamin LaHaise
  -1 siblings, 0 replies; 133+ messages in thread
From: Benjamin LaHaise @ 2016-01-20 21:56 UTC (permalink / raw)
  To: Dave Chinner
  Cc: Linus Torvalds, linux-aio, linux-fsdevel,
	Linux Kernel Mailing List, Linux API, linux-mm, Alexander Viro,
	Andrew Morton

On Thu, Jan 21, 2016 at 08:45:46AM +1100, Dave Chinner wrote:
> Filesystems *must take locks* in the IO path. We have to serialise
> against truncate and other operations at some point in the IO path
> (e.g. block mapping vs concurrent allocation and/or removal), and
> that can only be done sanely with sleeping locks.  There is no way
> of knowing in advance if we are going to block, and so either we
> always use threads for IO submission or we accept that occasionally
> the AIO submission will block.

I never said we don't take locks.  Still, we can be more intelligent 
about when and where we do so.  With the nonblocking pread() and pwrite() 
changes being proposed elsewhere, we can do the part of the I/O that 
doesn't block in the submitter, which is a huge win when possible.

As it stands today, *every* buffered write takes i_mutex immediately 
on entering ->write().  That one issue alone accounts for a nearly 10x 
performance difference between an O_SYNC write and an O_DIRECT write, 
and using O_SYNC writes is a legitimate use-case for users who want 
caching of data by the kernel (duplicating that functionality is a huge 
amount of work for an application, plus if you want the cache to be 
persistent between runs of an app, you have to get the kernel to do it).

		-ben

> Cheers,
> 
> Dave.
> -- 
> Dave Chinner
> david@fromorbit.com

-- 
"Thought is the essence of where you are now."

^ permalink raw reply	[flat|nested] 133+ messages in thread

* Re: [PATCH 07/13] aio: enabled thread based async fsync
  2016-01-20 20:29                         ` Linus Torvalds
  (?)
@ 2016-01-20 21:57                           ` Dave Chinner
  -1 siblings, 0 replies; 133+ messages in thread
From: Dave Chinner @ 2016-01-20 21:57 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Benjamin LaHaise, linux-aio, linux-fsdevel,
	Linux Kernel Mailing List, Linux API, linux-mm, Alexander Viro,
	Andrew Morton

On Wed, Jan 20, 2016 at 12:29:32PM -0800, Linus Torvalds wrote:
> On Wed, Jan 20, 2016 at 11:59 AM, Dave Chinner <david@fromorbit.com> wrote:
> >>
> >> Are there other users outside of Solace? It would be good to get comments..
> >
> > I know of quite a few storage/db products that use AIO. The most
> > recent high profile project that have been reporting issues with AIO
> > on XFS is http://www.scylladb.com/. That project is architected
> > around non-blocking AIO for scalability reasons...
> 
> I was more wondering about the new interfaces, making sure that the
> feature set actually matches what people want to do..

Well, they have mentioned that openat() can block, as will the first
operation after open that requires reading the file extent map from
disk. There are some ways of hacking around this (e.g. running
FIEMAP with a zero extent count or ext4's special extent prefetch
ioctl in a separate thread to prefetch the extent list into memory
before IO is required) so I suspect we may actually need some
interfaces that don't currently exist at all....

> That said, I also agree that it would be interesting to hear what the
> performance impact is for existing performance-sensitive users. Could
> we make that "aio_may_use_threads()" case be unconditional, making
> things simpler?

That would make things a lot simpler in the kernel and AIO
submission a lot more predictable/deterministic for userspace. I'd
suggest that, at minimum, it should be the default behaviour...

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 133+ messages in thread

* Re: [PATCH 07/13] aio: enabled thread based async fsync
  2016-01-20 21:45                             ` Dave Chinner
  (?)
  (?)
@ 2016-01-20 23:07                             ` Linus Torvalds
  2016-01-23  4:39                                 ` Dave Chinner
  -1 siblings, 1 reply; 133+ messages in thread
From: Linus Torvalds @ 2016-01-20 23:07 UTC (permalink / raw)
  To: Dave Chinner
  Cc: Al Viro, Andrew Morton, linux-fsdevel, Linux API,
	Benjamin LaHaise, Linux Kernel Mailing List, linux-aio, linux-mm

On Jan 20, 2016 1:46 PM, "Dave Chinner" <david@fromorbit.com> wrote:
> >
> > > That said, I also agree that it would be interesting to hear what the
> > > performance impact is for existing performance-sensitive users. Could
> > > we make that "aio_may_use_threads()" case be unconditional, making
> > > things simpler?
> >
> > Making it unconditional is a goal, but some work is required before that
> > can be the case.  The O_DIRECT issue is one such matter -- it requires
> > some changes to the filesystems to ensure that they adhere to the
> > non-blocking nature of the new interface (ie taking i_mutex is a Bad
> > Thing that users really do not want to be exposed to; if taking it
> > blocks, the code should punt to a helper thread).
>
> Filesystems *must take locks* in the IO path.

I agree.

I also would prefer to make the aio code have as little interaction and
magic flags with the filesystem code as humanly possible.

I wonder if we could make the rough rule be that the only synchronous case
the aio code ever has is more or less entirely in the generic vfs caches?
IOW, could we possibly aim to make the rule be that if we call down to the
filesystem layer, we do that within a thread?

We could do things like that for the name lookup for openat() too, where
we could handle the successful RCU lookup synchronously, but then if we
fall out of RCU mode we'd do the thread.

    Linus


^ permalink raw reply	[flat|nested] 133+ messages in thread

* Re: [PATCH 07/13] aio: enabled thread based async fsync
  2016-01-12  1:11     ` Dave Chinner
  (?)
@ 2016-01-22 15:31       ` Andres Freund
  -1 siblings, 0 replies; 133+ messages in thread
From: Andres Freund @ 2016-01-22 15:31 UTC (permalink / raw)
  To: Dave Chinner
  Cc: Benjamin LaHaise, linux-aio, linux-fsdevel, linux-kernel,
	linux-api, linux-mm, Alexander Viro, Andrew Morton,
	Linus Torvalds

On 2016-01-12 12:11:28 +1100, Dave Chinner wrote:
> On Mon, Jan 11, 2016 at 05:07:23PM -0500, Benjamin LaHaise wrote:
> > Enable a fully asynchronous fsync and fdatasync operations in aio using
> > the aio thread queuing mechanism.
> > 
> > Signed-off-by: Benjamin LaHaise <ben.lahaise@solacesystems.com>
> > Signed-off-by: Benjamin LaHaise <bcrl@kvack.org>
> 
> Insufficient. Needs the range to be passed through and call
> vfs_fsync_range(), as I implemented here:
> 
> https://lkml.org/lkml/2015/10/28/878

FWIW, I finally started to play around with this (or more precisely
https://lkml.org/lkml/2015/10/29/517). Some prerequisite changes were
required in postgres to actually benefit, which delayed things.  First
results are good, increasing OLTP throughput considerably.

It'd also be rather helpful to be able to do
sync_file_range(SYNC_FILE_RANGE_WRITE) asynchronously, i.e. flush
without an implied barrier. Currently this blocks very frequently, even
if there's actually IO bandwidth available.

Regards,

Andres

^ permalink raw reply	[flat|nested] 133+ messages in thread

* Re: [PATCH 07/13] aio: enabled thread based async fsync
  2016-01-20  3:59                     ` Linus Torvalds
@ 2016-01-22 15:41                       ` Andres Freund
  -1 siblings, 0 replies; 133+ messages in thread
From: Andres Freund @ 2016-01-22 15:41 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Benjamin LaHaise, Dave Chinner, linux-aio, linux-fsdevel,
	Linux Kernel Mailing List, Linux API, linux-mm, Alexander Viro,
	Andrew Morton

On 2016-01-19 19:59:35 -0800, Linus Torvalds wrote:
> Are there other users outside of Solace? It would be good to get comments..

PostgreSQL is a potential user of async fdatasync, fsync,
sync_file_range and potentially readahead, write, read. First tests with
Dave's async fsync/fsync_range are positive, so are the results with a
self-hacked async sync_file_range (although I'm kinda thinking that it
shouldn't really require to be used asynchronously).

I rather doubt openat, unlink et al are going to be interesting for
*us*; the required structural changes would be too big.  But obviously
that doesn't mean anything for others.

Andres

^ permalink raw reply	[flat|nested] 133+ messages in thread

* Re: [PATCH 07/13] aio: enabled thread based async fsync
  2016-01-20 21:56                               ` Benjamin LaHaise
@ 2016-01-23  4:24                                 ` Dave Chinner
  -1 siblings, 0 replies; 133+ messages in thread
From: Dave Chinner @ 2016-01-23  4:24 UTC (permalink / raw)
  To: Benjamin LaHaise
  Cc: Linus Torvalds, linux-aio, linux-fsdevel,
	Linux Kernel Mailing List, Linux API, linux-mm, Alexander Viro,
	Andrew Morton

On Wed, Jan 20, 2016 at 04:56:30PM -0500, Benjamin LaHaise wrote:
> On Thu, Jan 21, 2016 at 08:45:46AM +1100, Dave Chinner wrote:
> > Filesystems *must take locks* in the IO path. We have to serialise
> > against truncate and other operations at some point in the IO path
> > (e.g. block mapping vs concurrent allocation and/or removal), and
> > that can only be done sanely with sleeping locks.  There is no way
> > of knowing in advance if we are going to block, and so either we
> > always use threads for IO submission or we accept that occasionally
> > the AIO submission will block.
> 
> I never said we don't take locks.  Still, we can be more intelligent 
> about when and where we do so.  With the nonblocking pread() and pwrite() 
> changes being proposed elsewhere, we can do the part of the I/O that 
> doesn't block in the submitter, which is a huge win when possible.
> 
> As it stands today, *every* buffered write takes i_mutex immediately 
> on entering ->write().  That one issue alone accounts for a nearly 10x 
> performance difference between an O_SYNC write and an O_DIRECT write, 

Yes, that locking is for correct behaviour, not for performance
reasons.  The i_mutex is providing the required semantics for POSIX
write(2) functionality - writes must serialise against other reads
and writes so that they are completed atomically w.r.t. other IO.
i.e. writes to the same offset must not interleave, not should reads
be able to see partial data from a write in progress.

Direct IO does not conform to POSIX concurrency standards, so we
don't have to serialise concurrent IO against each other.

> and using O_SYNC writes is a legitimate use-case for users who want 
> caching of data by the kernel (duplicating that functionality is a huge 
> amount of work for an application, plus if you want the cache to be 
> persistent between runs of an app, you have to get the kernel to do it).

Yes, but you take what you get given. Buffered IO sucks in many ways;
this is just one of them.

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 133+ messages in thread

* Re: [PATCH 07/13] aio: enabled thread based async fsync
  2016-01-20 23:07                             ` Linus Torvalds
@ 2016-01-23  4:39                                 ` Dave Chinner
  0 siblings, 0 replies; 133+ messages in thread
From: Dave Chinner @ 2016-01-23  4:39 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Al Viro, Andrew Morton, linux-fsdevel, Linux API,
	Benjamin LaHaise, Linux Kernel Mailing List, linux-aio, linux-mm

On Wed, Jan 20, 2016 at 03:07:26PM -0800, Linus Torvalds wrote:
> On Jan 20, 2016 1:46 PM, "Dave Chinner" <david@fromorbit.com> wrote:
> > >
> > > > That said, I also agree that it would be interesting to hear what the
> > > > performance impact is for existing performance-sensitive users. Could
> > > > we make that "aio_may_use_threads()" case be unconditional, making
> > > > things simpler?
> > >
> > > Making it unconditional is a goal, but some work is required before that
> > > can be the case.  The O_DIRECT issue is one such matter -- it requires
> some
> > > changes to the filesystems to ensure that they adhere to the
> non-blocking
> > > nature of the new interface (ie taking i_mutex is a Bad Thing that users
> > > really do not want to be exposed to; if taking it blocks, the code
> should
> > > punt to a helper thread).
> >
> > Filesystems *must take locks* in the IO path.
> 
> I agree.
> 
> I also would prefer to make the aio code have as little interaction and
> magic flags with the filesystem code as humanly possible.
> 
> I wonder if we could make the rough rule be that the only synchronous case
> the aio code ever has is more or less entirely in the generic vfs caches?
> IOW, could we possibly aim to make the rule be that if we call down to the
> filesystem layer, we do that within a thread?

We have to go through the filesystem layer locking even on page
cache hits, and even if we get into the page cache copy-in/copy-out
code we can still get stuck on things like page locks and page
faults. Even if the pages are cached, we can still get caught on
deeper filesystem locks for block mapping. e.g. read from a hole,
get zeros back, page cache is populated. Write data into range,
fetch page, realise it's unmapped, need to do block/delayed
allocation which requires filesystem locks and potentially
transactions and IO....

> We could do things like that for the name lookup for openat() too, where
> we could handle the successful RCU lookup synchronously, but then if we
> fall out of RCU mode we'd do the thread.

We'd have to do quite a bit of work to unwind back out to the AIO
layer before we can dispatch the open operation again in a thread,
wouldn't we?

So I'm not convinced that conditional thread dispatch makes sense. I
think the simplest thing to do is make all AIO use threads/
workqueues by default, and if the application is smart enough to
only do things that minimise blocking they can turn off the threaded
dispatch and get the same behaviour they get now.

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 133+ messages in thread

* Re: [PATCH 07/13] aio: enabled thread based async fsync
  2016-01-23  4:24                                 ` Dave Chinner
  (?)
@ 2016-01-23  4:50                                   ` Benjamin LaHaise
  -1 siblings, 0 replies; 133+ messages in thread
From: Benjamin LaHaise @ 2016-01-23  4:50 UTC (permalink / raw)
  To: Dave Chinner
  Cc: Linus Torvalds, linux-aio, linux-fsdevel,
	Linux Kernel Mailing List, Linux API, linux-mm, Alexander Viro,
	Andrew Morton

On Sat, Jan 23, 2016 at 03:24:49PM +1100, Dave Chinner wrote:
> On Wed, Jan 20, 2016 at 04:56:30PM -0500, Benjamin LaHaise wrote:
> > On Thu, Jan 21, 2016 at 08:45:46AM +1100, Dave Chinner wrote:
> > > Filesystems *must take locks* in the IO path. We have to serialise
> > > against truncate and other operations at some point in the IO path
> > > (e.g. block mapping vs concurrent allocation and/or removal), and
> > > that can only be done sanely with sleeping locks.  There is no way
> > > of knowing in advance if we are going to block, and so either we
> > > always use threads for IO submission or we accept that occasionally
> > > the AIO submission will block.
> > 
> > I never said we don't take locks.  Still, we can be more intelligent 
> > about when and where we do so.  With the nonblocking pread() and pwrite() 
> > changes being proposed elsewhere, we can do the part of the I/O that 
> > doesn't block in the submitter, which is a huge win when possible.
> > 
> > As it stands today, *every* buffered write takes i_mutex immediately 
> > on entering ->write().  That one issue alone accounts for a nearly 10x 
> > performance difference between an O_SYNC write and an O_DIRECT write, 
> 
> Yes, that locking is for correct behaviour, not for performance
> reasons.  The i_mutex is providing the required semantics for POSIX
> write(2) functionality - writes must serialise against other reads
> and writes so that they are completed atomically w.r.t. other IO.
> i.e. writes to the same offset must not interleave, nor should reads
> be able to see partial data from a write in progress.

No, the locks are not *required* for POSIX semantics, they are a legacy
of how Linux filesystem code has been implemented and how we ensure the
necessary internal consistency needed inside our filesystems is
provided.  There are other ways to achieve the required semantics that
do not involve a single giant lock for the entire file/inode.  And no, I
am not saying that doing this is simple or easy to do.

		-ben

> Direct IO does not conform to POSIX concurrency standards, so we
> don't have to serialise concurrent IO against each other.
> 
> > and using O_SYNC writes is a legitimate use-case for users who want 
> > caching of data by the kernel (duplicating that functionality is a huge 
> > amount of work for an application, plus if you want the cache to be 
> > persistent between runs of an app, you have to get the kernel to do it).
> 
> Yes, but you take what you get given. Buffered IO sucks in many ways;
> this is just one of them.
> 
> Cheers,
> 
> Dave.
> -- 
> Dave Chinner
> david@fromorbit.com

-- 
"Thought is the essence of where you are now."

^ permalink raw reply	[flat|nested] 133+ messages in thread

* Re: [PATCH 07/13] aio: enabled thread based async fsync
  2016-01-23  4:50                                   ` Benjamin LaHaise
  (?)
@ 2016-01-23 22:22                                     ` Dave Chinner
  -1 siblings, 0 replies; 133+ messages in thread
From: Dave Chinner @ 2016-01-23 22:22 UTC (permalink / raw)
  To: Benjamin LaHaise
  Cc: Linus Torvalds, linux-aio, linux-fsdevel,
	Linux Kernel Mailing List, Linux API, linux-mm, Alexander Viro,
	Andrew Morton

On Fri, Jan 22, 2016 at 11:50:24PM -0500, Benjamin LaHaise wrote:
> On Sat, Jan 23, 2016 at 03:24:49PM +1100, Dave Chinner wrote:
> > On Wed, Jan 20, 2016 at 04:56:30PM -0500, Benjamin LaHaise wrote:
> > > On Thu, Jan 21, 2016 at 08:45:46AM +1100, Dave Chinner wrote:
> > > > Filesystems *must take locks* in the IO path. We have to serialise
> > > > against truncate and other operations at some point in the IO path
> > > > (e.g. block mapping vs concurrent allocation and/or removal), and
> > > > that can only be done sanely with sleeping locks.  There is no way
> > > > of knowing in advance if we are going to block, and so either we
> > > > always use threads for IO submission or we accept that occasionally
> > > > the AIO submission will block.
> > > 
> > > I never said we don't take locks.  Still, we can be more intelligent 
> > > about when and where we do so.  With the nonblocking pread() and pwrite() 
> > > changes being proposed elsewhere, we can do the part of the I/O that 
> > > doesn't block in the submitter, which is a huge win when possible.
> > > 
> > > As it stands today, *every* buffered write takes i_mutex immediately 
> > > on entering ->write().  That one issue alone accounts for a nearly 10x 
> > > performance difference between an O_SYNC write and an O_DIRECT write, 
> > 
> > Yes, that locking is for correct behaviour, not for performance
> > reasons.  The i_mutex is providing the required semantics for POSIX
> > write(2) functionality - writes must serialise against other reads
> > and writes so that they are completed atomically w.r.t. other IO.
> > i.e. writes to the same offset must not interleave, nor should reads
> > be able to see partial data from a write in progress.
> 
> No, the locks are not *required* for POSIX semantics, they are a legacy
> of how Linux filesystem code has been implemented and how we ensure the
> necessary internal consistency needed inside our filesystems is
> provided.

That may be the case, but I really don't see how you can provide
such required functionality without some kind of exclusion barrier
in place. No matter how you implement that exclusion, it can be seen
effectively as a lock.

Even if the filesystem doesn't use the i_mutex for exclusion to the
page cache, it has to use some kind of lock as that IO still needs
to be serialised against any truncate, hole punch or other extent
manipulation that is currently in progress on the inode...

> There are other ways to achieve the required semantics that
> do not involve a single giant lock for the entire file/inode.

Most performant filesystems don't have a "single giant lock"
anymore. The problem is that the VFS expects the i_mutex to be held
for certain operations in the IO path and the VFS lock order
hierarchy makes it impossible to do anything but "get i_mutex
first".  That's the problem that needs to be solved - the VFS
enforces the "one giant lock" model, even when underlying
filesystems do not require it.

i.e. we could quite happily remove the i_mutex completely from the XFS
buffered IO path without breaking anything, but we can't because
that results in the VFS throwing warnings that we don't hold the
i_mutex (e.g. like when removing the SUID bits on write). So there's
lots of VFS functionality that needs to be turned on its head
before the i_mutex can be removed from the IO path.

> And no, I
> am not saying that doing this is simple or easy to do.

Sure. That's always been the problem. Even when a split IO/metadata
locking strategy like what XFS uses (and other modern filesystems
are moving to internally) is suggested as a model for solving
these problems, the usual response is instant dismissal with
"no way, that's unworkable" and so nothing ever changes...

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 133+ messages in thread

* Re: [PATCH 07/13] aio: enabled thread based async fsync
@ 2016-01-23 22:22                                     ` Dave Chinner
  0 siblings, 0 replies; 133+ messages in thread
From: Dave Chinner @ 2016-01-23 22:22 UTC (permalink / raw)
  To: Benjamin LaHaise
  Cc: Linus Torvalds, linux-aio, linux-fsdevel,
	Linux Kernel Mailing List, Linux API, linux-mm, Alexander Viro,
	Andrew Morton

On Fri, Jan 22, 2016 at 11:50:24PM -0500, Benjamin LaHaise wrote:
> On Sat, Jan 23, 2016 at 03:24:49PM +1100, Dave Chinner wrote:
> > On Wed, Jan 20, 2016 at 04:56:30PM -0500, Benjamin LaHaise wrote:
> > > On Thu, Jan 21, 2016 at 08:45:46AM +1100, Dave Chinner wrote:
> > > > Filesystems *must take locks* in the IO path. We have to serialise
> > > > against truncate and other operations at some point in the IO path
> > > > (e.g. block mapping vs concurrent allocation and/or removal), and
> > > > that can only be done sanely with sleeping locks.  There is no way
> > > > of knowing in advance if we are going to block, and so either we
> > > > always use threads for IO submission or we accept that occasionally
> > > > the AIO submission will block.
> > > 
> > > I never said we don't take locks.  Still, we can be more intelligent 
> > > about when and where we do so.  With the nonblocking pread() and pwrite() 
> > > changes being proposed elsewhere, we can do the part of the I/O that 
> > > doesn't block in the submitter, which is a huge win when possible.
> > > 
> > > As it stands today, *every* buffered write takes i_mutex immediately 
> > > on entering ->write().  That one issue alone accounts for a nearly 10x 
> > > performance difference between an O_SYNC write and an O_DIRECT write, 
> > 
> > Yes, that locking is for correct behaviour, not for performance
> > reasons.  The i_mutex is providing the required semantics for POSIX
> > write(2) functionality - writes must serialise against other reads
> > and writes so that they are completed atomically w.r.t. other IO.
> > i.e. writes to the same offset must not interleave, not should reads
> > be able to see partial data from a write in progress.
> 
> No, the locks are not *required* for POSIX semantics, they are a legacy
> of how Linux filesystem code has been implemented and how we ensure the
> necessary internal consistency needed inside our filesystems is
> provided.

That may be the case, but I really don't see how you can provide
such required functionality without some kind of exclusion barrier
in place. No matter how you implement that exclusion, it can be seen
effectively as a lock.

Even if the filesystem doesn't use the i_mutex for exclusion to the
page cache, it has to use some kind of lock as that IO still needs
to be serialised against any truncate, hole punch or other extent
manipulation that is currently in progress on the inode...

> There are other ways to achieve the required semantics that
> do not involve a single giant lock for the entire file/inode.

Most performant filesystems don't have a "single giant lock"
anymore. The problem is that the VFS expects the i_mutex to be held
for certain operations in the IO path and the VFS lock order
heirarchy makes it impossible to do anything but "get i_mutex
first".  That's the problem that needs to be solved - the VFS
enforces the "one giant lock" model, even when underlying
filesystems do not require it.

i.e. we could quite happily remove the i_mutex completely from the XFS
buffered IO path without breaking anything, but we can't because
that results in the VFS throwing warnings that we don't hold the
i_mutex (e.g like when removing the SUID bits on write). So there's
lots of VFS functionality that needs to be turned on it's head
before the i_mutex can be removed from the IO path.

> And no, I
> am not saying that doing this is simple or easy to do.

Sure. That's always been the problem. Even when a split IO/metadata
locking strategy like what XFS uses (and other modern filesystems
are moving to internally) is suggested as a model for solving
these problems, the usual response instant dismissal with
"no way, that's unworkable" and so nothing ever changes...

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 133+ messages in thread

* Re: [PATCH 07/13] aio: enabled thread based async fsync
@ 2016-01-23 22:22                                     ` Dave Chinner
  0 siblings, 0 replies; 133+ messages in thread
From: Dave Chinner @ 2016-01-23 22:22 UTC (permalink / raw)
  To: Benjamin LaHaise
  Cc: Linus Torvalds, linux-aio-Bw31MaZKKs3YtjvyW6yDsg, linux-fsdevel,
	Linux Kernel Mailing List, Linux API, linux-mm, Alexander Viro,
	Andrew Morton

On Fri, Jan 22, 2016 at 11:50:24PM -0500, Benjamin LaHaise wrote:
> On Sat, Jan 23, 2016 at 03:24:49PM +1100, Dave Chinner wrote:
> > On Wed, Jan 20, 2016 at 04:56:30PM -0500, Benjamin LaHaise wrote:
> > > On Thu, Jan 21, 2016 at 08:45:46AM +1100, Dave Chinner wrote:
> > > > Filesystems *must take locks* in the IO path. We have to serialise
> > > > against truncate and other operations at some point in the IO path
> > > > (e.g. block mapping vs concurrent allocation and/or removal), and
> > > > that can only be done sanely with sleeping locks.  There is no way
> > > > of knowing in advance if we are going to block, and so either we
> > > > always use threads for IO submission or we accept that occasionally
> > > > the AIO submission will block.
> > > 
> > > I never said we don't take locks.  Still, we can be more intelligent 
> > > about when and where we do so.  With the nonblocking pread() and pwrite() 
> > > changes being proposed elsewhere, we can do the part of the I/O that 
> > > doesn't block in the submitter, which is a huge win when possible.
> > > 
> > > As it stands today, *every* buffered write takes i_mutex immediately 
> > > on entering ->write().  That one issue alone accounts for a nearly 10x 
> > > performance difference between an O_SYNC write and an O_DIRECT write, 
> > 
> > Yes, that locking is for correct behaviour, not for performance
> > reasons.  The i_mutex is providing the required semantics for POSIX
> > write(2) functionality - writes must serialise against other reads
> > and writes so that they are completed atomically w.r.t. other IO.
> > i.e. writes to the same offset must not interleave, nor should reads
> > be able to see partial data from a write in progress.
> 
> No, the locks are not *required* for POSIX semantics, they are a legacy
> of how Linux filesystem code has been implemented and how we ensure the
> necessary internal consistency needed inside our filesystems is
> provided.

That may be the case, but I really don't see how you can provide
the required functionality without some kind of exclusion barrier
in place. No matter how you implement that exclusion, it can be seen
effectively as a lock.

Even if the filesystem doesn't use the i_mutex for exclusion to the
page cache, it has to use some kind of lock as that IO still needs
to be serialised against any truncate, hole punch or other extent
manipulation that is currently in progress on the inode...

> There are other ways to achieve the required semantics that
> do not involve a single giant lock for the entire file/inode.

Most performant filesystems don't have a "single giant lock"
anymore. The problem is that the VFS expects the i_mutex to be held
for certain operations in the IO path and the VFS lock order
hierarchy makes it impossible to do anything but "get i_mutex
first".  That's the problem that needs to be solved - the VFS
enforces the "one giant lock" model, even when underlying
filesystems do not require it.

i.e. we could quite happily remove the i_mutex completely from the XFS
buffered IO path without breaking anything, but we can't because
that results in the VFS throwing warnings that we don't hold the
i_mutex (e.g. when removing the SUID bits on write). So there's
lots of VFS functionality that needs to be turned on its head
before the i_mutex can be removed from the IO path.

> And no, I
> am not saying that doing this is simple or easy to do.

Sure. That's always been the problem. Even when a split IO/metadata
locking strategy like what XFS uses (and other modern filesystems
are moving to internally) is suggested as a model for solving
these problems, the usual response is instant dismissal with
"no way, that's unworkable" and so nothing ever changes...

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 133+ messages in thread

* aio openat Re: [PATCH 07/13] aio: enabled thread based async fsync
  2016-01-23  4:39                                 ` Dave Chinner
@ 2016-03-14 17:17                                   ` Benjamin LaHaise
  -1 siblings, 0 replies; 133+ messages in thread
From: Benjamin LaHaise @ 2016-03-14 17:17 UTC (permalink / raw)
  To: Dave Chinner
  Cc: Linus Torvalds, Al Viro, Andrew Morton, linux-fsdevel, Linux API,
	Linux Kernel Mailing List, linux-aio, linux-mm

On Sat, Jan 23, 2016 at 03:39:22PM +1100, Dave Chinner wrote:
> On Wed, Jan 20, 2016 at 03:07:26PM -0800, Linus Torvalds wrote:
...
> > We could do things like that for the name lookup for openat() too, where
> > we could handle the successful RCU lookup synchronously, but then if we
> > fall out of RCU mode we'd do the thread.
> 
> We'd have to do quite a bit of work to unwind back out to the AIO
> layer before we can dispatch the open operation again in a thread,
> wouldn't we?

I had some time last week to make an aio openat do what it can in 
submit context.  The results are an improvement: when openat is handled 
in submit context it completes in about half the time it takes compared 
to the round trip via the work queue, and it's not terribly much code 
either.

		-ben
-- 
"Thought is the essence of where you are now."

 fs/aio.c              |  122 +++++++++++++++++++++++++++++++++++++++++---------
 fs/internal.h         |    1 
 fs/namei.c            |   16 ++++--
 fs/open.c             |    2 
 include/linux/namei.h |    1 
 5 files changed, 117 insertions(+), 25 deletions(-)

commit 5d3d80fcf99287decc4774af01967cebbb0242fd
Author: Benjamin LaHaise <bcrl@kvack.org>
Date:   Thu Mar 10 17:15:07 2016 -0500

    aio: add support for in-submit openat
    
    Using the LOOKUP_RCU infrastructure added for open(), implement the
    functionality needed for io_submit() to perform a non-blocking openat()
    in submit context.  This avoids the overhead of punting to another
    kernel thread to complete the open operation when the files and data are
    already in the dcache.  This helps cut simple aio openat() from ~60-90K
    cycles to ~24-45K cycles on my test system.
    
    Signed-off-by: Benjamin LaHaise <bcrl@kvack.org>

diff --git a/fs/aio.c b/fs/aio.c
index 0a9309e..67c58b6 100644
--- a/fs/aio.c
+++ b/fs/aio.c
@@ -42,6 +42,8 @@
 #include <linux/mount.h>
 #include <linux/fdtable.h>
 #include <linux/fs_struct.h>
+#include <linux/fsnotify.h>
+#include <linux/namei.h>
 #include <../mm/internal.h>
 
 #include <asm/kmap_types.h>
@@ -163,6 +165,7 @@ struct kioctx {
 
 struct aio_kiocb;
 typedef long (*aio_thread_work_fn_t)(struct aio_kiocb *iocb);
+typedef void (*aio_destruct_fn_t)(struct aio_kiocb *iocb);
 
 /*
  * We use ki_cancel == KIOCB_CANCELLED to indicate that a kiocb has been either
@@ -210,6 +213,7 @@ struct aio_kiocb {
 #if IS_ENABLED(CONFIG_AIO_THREAD)
 	struct task_struct	*ki_cancel_task;
 	unsigned long		ki_data;
+	unsigned long		ki_data2;
 	unsigned long		ki_rlimit_fsize;
 	unsigned		ki_thread_flags;	/* AIO_THREAD_NEED... */
 	aio_thread_work_fn_t	ki_work_fn;
@@ -217,6 +221,7 @@ struct aio_kiocb {
 	struct fs_struct	*ki_fs;
 	struct files_struct	*ki_files;
 	const struct cred	*ki_cred;
+	aio_destruct_fn_t	ki_destruct_fn;
 #endif
 };
 
@@ -1093,6 +1098,8 @@ out_put:
 
 static void kiocb_free(struct aio_kiocb *req)
 {
+	if (req->ki_destruct_fn)
+		req->ki_destruct_fn(req);
 	if (req->common.ki_filp)
 		fput(req->common.ki_filp);
 	if (req->ki_eventfd != NULL)
@@ -1546,6 +1553,18 @@ static void aio_thread_fn(struct work_struct *work)
 		     ret == -ERESTARTNOHAND || ret == -ERESTART_RESTARTBLOCK))
 		ret = -EINTR;
 
+	/* Completion serializes cancellation by taking ctx_lock, so
+	 * aio_complete() will not return until after force_sig() in
+	 * aio_thread_queue_iocb_cancel().  This should ensure that
+	 * the signal is pending before being flushed in this thread.
+	 */
+	aio_complete(&iocb->common, ret, 0);
+	if (fatal_signal_pending(current))
+		flush_signals(current);
+
+	/* Clean up state after aio_complete() since ki_destruct may still
+	 * need to access them.
+	 */
 	if (iocb->ki_cred) {
 		current->cred = old_cred;
 		put_cred(iocb->ki_cred);
@@ -1558,15 +1577,6 @@ static void aio_thread_fn(struct work_struct *work)
 		exit_fs(current);
 		current->fs = old_fs;
 	}
-
-	/* Completion serializes cancellation by taking ctx_lock, so
-	 * aio_complete() will not return until after force_sig() in
-	 * aio_thread_queue_iocb_cancel().  This should ensure that
-	 * the signal is pending before being flushed in this thread.
-	 */
-	aio_complete(&iocb->common, ret, 0);
-	if (fatal_signal_pending(current))
-		flush_signals(current);
 }
 
 /* aio_thread_queue_iocb
@@ -1758,11 +1768,6 @@ static long aio_poll(struct aio_kiocb *req, struct iocb *user_iocb, bool compat)
 	return aio_thread_queue_iocb(req, aio_thread_op_poll, 0);
 }
 
-static long aio_do_openat(int fd, const char *filename, int flags, int mode)
-{
-	return do_sys_open(fd, filename, flags, mode);
-}
-
 static long aio_do_unlinkat(int fd, const char *filename, int flags, int mode)
 {
 	if (flags || mode)
@@ -1793,14 +1798,91 @@ static long aio_thread_op_foo_at(struct aio_kiocb *req)
 	return ret;
 }
 
+static void openat_destruct(struct aio_kiocb *req)
+{
+	struct filename *filename = req->common.private;
+	int fd;
+
+	putname(filename);
+	fd = req->ki_data;
+	if (fd >= 0)
+		put_unused_fd(fd);
+}
+
+static long aio_thread_op_openat(struct aio_kiocb *req)
+{
+	struct filename *filename = req->common.private;
+	int mode = req->common.ki_pos >> 32;
+	int flags = req->common.ki_pos;
+	struct open_flags op;
+	struct file *f;
+	int dfd = req->ki_data2;
+
+	build_open_flags(flags, mode, &op);
+	f = do_filp_open(dfd, filename, &op);
+	if (!IS_ERR(f)) {
+		int fd = req->ki_data;
+		/* Prevent openat_destruct from doing put_unused_fd() */
+		req->ki_data = -1;
+		fsnotify_open(f);
+		fd_install(fd, f);
+		return fd;
+	}
+	return PTR_ERR(f);
+}
+
 static long aio_openat(struct aio_kiocb *req, struct iocb *uiocb, bool compat)
 {
-	req->ki_data = (unsigned long)(void *)aio_do_openat;
-	return aio_thread_queue_iocb(req, aio_thread_op_foo_at,
-				     AIO_THREAD_NEED_TASK |
-				     AIO_THREAD_NEED_MM |
-				     AIO_THREAD_NEED_FILES |
-				     AIO_THREAD_NEED_CRED);
+	int mode = req->common.ki_pos >> 32;
+	struct filename *filename;
+	struct open_flags op;
+	int flags;
+	int fd;
+
+	if (force_o_largefile())
+		req->common.ki_pos |= O_LARGEFILE;
+	flags = req->common.ki_pos;
+	fd = build_open_flags(flags, mode, &op);
+	if (fd)
+		goto out_err;
+
+	filename = getname((const char __user *)(long)uiocb->aio_buf);
+	if (IS_ERR(filename)) {
+		fd = PTR_ERR(filename);
+		goto out_err;
+	}
+	req->common.private = filename;
+	req->ki_destruct_fn = openat_destruct;
+	req->ki_data = fd = get_unused_fd_flags(flags);
+	if (fd >= 0) {
+		struct file *f;
+		op.lookup_flags |= LOOKUP_RCU | LOOKUP_NONBLOCK;
+		req->ki_data = fd;
+		req->ki_data2 = uiocb->aio_fildes;
+		f = do_filp_open(uiocb->aio_fildes, filename, &op);
+		if (IS_ERR(f) && ((PTR_ERR(f) == -ECHILD) ||
+				  (PTR_ERR(f) == -ESTALE) ||
+				  (PTR_ERR(f) == -EAGAIN))) {
+			int ret;
+			ret = aio_thread_queue_iocb(req, aio_thread_op_openat,
+						   AIO_THREAD_NEED_TASK |
+						   AIO_THREAD_NEED_FILES |
+						   AIO_THREAD_NEED_CRED);
+			if (ret == -EIOCBQUEUED)
+				return ret;
+			put_unused_fd(fd);
+			fd = ret;
+		} else if (IS_ERR(f)) {
+			put_unused_fd(fd);
+			fd = PTR_ERR(f);
+		} else {
+			fsnotify_open(f);
+			fd_install(fd, f);
+		}
+	}
+out_err:
+	aio_complete(&req->common, fd, 0);
+	return -EIOCBQUEUED;
 }
 
 static long aio_unlink(struct aio_kiocb *req, struct iocb *uiocb, bool compt)
diff --git a/fs/internal.h b/fs/internal.h
index 57b6010..c421572 100644
--- a/fs/internal.h
+++ b/fs/internal.h
@@ -102,6 +102,7 @@ struct open_flags {
 	int intent;
 	int lookup_flags;
 };
+extern int build_open_flags(int flags, umode_t mode, struct open_flags *op);
 extern struct file *do_filp_open(int dfd, struct filename *pathname,
 		const struct open_flags *op);
 extern struct file *do_file_open_root(struct dentry *, struct vfsmount *,
diff --git a/fs/namei.c b/fs/namei.c
index 84ecc7e..260782f 100644
--- a/fs/namei.c
+++ b/fs/namei.c
@@ -3079,6 +3079,12 @@ retry_lookup:
 		 * dropping this one anyway.
 		 */
 	}
+
+	if (nd->flags & LOOKUP_NONBLOCK) {
+		error = -EAGAIN;
+		goto out;
+	}
+		
 	mutex_lock(&dir->d_inode->i_mutex);
 	error = lookup_open(nd, &path, file, op, got_write, opened);
 	mutex_unlock(&dir->d_inode->i_mutex);
@@ -3356,10 +3362,12 @@ struct file *do_filp_open(int dfd, struct filename *pathname,
 
 	set_nameidata(&nd, dfd, pathname);
 	filp = path_openat(&nd, op, flags | LOOKUP_RCU);
-	if (unlikely(filp == ERR_PTR(-ECHILD)))
-		filp = path_openat(&nd, op, flags);
-	if (unlikely(filp == ERR_PTR(-ESTALE)))
-		filp = path_openat(&nd, op, flags | LOOKUP_REVAL);
+	if (!(op->lookup_flags & LOOKUP_RCU)) {
+		if (unlikely(filp == ERR_PTR(-ECHILD)))
+			filp = path_openat(&nd, op, flags);
+		if (unlikely(filp == ERR_PTR(-ESTALE)))
+			filp = path_openat(&nd, op, flags | LOOKUP_REVAL);
+	}
 	restore_nameidata();
 	return filp;
 }
diff --git a/fs/open.c b/fs/open.c
index b6f1e96..f6a45cb 100644
--- a/fs/open.c
+++ b/fs/open.c
@@ -884,7 +884,7 @@ struct file *dentry_open(const struct path *path, int flags,
 }
 EXPORT_SYMBOL(dentry_open);
 
-static inline int build_open_flags(int flags, umode_t mode, struct open_flags *op)
+inline int build_open_flags(int flags, umode_t mode, struct open_flags *op)
 {
 	int lookup_flags = 0;
 	int acc_mode;
diff --git a/include/linux/namei.h b/include/linux/namei.h
index d8c6334..1e76579 100644
--- a/include/linux/namei.h
+++ b/include/linux/namei.h
@@ -43,6 +43,7 @@ enum {LAST_NORM, LAST_ROOT, LAST_DOT, LAST_DOTDOT, LAST_BIND};
 #define LOOKUP_JUMPED		0x1000
 #define LOOKUP_ROOT		0x2000
 #define LOOKUP_EMPTY		0x4000
+#define LOOKUP_NONBLOCK		0x8000
 
 extern int user_path_at_empty(int, const char __user *, unsigned, struct path *, int *empty);
 

^ permalink raw reply related	[flat|nested] 133+ messages in thread

* Re: aio openat Re: [PATCH 07/13] aio: enabled thread based async fsync
  2016-03-14 17:17                                   ` Benjamin LaHaise
@ 2016-03-20  1:20                                     ` Linus Torvalds
  -1 siblings, 0 replies; 133+ messages in thread
From: Linus Torvalds @ 2016-03-20  1:20 UTC (permalink / raw)
  To: Benjamin LaHaise
  Cc: Dave Chinner, Al Viro, Andrew Morton, linux-fsdevel, Linux API,
	Linux Kernel Mailing List, linux-aio, linux-mm

On Mon, Mar 14, 2016 at 10:17 AM, Benjamin LaHaise <bcrl@kvack.org> wrote:
>
> I had some time last week to make an aio openat do what it can in
> submit context.  The results are an improvement: when openat is handled
> in submit context it completes in about half the time it takes compared
> to the round trip via the work queue, and it's not terribly much code
> either.

This looks good to me, and I do suspect that any of these aio paths
should strive to have a synchronous vs threaded model. I think that
makes the whole thing much more interesting from a performance
standpoint.

I still think the aio interface is really nasty, but this together
with the table-based approach you posted earlier does make me a _lot_
happier about the implementation. It just looks way less hacky, and now
it ends up exposing a rather more clever implementation too.

             Linus

^ permalink raw reply	[flat|nested] 133+ messages in thread

* Re: aio openat Re: [PATCH 07/13] aio: enabled thread based async fsync
  2016-03-20  1:20                                     ` Linus Torvalds
  (?)
@ 2016-03-20  1:26                                       ` Al Viro
  -1 siblings, 0 replies; 133+ messages in thread
From: Al Viro @ 2016-03-20  1:26 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Benjamin LaHaise, Dave Chinner, Andrew Morton, linux-fsdevel,
	Linux API, Linux Kernel Mailing List, linux-aio, linux-mm

On Sat, Mar 19, 2016 at 06:20:24PM -0700, Linus Torvalds wrote:
> On Mon, Mar 14, 2016 at 10:17 AM, Benjamin LaHaise <bcrl@kvack.org> wrote:
> >
> > I had some time last week to make an aio openat do what it can in
> > submit context.  The results are an improvement: when openat is handled
> > in submit context it completes in about half the time it takes compared
> > to the round trip via the work queue, and it's not terribly much code
> > either.
> 
> This looks good to me, and I do suspect that any of these aio paths
> should strive to have a synchronous vs threaded model. I think that
> makes the whole thing much more interesting from a performance
> standpoint.

Umm...  You do realize that LOOKUP_RCU in flags does *NOT* guarantee that
it won't block, right?  At the very least one would need to refuse to
fall back on non-RCU mode without a full restart.  Furthermore, vfs_open()
itself can easily block.

So this new LOOKUP flag makes no sense, and it's in just about _the_
worst place possible for adding special cases with ill-defined semantics -
do_last() is already far too convoluted and needs untangling, not adding
half-assed kludges.

^ permalink raw reply	[flat|nested] 133+ messages in thread

* Re: aio openat Re: [PATCH 07/13] aio: enabled thread based async fsync
  2016-03-20  1:26                                       ` Al Viro
  (?)
@ 2016-03-20  1:45                                         ` Linus Torvalds
  -1 siblings, 0 replies; 133+ messages in thread
From: Linus Torvalds @ 2016-03-20  1:45 UTC (permalink / raw)
  To: Al Viro
  Cc: Benjamin LaHaise, Dave Chinner, Andrew Morton, linux-fsdevel,
	Linux API, Linux Kernel Mailing List, linux-aio, linux-mm

On Sat, Mar 19, 2016 at 6:26 PM, Al Viro <viro@zeniv.linux.org.uk> wrote:
>
> Umm...  You do realize that LOOKUP_RCU in flags does *NOT* guarantee that
> it won't block, right?  At the very least one would need to refuse to
> fall back on non-RCU mode without a full restart.

It actually does seem to do that, although in an admittedly rather
questionable way.

I think it should use path_openat() rather than do_filp_open(), but
passing LOOKUP_RCU in to do_filp_open() actually does work: it just
means that the retry after ECHILD/ESTALE will do it *again* with
LOOKUP_RCU. It won't fall back to non-RCU mode; it just won't OR in
the LOOKUP_RCU flag that is already set.

So I agree that it should be cleaned up, but the basic model seems
fine. I'm sure you're right about do_last() not necessarily being the
best place either. But that doesn't really change that the approach
seems *much* better than the old unconditional "do in a work queue".

Also, the whole "no guarantees of never blocking" objection is a specious argument.

Just copying the iocb from user space can block. Copying the pathname
likewise (or copying the iovec in the case of reads and writes). So at
no point is the aio interface "guaranteed to never block". Blocking
will happen. You can block on allocating the "struct file", or on
extending the filp table.

In the end it's about _performance_, and if the performance is better
with very unlikely blocking synchronous calls, then that's the right
thing to do.

                   Linus

* Re: aio openat Re: [PATCH 07/13] aio: enabled thread based async fsync
  2016-03-20  1:45                                         ` Linus Torvalds
@ 2016-03-20  1:55                                           ` Al Viro
  -1 siblings, 0 replies; 133+ messages in thread
From: Al Viro @ 2016-03-20  1:55 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Benjamin LaHaise, Dave Chinner, Andrew Morton, linux-fsdevel,
	Linux API, Linux Kernel Mailing List, linux-aio, linux-mm

On Sat, Mar 19, 2016 at 06:45:19PM -0700, Linus Torvalds wrote:

> It actually does seem to do that, although in an admittedly rather
> questionable way.
> 
> I think it should use path_openat() rather than do_filp_open(), but
> passing in LOOKUP_RCU to do_filp_open() actually does work: it just
> means that the retry after ECHILD/ESTALE will just do it *again* with
> LOOKUP_RCU. It won't fall back to non-rcu mode, it just won't or in
> the LOOKUP_RCU flag that is already set.

What would make unlazy_walk() fail?  And if it succeeds, you are not
in RCU mode anymore *without* restarting from scratch...

* Re: aio openat Re: [PATCH 07/13] aio: enabled thread based async fsync
  2016-03-20  1:55                                           ` Al Viro
@ 2016-03-20  2:03                                             ` Linus Torvalds
  -1 siblings, 0 replies; 133+ messages in thread
From: Linus Torvalds @ 2016-03-20  2:03 UTC (permalink / raw)
  To: Al Viro
  Cc: Benjamin LaHaise, Dave Chinner, Andrew Morton, linux-fsdevel,
	Linux API, Linux Kernel Mailing List, linux-aio, linux-mm

On Sat, Mar 19, 2016 at 6:55 PM, Al Viro <viro@zeniv.linux.org.uk> wrote:
>
> What would make unlazy_walk() fail?  And if it succeeds, you are not
> in RCU mode anymore *without* restarting from scratch...

I don't see your point.

You don't want to be in RCU mode any more. You want to either succeed
or fail with ECHILD/ESTALE. Then, in the failure case, you go to the
thread.

What I meant by restarting was the restart that do_filp_open() does,
and there it just restarts with "op->lookup_flags", which has
LOOKUP_RCU still set, so it would just try to do the RCU lookup again.

But I actually notice now that Ben actually disabled that restart if
LOOKUP_RCU was set, so that ends up not even happening.

Anyway, I'm not saying it's polished and pretty. I think the changes
to do_filp_open() are a bit silly, and the code should just use
path_openat() directly. Possibly using a new helper (i.e. perhaps just
introduce an "rcu_filp_openat()" thing). But from a design perspective,
I think this all looks fine.

              Linus

end of thread, other threads:[~2016-03-20  2:03 UTC | newest]

Thread overview: 133+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2016-01-11 22:06 [PATCH 00/13] aio: thread (work queue) based aio and new aio functionality Benjamin LaHaise
2016-01-11 22:06 ` [PATCH 01/13] signals: distinguish signals sent due to i/o via io_send_sig() Benjamin LaHaise
2016-01-11 22:06 ` [PATCH 02/13] aio: add aio_get_mm() helper Benjamin LaHaise
2016-01-11 22:06 ` [PATCH 03/13] aio: for async operations, make the iter argument persistent Benjamin LaHaise
2016-01-11 22:07 ` [PATCH 04/13] signals: add and use aio_get_task() to direct signals sent via io_send_sig() Benjamin LaHaise
2016-01-11 22:07 ` [PATCH 05/13] fs: make do_loop_readv_writev() non-static Benjamin LaHaise
2016-01-11 22:07 ` [PATCH 06/13] aio: add queue_work() based threaded aio support Benjamin LaHaise
2016-01-11 22:07 ` [PATCH 07/13] aio: enabled thread based async fsync Benjamin LaHaise
2016-01-12  1:11   ` Dave Chinner
2016-01-12  1:20     ` Linus Torvalds
2016-01-12  2:25       ` Dave Chinner
2016-01-12  2:38         ` Linus Torvalds
2016-01-12  3:37           ` Dave Chinner
2016-01-12  4:03             ` Linus Torvalds
2016-01-12  4:48               ` Linus Torvalds
2016-01-12 22:50                 ` Benjamin LaHaise
2016-01-15 20:21                 ` Benjamin LaHaise
2016-01-20  3:59                   ` Linus Torvalds
2016-01-20  5:02                     ` Theodore Ts'o
2016-01-20 19:59                     ` Dave Chinner
2016-01-20 20:29                       ` Linus Torvalds
2016-01-20 20:44                         ` Benjamin LaHaise
2016-01-20 21:45                           ` Dave Chinner
2016-01-20 21:56                             ` Benjamin LaHaise
2016-01-23  4:24                               ` Dave Chinner
2016-01-23  4:50                                 ` Benjamin LaHaise
2016-01-23 22:22                                   ` Dave Chinner
2016-01-20 23:07                             ` Linus Torvalds
2016-01-23  4:39                               ` Dave Chinner
2016-03-14 17:17                                 ` aio openat " Benjamin LaHaise
2016-03-20  1:20                                   ` Linus Torvalds
2016-03-20  1:26                                     ` Al Viro
2016-03-20  1:45                                       ` Linus Torvalds
2016-03-20  1:55                                         ` Al Viro
2016-03-20  2:03                                           ` Linus Torvalds
2016-01-20 21:57                         ` Dave Chinner
2016-01-22 15:41                     ` Andres Freund
2016-01-12 22:59               ` Andy Lutomirski
2016-01-14  9:19       ` Paolo Bonzini
2016-01-12  1:30     ` Benjamin LaHaise
2016-01-22 15:31     ` Andres Freund
2016-01-11 22:07 ` [PATCH 08/13] aio: add support for aio poll via aio thread helper Benjamin LaHaise
2016-01-11 22:07 ` [PATCH 09/13] aio: add support for async openat() Benjamin LaHaise
2016-01-12  0:22   ` Linus Torvalds
2016-01-12  1:17     ` Benjamin LaHaise
2016-01-12  1:45     ` Chris Mason
2016-01-12  9:53     ` Ingo Molnar
2016-01-11 22:07 ` [PATCH 10/13] aio: add async unlinkat functionality Benjamin LaHaise
2016-01-11 22:07 ` [PATCH 11/13] mm: enable __do_page_cache_readahead() to include present pages Benjamin LaHaise
2016-01-11 22:07 ` [PATCH 12/13] aio: add support for aio readahead Benjamin LaHaise
2016-01-11 22:08 ` [PATCH 13/13] aio: add support for aio renameat operation Benjamin LaHaise
