* [PATCH v1 0/9] block & aio: kernel aio and loop mq conversion
From: Ming Lei @ 2014-08-14 15:50 UTC
  To: Jens Axboe, linux-kernel, Andrew Morton, Dave Kleikamp
  Cc: Zach Brown, Benjamin LaHaise, Christoph Hellwig, Kent Overstreet,
	linux-aio, linux-fsdevel, Dave Chinner

Hi,

The first two patches introduce kernel AIO support. Most of the
code is borrowed from Dave's work last year, and thanks to ITER_BVEC
it is much simpler to implement kernel AIO now. The AIO model is
well suited to in-kernel I/O submission, and can help improve both
throughput and CPU utilization. Many kernel components should
benefit from it, such as:
	- the loop driver,
	- all kinds of kernel I/O target drivers (SCSI, USB storage
	or UAS, ...),
	- kernel socket users, once socket AIO is mature.

The following six patches convert the current loop driver to blk-mq:
	- loop's scalability is improved considerably,
	- the loop driver is simplified quite a bit, so the
	conversion can be thought of as a cleanup too.

The 9th patch uses kernel AIO with O_DIRECT to improve loop's
performance in the single-job case, and to avoid the double-cache
problem in the loop driver as well.

With this change, loop's performance can be doubled in my fio test
(randread, single job, libaio). If more fio jobs are used, the
throughput improves much more because of blk-mq.

Given that loop is used quite widely, especially in VM environments,
and that the change is quite small, I hope it can be merged
eventually.

V1:
	- improve failure path in aio_kernel_submit()


 block/blk-mq.c            |   11 +-
 block/blk-mq.h            |    1 -
 drivers/block/loop.c      |  474 +++++++++++++++++++++++++--------------------
 drivers/block/loop.h      |   15 +-
 fs/aio.c                  |  121 ++++++++++++
 fs/direct-io.c            |    9 +-
 include/linux/aio.h       |   15 +-
 include/linux/blk-mq.h    |   13 ++
 include/uapi/linux/loop.h |    1 +
 9 files changed, 439 insertions(+), 221 deletions(-)


Thanks,
--
Ming Lei




* [PATCH v1 1/9] aio: add aio_kernel_() interface
From: Ming Lei @ 2014-08-14 15:50 UTC
  To: Jens Axboe, linux-kernel, Andrew Morton, Dave Kleikamp
  Cc: Zach Brown, Benjamin LaHaise, Christoph Hellwig, Kent Overstreet,
	linux-aio, linux-fsdevel, Dave Chinner, Alexander Viro, Ming Lei

From: Dave Kleikamp <dave.kleikamp@oracle.com>

This adds an interface that lets kernel callers submit AIO iocbs
without going through the user-space syscalls.  This lets kernel
callers avoid the management limits and overhead of the context.  It
will also let us integrate AIO operations with other kernel APIs that
the user-space interface doesn't have access to.

This patch is based on Dave's posts at the links below:

	https://lkml.org/lkml/2013/10/16/365
	https://groups.google.com/forum/#!topic/linux.kernel/l7mogGJZoKQ

Most of the patch is taken directly from Dave's work. Potential
users of this API include:
	- the loop block driver, to avoid double caching and to
	  improve throughput,
	- all kinds of kernel targets (SCSI, USB, ...) which need
	to access files efficiently,
	- all kinds of socket users, once socket AIO is mature.
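
As a rough usage sketch (not part of this patch): a kernel caller
allocates and initializes a kiocb, then submits it with an ITER_BVEC
iov_iter. Here build_bvec_iter() is a hypothetical stand-in for the
caller's iov_iter setup, and my_io_done() is the caller's completion
handler:

/* completion callback; may run in any context */
static void my_io_done(u64 user_data, long res)
{
	/*
	 * res is the byte count on success or a negative errno; the
	 * submitter eventually releases the kiocb with
	 * aio_kernel_free() once it is done with it.
	 */
}

static int my_submit_read(struct file *filp, struct bio_vec *bvec,
			  unsigned nr_segs, size_t bytes, loff_t pos,
			  u64 cookie)
{
	struct iov_iter iter;
	struct kiocb *iocb = aio_kernel_alloc(GFP_NOIO, 0);

	if (!iocb)
		return -ENOMEM;

	/* hypothetical helper: build an ITER_BVEC iov_iter over bvec */
	build_bvec_iter(&iter, bvec, nr_segs, bytes);

	aio_kernel_init_rw(iocb, filp, bytes, pos, my_io_done, cookie);

	/*
	 * 0: my_io_done() will be called with the result;
	 * negative errno: the request was not submitted.
	 */
	return aio_kernel_submit(iocb, false, &iter);
}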

Cc: Zach Brown <zab@zabbo.net>
Cc: Dave Kleikamp <dave.kleikamp@oracle.com>
Cc: Benjamin LaHaise <bcrl@kvack.org>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: linux-fsdevel@vger.kernel.org
Cc: linux-aio@kvack.org (open list:AIO)
Signed-off-by: Ming Lei <ming.lei@canonical.com>
---
 fs/aio.c            |  121 +++++++++++++++++++++++++++++++++++++++++++++++++++
 include/linux/aio.h |   15 ++++++-
 2 files changed, 135 insertions(+), 1 deletion(-)

diff --git a/fs/aio.c b/fs/aio.c
index ae63587..6437cb8 100644
--- a/fs/aio.c
+++ b/fs/aio.c
@@ -941,6 +941,9 @@ void aio_complete(struct kiocb *iocb, long res, long res2)
 		iocb->ki_ctx = ERR_PTR(-EXDEV);
 		wake_up_process(iocb->ki_obj.tsk);
 		return;
+	} else if (is_kernel_kiocb(iocb)) {
+		iocb->ki_obj.complete(iocb->ki_user_data, res);
+		return;
 	}
 
 	if (iocb->ki_list.next) {
@@ -1387,6 +1390,124 @@ rw_common:
 	return 0;
 }
 
+/*
+ * This allocates an iocb that will be used to submit and track completion of
+ * an IO that is issued from kernel space.
+ *
+ * The caller is expected to call the appropriate aio_kernel_init_() functions
+ * and then call aio_kernel_submit().  From that point forward progress is
+ * guaranteed by the file system aio method.  Eventually the caller's
+ * completion callback will be called.
+ *
+ * These iocbs are special.  They don't have a context, we don't limit the
+ * number pending, and they can't be canceled.
+ */
+struct kiocb *aio_kernel_alloc(gfp_t gfp, unsigned extra)
+{
+	return kzalloc(sizeof(struct kiocb) + extra, gfp);
+}
+EXPORT_SYMBOL_GPL(aio_kernel_alloc);
+
+void aio_kernel_free(struct kiocb *iocb)
+{
+	kfree(iocb);
+}
+EXPORT_SYMBOL_GPL(aio_kernel_free);
+
+/*
+ * ptr and count can be a buff and bytes or an iov and segs.
+ */
+void aio_kernel_init_rw(struct kiocb *iocb, struct file *filp,
+			size_t nr, loff_t off,
+			void (*complete)(u64 user_data, long res),
+			u64 user_data)
+{
+	iocb->ki_filp = filp;
+	iocb->ki_nbytes = nr;
+	iocb->ki_pos = off;
+	iocb->ki_ctx = (void *)-1;
+
+	iocb->ki_obj.complete = complete;
+	iocb->ki_user_data = user_data;
+}
+EXPORT_SYMBOL_GPL(aio_kernel_init_rw);
+
+static ssize_t aio_read_iter(struct kiocb *iocb, struct iov_iter *iter)
+{
+	struct file *file = iocb->ki_filp;
+	ssize_t ret;
+
+	if (unlikely(!(file->f_mode & FMODE_READ)))
+		return -EBADF;
+
+	ret = security_file_permission(file, MAY_READ);
+	if (unlikely(ret))
+		return ret;
+
+	if (!file->f_op->read_iter)
+		return -EINVAL;
+
+	return file->f_op->read_iter(iocb, iter);
+}
+
+static ssize_t aio_write_iter(struct kiocb *iocb, struct iov_iter *iter)
+{
+	struct file *file = iocb->ki_filp;
+	ssize_t ret;
+
+	if (unlikely(!(file->f_mode & FMODE_WRITE)))
+		return -EBADF;
+
+	ret = security_file_permission(file, MAY_WRITE);
+	if (unlikely(ret))
+		return ret;
+
+	if (!file->f_op->write_iter)
+		return -EINVAL;
+
+	file_start_write(file);
+	ret = file->f_op->write_iter(iocb, iter);
+	file_end_write(file);
+
+	return ret;
+}
+
+/*
+ * The iocb is our responsibility once this is called.  The caller must not
+ * reference it.
+ *
+ * Callers must be prepared for their iocb completion callback to be called the
+ * moment they enter this function.  The completion callback may be called from
+ * any context.
+ *
+ * Returns: 0: the iocb completion callback will be called with the op result
+ * negative errno: the operation was not submitted and the iocb was freed
+ */
+int aio_kernel_submit(struct kiocb *iocb, bool is_write,
+		      struct iov_iter *iter)
+{
+	int ret = -EINVAL;
+
+	if (WARN_ON(!is_kernel_kiocb(iocb) || !iocb->ki_obj.complete
+			|| !iocb->ki_filp || !(iter->type & ITER_BVEC)))
+		return ret;
+
+	if (is_write)
+		ret = aio_write_iter(iocb, iter);
+	else
+		ret = aio_read_iter(iocb, iter);
+
+	/*
+	 * use same policy with userspace aio, req may have been
+	 * completed already, so release it by aio completion.
+	 */
+	if (ret != -EIOCBQUEUED)
+		iocb->ki_obj.complete(iocb->ki_user_data, ret);
+
+	return 0;
+}
+EXPORT_SYMBOL_GPL(aio_kernel_submit);
+
 static int io_submit_one(struct kioctx *ctx, struct iocb __user *user_iocb,
 			 struct iocb *iocb, bool compat)
 {
diff --git a/include/linux/aio.h b/include/linux/aio.h
index d9c92da..4c27d42 100644
--- a/include/linux/aio.h
+++ b/include/linux/aio.h
@@ -31,13 +31,15 @@ typedef int (kiocb_cancel_fn)(struct kiocb *);
 
 struct kiocb {
 	struct file		*ki_filp;
-	struct kioctx		*ki_ctx;	/* NULL for sync ops */
+	struct kioctx		*ki_ctx;	/* NULL for sync ops,
+						 * -1 for kernel caller */
 	kiocb_cancel_fn		*ki_cancel;
 	void			*private;
 
 	union {
 		void __user		*user;
 		struct task_struct	*tsk;
+		void			(*complete)(u64 user_data, long res);
 	} ki_obj;
 
 	__u64			ki_user_data;	/* user's data for completion */
@@ -59,6 +61,11 @@ static inline bool is_sync_kiocb(struct kiocb *kiocb)
 	return kiocb->ki_ctx == NULL;
 }
 
+static inline bool is_kernel_kiocb(struct kiocb *kiocb)
+{
+	return kiocb->ki_ctx == (void *)-1;
+}
+
 static inline void init_sync_kiocb(struct kiocb *kiocb, struct file *filp)
 {
 	*kiocb = (struct kiocb) {
@@ -77,6 +84,12 @@ extern void exit_aio(struct mm_struct *mm);
 extern long do_io_submit(aio_context_t ctx_id, long nr,
 			 struct iocb __user *__user *iocbpp, bool compat);
 void kiocb_set_cancel_fn(struct kiocb *req, kiocb_cancel_fn *cancel);
+struct kiocb *aio_kernel_alloc(gfp_t gfp, unsigned extra);
+void aio_kernel_free(struct kiocb *iocb);
+void aio_kernel_init_rw(struct kiocb *iocb, struct file *filp, size_t nr,
+			loff_t off, void (*complete)(u64 user_data, long res),
+			u64 user_data);
+int aio_kernel_submit(struct kiocb *iocb, bool is_write, struct iov_iter *iter);
 #else
 static inline ssize_t wait_on_sync_kiocb(struct kiocb *iocb) { return 0; }
 static inline void aio_complete(struct kiocb *iocb, long res, long res2) { }
-- 
1.7.9.5



* [PATCH v1 2/9] fs/direct-io: introduce should_dirty for kernel aio
From: Ming Lei @ 2014-08-14 15:50 UTC
  To: Jens Axboe, linux-kernel, Andrew Morton, Dave Kleikamp
  Cc: Zach Brown, Benjamin LaHaise, Christoph Hellwig, Kent Overstreet,
	linux-aio, linux-fsdevel, Dave Chinner, Ming Lei, Alexander Viro

Pages coming from kernel AIO are required to be dirtied by the
caller before the direct I/O is issued, so the direct-io completion
path should not dirty them again.

The idea is borrowed from Dave's previous post.
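
As a sketch of the convention this patch assumes (not code from the
series): the kernel-AIO submitter dirties the pages it owns before
issuing the read, so the dio completion path no longer has to:

static void my_dirty_pages(struct page **pages, int nr_pages)
{
	int i;

	/*
	 * Dirty the caller-owned pages up front; with this patch the
	 * direct-io read-completion path will not touch them again.
	 */
	for (i = 0; i < nr_pages; i++)
		set_page_dirty_lock(pages[i]);
}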

Cc: Zach Brown <zab@zabbo.net>
Cc: Dave Kleikamp <dave.kleikamp@oracle.com>
Cc: Benjamin LaHaise <bcrl@kvack.org>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: linux-fsdevel@vger.kernel.org
Signed-off-by: Ming Lei <ming.lei@canonical.com>
---
 fs/direct-io.c |    9 ++++++---
 1 file changed, 6 insertions(+), 3 deletions(-)

diff --git a/fs/direct-io.c b/fs/direct-io.c
index c311640..f0460e0 100644
--- a/fs/direct-io.c
+++ b/fs/direct-io.c
@@ -121,6 +121,7 @@ struct dio {
 	int page_errors;		/* errno from get_user_pages() */
 	int is_async;			/* is IO async ? */
 	bool defer_completion;		/* defer AIO completion to workqueue? */
+	bool should_dirty;		/* if page should be dirty */
 	int io_error;			/* IO error in completion path */
 	unsigned long refcount;		/* direct_io_worker() and bios */
 	struct bio *bio_list;		/* singly linked via bi_private */
@@ -392,7 +393,7 @@ static inline void dio_bio_submit(struct dio *dio, struct dio_submit *sdio)
 	dio->refcount++;
 	spin_unlock_irqrestore(&dio->bio_lock, flags);
 
-	if (dio->is_async && dio->rw == READ)
+	if (dio->is_async && dio->rw == READ && dio->should_dirty)
 		bio_set_pages_dirty(bio);
 
 	if (sdio->submit_io)
@@ -463,13 +464,14 @@ static int dio_bio_complete(struct dio *dio, struct bio *bio)
 	if (!uptodate)
 		dio->io_error = -EIO;
 
-	if (dio->is_async && dio->rw == READ) {
+	if (dio->is_async && dio->rw == READ && dio->should_dirty) {
 		bio_check_pages_dirty(bio);	/* transfers ownership */
 	} else {
 		bio_for_each_segment_all(bvec, bio, i) {
 			struct page *page = bvec->bv_page;
 
-			if (dio->rw == READ && !PageCompound(page))
+			if (dio->rw == READ && !PageCompound(page) &&
+					dio->should_dirty)
 				set_page_dirty_lock(page);
 			page_cache_release(page);
 		}
@@ -1217,6 +1219,7 @@ do_blockdev_direct_IO(int rw, struct kiocb *iocb, struct inode *inode,
 
 	spin_lock_init(&dio->bio_lock);
 	dio->refcount = 1;
+	dio->should_dirty = !is_kernel_kiocb(iocb);
 
 	sdio.iter = iter;
 	sdio.final_block_in_request =
-- 
1.7.9.5



* [PATCH v1 3/9] blk-mq: export blk_mq_freeze_queue and blk_mq_unfreeze_queue
From: Ming Lei @ 2014-08-14 15:50 UTC
  To: Jens Axboe, linux-kernel, Andrew Morton, Dave Kleikamp
  Cc: Zach Brown, Benjamin LaHaise, Christoph Hellwig, Kent Overstreet,
	linux-aio, linux-fsdevel, Dave Chinner, Ming Lei

These two helpers are handy for switching the backing file in the
loop driver, so export them.
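
The intended pattern (see loop_switch() in patch 5) is essentially:

	/* quiesce the queue around a driver state change */
	blk_mq_freeze_queue(lo->lo_queue);	/* drain in-flight requests */
	do_loop_switch(lo, &w);			/* nothing is running now */
	blk_mq_unfreeze_queue(lo->lo_queue);	/* resume dispatch */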

Signed-off-by: Ming Lei <ming.lei@canonical.com>
---
 block/blk-mq.c         |    4 +++-
 block/blk-mq.h         |    1 -
 include/linux/blk-mq.h |    2 ++
 3 files changed, 5 insertions(+), 2 deletions(-)

diff --git a/block/blk-mq.c b/block/blk-mq.c
index 5189cb1..07e1033 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -120,8 +120,9 @@ void blk_mq_freeze_queue(struct request_queue *q)
 	blk_mq_run_queues(q, false);
 	wait_event(q->mq_freeze_wq, percpu_ref_is_zero(&q->mq_usage_counter));
 }
+EXPORT_SYMBOL(blk_mq_freeze_queue);
 
-static void blk_mq_unfreeze_queue(struct request_queue *q)
+void blk_mq_unfreeze_queue(struct request_queue *q)
 {
 	bool wake = false;
 
@@ -134,6 +135,7 @@ static void blk_mq_unfreeze_queue(struct request_queue *q)
 		wake_up_all(&q->mq_freeze_wq);
 	}
 }
+EXPORT_SYMBOL(blk_mq_unfreeze_queue);
 
 bool blk_mq_can_queue(struct blk_mq_hw_ctx *hctx)
 {
diff --git a/block/blk-mq.h b/block/blk-mq.h
index ca4964a..3800316 100644
--- a/block/blk-mq.h
+++ b/block/blk-mq.h
@@ -28,7 +28,6 @@ struct blk_mq_ctx {
 void __blk_mq_complete_request(struct request *rq);
 void blk_mq_run_hw_queue(struct blk_mq_hw_ctx *hctx, bool async);
 void blk_mq_init_flush(struct request_queue *q);
-void blk_mq_freeze_queue(struct request_queue *q);
 void blk_mq_free_queue(struct request_queue *q);
 void blk_mq_clone_flush_request(struct request *flush_rq,
 		struct request *orig_rq);
diff --git a/include/linux/blk-mq.h b/include/linux/blk-mq.h
index eb726b9..1cc1baa 100644
--- a/include/linux/blk-mq.h
+++ b/include/linux/blk-mq.h
@@ -175,6 +175,8 @@ void blk_mq_start_hw_queues(struct request_queue *q);
 void blk_mq_start_stopped_hw_queues(struct request_queue *q, bool async);
 void blk_mq_delay_queue(struct blk_mq_hw_ctx *hctx, unsigned long msecs);
 void blk_mq_tag_busy_iter(struct blk_mq_tags *tags, void (*fn)(void *data, unsigned long *), void *data);
+void blk_mq_freeze_queue(struct request_queue *q);
+void blk_mq_unfreeze_queue(struct request_queue *q);
 
 /*
  * Driver command data is immediately after the request. So subtract request
-- 
1.7.9.5



* [PATCH v1 4/9] blk-mq: introduce init_flush_rq_fn callback in 'blk_mq_ops'
From: Ming Lei @ 2014-08-14 15:50 UTC
  To: Jens Axboe, linux-kernel, Andrew Morton, Dave Kleikamp
  Cc: Zach Brown, Benjamin LaHaise, Christoph Hellwig, Kent Overstreet,
	linux-aio, linux-fsdevel, Dave Chinner, Ming Lei

Currently the pdu of the flush rq is simply copied from another rq,
which is not enough to initialize pointer fields correctly, so
introduce a callback that lets the driver handle this case easily.
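
For example, a driver whose per-request pdu embeds a work_struct can
re-run its normal request initialization from the callback, as the
loop driver does in patch 5 (my_cmd and my_work_fn are placeholder
names here):

static int my_init_flush_rq(void *data, struct request_queue *q,
		struct request *flush_rq, const struct request *src_rq)
{
	struct my_cmd *cmd = blk_mq_rq_to_pdu(flush_rq);

	/*
	 * The memcpy()'d pdu contains stale pointers inside the
	 * work_struct, so reinitialize it for the flush rq.
	 */
	INIT_WORK(&cmd->work, my_work_fn);
	cmd->rq = flush_rq;
	return 0;
}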

Signed-off-by: Ming Lei <ming.lei@canonical.com>
---
 block/blk-mq.c         |    7 +++++++
 include/linux/blk-mq.h |   11 +++++++++++
 2 files changed, 18 insertions(+)

diff --git a/block/blk-mq.c b/block/blk-mq.c
index 07e1033..295cfc8d 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -285,11 +285,18 @@ void blk_mq_clone_flush_request(struct request *flush_rq,
 {
 	struct blk_mq_hw_ctx *hctx =
 		orig_rq->q->mq_ops->map_queue(orig_rq->q, orig_rq->mq_ctx->cpu);
+	int ret = 0;
 
 	flush_rq->mq_ctx = orig_rq->mq_ctx;
 	flush_rq->tag = orig_rq->tag;
 	memcpy(blk_mq_rq_to_pdu(flush_rq), blk_mq_rq_to_pdu(orig_rq),
 		hctx->cmd_size);
+
+	if (orig_rq->q->mq_ops->init_flush_rq)
+		ret = orig_rq->q->mq_ops->init_flush_rq(
+				orig_rq->q->tag_set->driver_data,
+				orig_rq->q, flush_rq, orig_rq);
+	WARN_ON(ret);
 }
 
 inline void __blk_mq_end_io(struct request *rq, int error)
diff --git a/include/linux/blk-mq.h b/include/linux/blk-mq.h
index 1cc1baa..df000d4 100644
--- a/include/linux/blk-mq.h
+++ b/include/linux/blk-mq.h
@@ -85,6 +85,8 @@ typedef int (init_request_fn)(void *, struct request *, unsigned int,
 		unsigned int, unsigned int);
 typedef void (exit_request_fn)(void *, struct request *, unsigned int,
 		unsigned int);
+typedef int (init_flush_rq_fn)(void *, struct request_queue *,
+		struct request *, const struct request *);
 
 struct blk_mq_ops {
 	/*
@@ -119,6 +121,15 @@ struct blk_mq_ops {
 	 */
 	init_request_fn		*init_request;
 	exit_request_fn		*exit_request;
+
+	/*
+	 * Called after the flush rq is cloned, pointer can't be
+	 * initialized well with simply memcpy(), so driver can
+	 * use the callback to reinitialize flush rq. The 3rd
+	 * parameter is the flush rq which has been cloned/copied
+	 * from another rq(the 4th parameter).
+	 */
+	init_flush_rq_fn	*init_flush_rq;
 };
 
 enum {
-- 
1.7.9.5



* [PATCH v1 5/9] block: loop: convert to blk-mq
From: Ming Lei @ 2014-08-14 15:50 UTC
  To: Jens Axboe, linux-kernel, Andrew Morton, Dave Kleikamp
  Cc: Zach Brown, Benjamin LaHaise, Christoph Hellwig, Kent Overstreet,
	linux-aio, linux-fsdevel, Dave Chinner, Ming Lei

The conversion is fairly straightforward: a workqueue is used to
dispatch loop block requests, so scalability improves a lot, and
throughput also increases a lot under concurrent I/O requests.

Another benefit is that the loop driver code is simplified quite a
bit, and the patch can be thought of as a cleanup too.
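
In outline, the new dispatch path introduced below is:

/*
 * loop_queue_rq()                  <- blk-mq hands the driver a request
 *   queue_work(loop_wq, &cmd->work)
 * loop_queue_work()                <- runs in workqueue context
 *   do_bio_filebacked() for each bio of the request
 *   blk_mq_complete_request()
 * loop_softirq_done_fn()           <- .complete handler ends the rq
 */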

Signed-off-by: Ming Lei <ming.lei@canonical.com>
---
 drivers/block/loop.c |  294 ++++++++++++++++++++++----------------------------
 drivers/block/loop.h |   14 +--
 2 files changed, 137 insertions(+), 171 deletions(-)

diff --git a/drivers/block/loop.c b/drivers/block/loop.c
index 6cb1beb..1af5265 100644
--- a/drivers/block/loop.c
+++ b/drivers/block/loop.c
@@ -75,6 +75,7 @@
 #include <linux/sysfs.h>
 #include <linux/miscdevice.h>
 #include <linux/falloc.h>
+#include <linux/blk-mq.h>
 #include "loop.h"
 
 #include <asm/uaccess.h>
@@ -85,6 +86,8 @@ static DEFINE_MUTEX(loop_index_mutex);
 static int max_part;
 static int part_shift;
 
+static struct workqueue_struct *loop_wq;
+
 /*
  * Transfer functions
  */
@@ -466,109 +469,37 @@ out:
 	return ret;
 }
 
-/*
- * Add bio to back of pending list
- */
-static void loop_add_bio(struct loop_device *lo, struct bio *bio)
-{
-	lo->lo_bio_count++;
-	bio_list_add(&lo->lo_bio_list, bio);
-}
-
-/*
- * Grab first pending buffer
- */
-static struct bio *loop_get_bio(struct loop_device *lo)
-{
-	lo->lo_bio_count--;
-	return bio_list_pop(&lo->lo_bio_list);
-}
-
-static void loop_make_request(struct request_queue *q, struct bio *old_bio)
-{
-	struct loop_device *lo = q->queuedata;
-	int rw = bio_rw(old_bio);
-
-	if (rw == READA)
-		rw = READ;
-
-	BUG_ON(!lo || (rw != READ && rw != WRITE));
-
-	spin_lock_irq(&lo->lo_lock);
-	if (lo->lo_state != Lo_bound)
-		goto out;
-	if (unlikely(rw == WRITE && (lo->lo_flags & LO_FLAGS_READ_ONLY)))
-		goto out;
-	if (lo->lo_bio_count >= q->nr_congestion_on)
-		wait_event_lock_irq(lo->lo_req_wait,
-				    lo->lo_bio_count < q->nr_congestion_off,
-				    lo->lo_lock);
-	loop_add_bio(lo, old_bio);
-	wake_up(&lo->lo_event);
-	spin_unlock_irq(&lo->lo_lock);
-	return;
-
-out:
-	spin_unlock_irq(&lo->lo_lock);
-	bio_io_error(old_bio);
-}
-
 struct switch_request {
 	struct file *file;
 	struct completion wait;
 };
 
-static void do_loop_switch(struct loop_device *, struct switch_request *);
-
-static inline void loop_handle_bio(struct loop_device *lo, struct bio *bio)
+static inline int loop_handle_bio(struct loop_device *lo, struct bio *bio)
 {
-	if (unlikely(!bio->bi_bdev)) {
-		do_loop_switch(lo, bio->bi_private);
-		bio_put(bio);
-	} else {
-		int ret = do_bio_filebacked(lo, bio);
-		bio_endio(bio, ret);
-	}
+	int ret = do_bio_filebacked(lo, bio);
+	return ret;
 }
 
 /*
- * worker thread that handles reads/writes to file backed loop devices,
- * to avoid blocking in our make_request_fn. it also does loop decrypting
- * on reads for block backed loop, as that is too heavy to do from
- * b_end_io context where irqs may be disabled.
- *
- * Loop explanation:  loop_clr_fd() sets lo_state to Lo_rundown before
- * calling kthread_stop().  Therefore once kthread_should_stop() is
- * true, make_request will not place any more requests.  Therefore
- * once kthread_should_stop() is true and lo_bio is NULL, we are
- * done with the loop.
+ * Do the actual switch; called from the BIO completion routine
  */
-static int loop_thread(void *data)
+static void do_loop_switch(struct loop_device *lo, struct switch_request *p)
 {
-	struct loop_device *lo = data;
-	struct bio *bio;
-
-	set_user_nice(current, MIN_NICE);
-
-	while (!kthread_should_stop() || !bio_list_empty(&lo->lo_bio_list)) {
-
-		wait_event_interruptible(lo->lo_event,
-				!bio_list_empty(&lo->lo_bio_list) ||
-				kthread_should_stop());
-
-		if (bio_list_empty(&lo->lo_bio_list))
-			continue;
-		spin_lock_irq(&lo->lo_lock);
-		bio = loop_get_bio(lo);
-		if (lo->lo_bio_count < lo->lo_queue->nr_congestion_off)
-			wake_up(&lo->lo_req_wait);
-		spin_unlock_irq(&lo->lo_lock);
+	struct file *file = p->file;
+	struct file *old_file = lo->lo_backing_file;
+	struct address_space *mapping;
 
-		BUG_ON(!bio);
-		loop_handle_bio(lo, bio);
-	}
+	/* if no new file, only flush of queued bios requested */
+	if (!file)
+		return;
 
-	return 0;
+	mapping = file->f_mapping;
+	mapping_set_gfp_mask(old_file->f_mapping, lo->old_gfp_mask);
+	lo->lo_backing_file = file;
+	lo->lo_blocksize = S_ISBLK(mapping->host->i_mode) ?
+		mapping->host->i_bdev->bd_block_size : PAGE_SIZE;
+	lo->old_gfp_mask = mapping_gfp_mask(mapping);
+	mapping_set_gfp_mask(mapping, lo->old_gfp_mask & ~(__GFP_IO|__GFP_FS));
 }
 
 /*
@@ -579,15 +510,18 @@ static int loop_thread(void *data)
 static int loop_switch(struct loop_device *lo, struct file *file)
 {
 	struct switch_request w;
-	struct bio *bio = bio_alloc(GFP_KERNEL, 0);
-	if (!bio)
-		return -ENOMEM;
-	init_completion(&w.wait);
+
 	w.file = file;
-	bio->bi_private = &w;
-	bio->bi_bdev = NULL;
-	loop_make_request(lo->lo_queue, bio);
-	wait_for_completion(&w.wait);
+
+	/* freeze queue and wait for completion of scheduled requests */
+	blk_mq_freeze_queue(lo->lo_queue);
+
+	/* do the switch action */
+	do_loop_switch(lo, &w);
+
+	/* unfreeze */
+	blk_mq_unfreeze_queue(lo->lo_queue);
+
 	return 0;
 }
 
@@ -596,39 +530,10 @@ static int loop_switch(struct loop_device *lo, struct file *file)
  */
 static int loop_flush(struct loop_device *lo)
 {
-	/* loop not yet configured, no running thread, nothing to flush */
-	if (!lo->lo_thread)
-		return 0;
-
 	return loop_switch(lo, NULL);
 }
 
 /*
- * Do the actual switch; called from the BIO completion routine
- */
-static void do_loop_switch(struct loop_device *lo, struct switch_request *p)
-{
-	struct file *file = p->file;
-	struct file *old_file = lo->lo_backing_file;
-	struct address_space *mapping;
-
-	/* if no new file, only flush of queued bios requested */
-	if (!file)
-		goto out;
-
-	mapping = file->f_mapping;
-	mapping_set_gfp_mask(old_file->f_mapping, lo->old_gfp_mask);
-	lo->lo_backing_file = file;
-	lo->lo_blocksize = S_ISBLK(mapping->host->i_mode) ?
-		mapping->host->i_bdev->bd_block_size : PAGE_SIZE;
-	lo->old_gfp_mask = mapping_gfp_mask(mapping);
-	mapping_set_gfp_mask(mapping, lo->old_gfp_mask & ~(__GFP_IO|__GFP_FS));
-out:
-	complete(&p->wait);
-}
-
-
-/*
  * loop_change_fd switched the backing store of a loopback device to
  * a new file. This is useful for operating system installers to free up
  * the original file and in High Availability environments to switch to
@@ -889,12 +794,9 @@ static int loop_set_fd(struct loop_device *lo, fmode_t mode,
 	lo->transfer = transfer_none;
 	lo->ioctl = NULL;
 	lo->lo_sizelimit = 0;
-	lo->lo_bio_count = 0;
 	lo->old_gfp_mask = mapping_gfp_mask(mapping);
 	mapping_set_gfp_mask(mapping, lo->old_gfp_mask & ~(__GFP_IO|__GFP_FS));
 
-	bio_list_init(&lo->lo_bio_list);
-
 	if (!(lo_flags & LO_FLAGS_READ_ONLY) && file->f_op->fsync)
 		blk_queue_flush(lo->lo_queue, REQ_FLUSH);
 
@@ -906,14 +808,7 @@ static int loop_set_fd(struct loop_device *lo, fmode_t mode,
 
 	set_blocksize(bdev, lo_blocksize);
 
-	lo->lo_thread = kthread_create(loop_thread, lo, "loop%d",
-						lo->lo_number);
-	if (IS_ERR(lo->lo_thread)) {
-		error = PTR_ERR(lo->lo_thread);
-		goto out_clr;
-	}
 	lo->lo_state = Lo_bound;
-	wake_up_process(lo->lo_thread);
 	if (part_shift)
 		lo->lo_flags |= LO_FLAGS_PARTSCAN;
 	if (lo->lo_flags & LO_FLAGS_PARTSCAN)
@@ -925,18 +820,6 @@ static int loop_set_fd(struct loop_device *lo, fmode_t mode,
 	bdgrab(bdev);
 	return 0;
 
-out_clr:
-	loop_sysfs_exit(lo);
-	lo->lo_thread = NULL;
-	lo->lo_device = NULL;
-	lo->lo_backing_file = NULL;
-	lo->lo_flags = 0;
-	set_capacity(lo->lo_disk, 0);
-	invalidate_bdev(bdev);
-	bd_set_size(bdev, 0);
-	kobject_uevent(&disk_to_dev(bdev->bd_disk)->kobj, KOBJ_CHANGE);
-	mapping_set_gfp_mask(mapping, lo->old_gfp_mask);
-	lo->lo_state = Lo_unbound;
  out_putf:
 	fput(file);
  out:
@@ -1012,11 +895,6 @@ static int loop_clr_fd(struct loop_device *lo)
 
 	spin_lock_irq(&lo->lo_lock);
 	lo->lo_state = Lo_rundown;
-	spin_unlock_irq(&lo->lo_lock);
-
-	kthread_stop(lo->lo_thread);
-
-	spin_lock_irq(&lo->lo_lock);
 	lo->lo_backing_file = NULL;
 	spin_unlock_irq(&lo->lo_lock);
 
@@ -1028,7 +906,6 @@ static int loop_clr_fd(struct loop_device *lo)
 	lo->lo_offset = 0;
 	lo->lo_sizelimit = 0;
 	lo->lo_encrypt_key_size = 0;
-	lo->lo_thread = NULL;
 	memset(lo->lo_encrypt_key, 0, LO_KEY_SIZE);
 	memset(lo->lo_crypt_name, 0, LO_NAME_SIZE);
 	memset(lo->lo_file_name, 0, LO_NAME_SIZE);
@@ -1601,6 +1478,84 @@ int loop_unregister_transfer(int number)
 EXPORT_SYMBOL(loop_register_transfer);
 EXPORT_SYMBOL(loop_unregister_transfer);
 
+static int loop_queue_rq(struct blk_mq_hw_ctx *hctx, struct request *rq)
+{
+	struct loop_cmd *cmd = blk_mq_rq_to_pdu(rq);
+
+	queue_work(loop_wq, &cmd->work);
+	return BLK_MQ_RQ_QUEUE_OK;
+}
+
+static int loop_init_hctx(struct blk_mq_hw_ctx *hctx, void *data,
+			  unsigned int index)
+{
+	struct loop_device *lo = data;
+
+	hctx->driver_data = lo;
+	return 0;
+}
+
+static void loop_softirq_done_fn(struct request *rq)
+{
+	blk_mq_end_io(rq, rq->errors);
+}
+
+static void loop_queue_work(struct work_struct *work)
+{
+	struct loop_cmd *cmd =
+		container_of(work, struct loop_cmd, work);
+	const bool write = cmd->rq->cmd_flags & REQ_WRITE;
+	struct loop_device *lo = cmd->lo;
+	int ret = -EIO;
+	struct bio *bio;
+
+	if (lo->lo_state != Lo_bound)
+		goto failed;
+
+	if (write && (lo->lo_flags & LO_FLAGS_READ_ONLY))
+		goto failed;
+
+	ret = 0;
+	__rq_for_each_bio(bio, cmd->rq)
+		ret |= loop_handle_bio(lo, bio);
+
+ failed:
+	if (ret)
+		cmd->rq->errors = -EIO;
+	blk_mq_complete_request(cmd->rq);
+}
+
+static int loop_init_request(void *data, struct request *rq,
+		unsigned int hctx_idx, unsigned int request_idx,
+		unsigned int numa_node)
+{
+	struct loop_cmd *cmd = blk_mq_rq_to_pdu(rq);
+
+	cmd->rq = rq;
+	cmd->lo = data;
+	INIT_WORK(&cmd->work, loop_queue_work);
+
+	return 0;
+}
+
+static int loop_init_flush_rq(void *data, struct request_queue *q,
+		struct request *flush_rq,
+		const struct request *src_rq)
+{
+	/* borrow initialization helper for common rq */
+	loop_init_request(data, flush_rq, 0, -1, NUMA_NO_NODE);
+	return 0;
+}
+
+static struct blk_mq_ops loop_mq_ops = {
+	.queue_rq       = loop_queue_rq,
+	.map_queue      = blk_mq_map_queue,
+	.init_hctx	= loop_init_hctx,
+	.init_request	= loop_init_request,
+	.init_flush_rq  = loop_init_flush_rq,
+	.complete	= loop_softirq_done_fn,
+};
+
 static int loop_add(struct loop_device **l, int i)
 {
 	struct loop_device *lo;
@@ -1627,15 +1582,20 @@ static int loop_add(struct loop_device **l, int i)
 	i = err;
 
 	err = -ENOMEM;
-	lo->lo_queue = blk_alloc_queue(GFP_KERNEL);
-	if (!lo->lo_queue)
+	lo->tag_set.ops = &loop_mq_ops;
+	lo->tag_set.nr_hw_queues = 1;
+	lo->tag_set.queue_depth = 128;
+	lo->tag_set.numa_node = NUMA_NO_NODE;
+	lo->tag_set.cmd_size = sizeof(struct loop_cmd);
+	lo->tag_set.flags = BLK_MQ_F_SHOULD_MERGE;
+	lo->tag_set.driver_data = lo;
+
+	if (blk_mq_alloc_tag_set(&lo->tag_set))
 		goto out_free_idr;
 
-	/*
-	 * set queue make_request_fn
-	 */
-	blk_queue_make_request(lo->lo_queue, loop_make_request);
-	lo->lo_queue->queuedata = lo;
+	lo->lo_queue = blk_mq_init_queue(&lo->tag_set);
+	if (!lo->lo_queue)
+		goto out_cleanup_tags;
 
 	disk = lo->lo_disk = alloc_disk(1 << part_shift);
 	if (!disk)
@@ -1664,9 +1624,6 @@ static int loop_add(struct loop_device **l, int i)
 	disk->flags |= GENHD_FL_EXT_DEVT;
 	mutex_init(&lo->lo_ctl_mutex);
 	lo->lo_number		= i;
-	lo->lo_thread		= NULL;
-	init_waitqueue_head(&lo->lo_event);
-	init_waitqueue_head(&lo->lo_req_wait);
 	spin_lock_init(&lo->lo_lock);
 	disk->major		= LOOP_MAJOR;
 	disk->first_minor	= i << part_shift;
@@ -1680,6 +1637,8 @@ static int loop_add(struct loop_device **l, int i)
 
 out_free_queue:
 	blk_cleanup_queue(lo->lo_queue);
+out_cleanup_tags:
+	blk_mq_free_tag_set(&lo->tag_set);
 out_free_idr:
 	idr_remove(&loop_index_idr, i);
 out_free_dev:
@@ -1692,6 +1651,7 @@ static void loop_remove(struct loop_device *lo)
 {
 	del_gendisk(lo->lo_disk);
 	blk_cleanup_queue(lo->lo_queue);
+	blk_mq_free_tag_set(&lo->tag_set);
 	put_disk(lo->lo_disk);
 	kfree(lo);
 }
@@ -1884,6 +1844,10 @@ static int __init loop_init(void)
 		loop_add(&lo, i);
 	mutex_unlock(&loop_index_mutex);
 
+	loop_wq = alloc_workqueue("kloopd", WQ_MEM_RECLAIM | WQ_HIGHPRI, 0);
+	if (!loop_wq)
+		panic("Failed to create kloopd\n");
+
 	printk(KERN_INFO "loop: module loaded\n");
 	return 0;
 
diff --git a/drivers/block/loop.h b/drivers/block/loop.h
index 90df5d6..be796c7 100644
--- a/drivers/block/loop.h
+++ b/drivers/block/loop.h
@@ -13,6 +13,7 @@
 #include <linux/blkdev.h>
 #include <linux/spinlock.h>
 #include <linux/mutex.h>
+#include <linux/workqueue.h>
 #include <uapi/linux/loop.h>
 
 /* Possible states of device */
@@ -52,19 +53,20 @@ struct loop_device {
 	gfp_t		old_gfp_mask;
 
 	spinlock_t		lo_lock;
-	struct bio_list		lo_bio_list;
-	unsigned int		lo_bio_count;
 	int			lo_state;
 	struct mutex		lo_ctl_mutex;
-	struct task_struct	*lo_thread;
-	wait_queue_head_t	lo_event;
-	/* wait queue for incoming requests */
-	wait_queue_head_t	lo_req_wait;
 
 	struct request_queue	*lo_queue;
+	struct blk_mq_tag_set	tag_set;
 	struct gendisk		*lo_disk;
 };
 
+struct loop_cmd {
+	struct work_struct work;
+	struct request *rq;
+	struct loop_device *lo;
+};
+
 /* Support for loadable transfer modules */
 struct loop_func_table {
 	int number;	/* filter type */ 
-- 
1.7.9.5


+	lo->tag_set.cmd_size = sizeof(struct loop_cmd);
+	lo->tag_set.flags = BLK_MQ_F_SHOULD_MERGE;
+	lo->tag_set.driver_data = lo;
+
+	if (blk_mq_alloc_tag_set(&lo->tag_set))
 		goto out_free_idr;
 
-	/*
-	 * set queue make_request_fn
-	 */
-	blk_queue_make_request(lo->lo_queue, loop_make_request);
-	lo->lo_queue->queuedata = lo;
+	lo->lo_queue = blk_mq_init_queue(&lo->tag_set);
+	if (!lo->lo_queue)
+		goto out_cleanup_tags;
 
 	disk = lo->lo_disk = alloc_disk(1 << part_shift);
 	if (!disk)
@@ -1664,9 +1624,6 @@ static int loop_add(struct loop_device **l, int i)
 	disk->flags |= GENHD_FL_EXT_DEVT;
 	mutex_init(&lo->lo_ctl_mutex);
 	lo->lo_number		= i;
-	lo->lo_thread		= NULL;
-	init_waitqueue_head(&lo->lo_event);
-	init_waitqueue_head(&lo->lo_req_wait);
 	spin_lock_init(&lo->lo_lock);
 	disk->major		= LOOP_MAJOR;
 	disk->first_minor	= i << part_shift;
@@ -1680,6 +1637,8 @@ static int loop_add(struct loop_device **l, int i)
 
 out_free_queue:
 	blk_cleanup_queue(lo->lo_queue);
+out_cleanup_tags:
+	blk_mq_free_tag_set(&lo->tag_set);
 out_free_idr:
 	idr_remove(&loop_index_idr, i);
 out_free_dev:
@@ -1692,6 +1651,7 @@ static void loop_remove(struct loop_device *lo)
 {
 	del_gendisk(lo->lo_disk);
 	blk_cleanup_queue(lo->lo_queue);
+	blk_mq_free_tag_set(&lo->tag_set);
 	put_disk(lo->lo_disk);
 	kfree(lo);
 }
@@ -1884,6 +1844,10 @@ static int __init loop_init(void)
 		loop_add(&lo, i);
 	mutex_unlock(&loop_index_mutex);
 
+	loop_wq = alloc_workqueue("kloopd", WQ_MEM_RECLAIM | WQ_HIGHPRI, 0);
+	if (!loop_wq)
+		panic("Failed to create kloopd\n");
+
 	printk(KERN_INFO "loop: module loaded\n");
 	return 0;
 
diff --git a/drivers/block/loop.h b/drivers/block/loop.h
index 90df5d6..be796c7 100644
--- a/drivers/block/loop.h
+++ b/drivers/block/loop.h
@@ -13,6 +13,7 @@
 #include <linux/blkdev.h>
 #include <linux/spinlock.h>
 #include <linux/mutex.h>
+#include <linux/workqueue.h>
 #include <uapi/linux/loop.h>
 
 /* Possible states of device */
@@ -52,19 +53,20 @@ struct loop_device {
 	gfp_t		old_gfp_mask;
 
 	spinlock_t		lo_lock;
-	struct bio_list		lo_bio_list;
-	unsigned int		lo_bio_count;
 	int			lo_state;
 	struct mutex		lo_ctl_mutex;
-	struct task_struct	*lo_thread;
-	wait_queue_head_t	lo_event;
-	/* wait queue for incoming requests */
-	wait_queue_head_t	lo_req_wait;
 
 	struct request_queue	*lo_queue;
+	struct blk_mq_tag_set	tag_set;
 	struct gendisk		*lo_disk;
 };
 
+struct loop_cmd {
+	struct work_struct work;
+	struct request *rq;
+	struct loop_device *lo;
+};
+
 /* Support for loadable transfer modules */
 struct loop_func_table {
 	int number;	/* filter type */ 
-- 
1.7.9.5


^ permalink raw reply related	[flat|nested] 74+ messages in thread

* [PATCH v1 6/9] block: loop: say goodbye to bio
  2014-08-14 15:50 ` Ming Lei
@ 2014-08-14 15:50   ` Ming Lei
  -1 siblings, 0 replies; 74+ messages in thread
From: Ming Lei @ 2014-08-14 15:50 UTC (permalink / raw)
  To: Jens Axboe, linux-kernel, Andrew Morton, Dave Kleikamp
  Cc: Zach Brown, Benjamin LaHaise, Christoph Hellwig, Kent Overstreet,
	linux-aio, linux-fsdevel, Dave Chinner, Ming Lei

Switch loop over to block requests completely.

Signed-off-by: Ming Lei <ming.lei@canonical.com>
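
For readers unfamiliar with request-level iteration, here is a minimal
sketch (illustration only, not part of the patch) of the iterator that
replaces the per-bio loops, assuming the 3.16-era block API this series
is written against:

	#include <linux/blkdev.h>

	/* rq_for_each_segment() visits every bio_vec of every bio in
	 * the request, so code that used to walk one bio's segments
	 * now walks the whole request in a single loop. */
	static void walk_request_segments(struct request *rq)
	{
		struct req_iterator iter;
		struct bio_vec bvec;

		rq_for_each_segment(bvec, rq, iter) {
			/* bvec.bv_page, bvec.bv_offset and bvec.bv_len
			 * describe one contiguous segment */
		}
	}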
---
 drivers/block/loop.c |   46 ++++++++++++++++++++--------------------------
 1 file changed, 20 insertions(+), 26 deletions(-)

diff --git a/drivers/block/loop.c b/drivers/block/loop.c
index 1af5265..ad43d4b 100644
--- a/drivers/block/loop.c
+++ b/drivers/block/loop.c
@@ -287,12 +287,12 @@ static int do_lo_send_write(struct loop_device *lo, struct bio_vec *bvec,
 	return ret;
 }
 
-static int lo_send(struct loop_device *lo, struct bio *bio, loff_t pos)
+static int lo_send(struct loop_device *lo, struct request *rq, loff_t pos)
 {
 	int (*do_lo_send)(struct loop_device *, struct bio_vec *, loff_t,
 			struct page *page);
 	struct bio_vec bvec;
-	struct bvec_iter iter;
+	struct req_iterator iter;
 	struct page *page = NULL;
 	int ret = 0;
 
@@ -306,7 +306,7 @@ static int lo_send(struct loop_device *lo, struct bio *bio, loff_t pos)
 		do_lo_send = do_lo_send_direct_write;
 	}
 
-	bio_for_each_segment(bvec, bio, iter) {
+	rq_for_each_segment(bvec, rq, iter) {
 		ret = do_lo_send(lo, &bvec, pos, page);
 		if (ret < 0)
 			break;
@@ -394,19 +394,22 @@ do_lo_receive(struct loop_device *lo,
 }
 
 static int
-lo_receive(struct loop_device *lo, struct bio *bio, int bsize, loff_t pos)
+lo_receive(struct loop_device *lo, struct request *rq, int bsize, loff_t pos)
 {
 	struct bio_vec bvec;
-	struct bvec_iter iter;
+	struct req_iterator iter;
 	ssize_t s;
 
-	bio_for_each_segment(bvec, bio, iter) {
+	rq_for_each_segment(bvec, rq, iter) {
 		s = do_lo_receive(lo, &bvec, bsize, pos);
 		if (s < 0)
 			return s;
 
 		if (s != bvec.bv_len) {
-			zero_fill_bio(bio);
+			struct bio *bio;
+
+			__rq_for_each_bio(bio, rq)
+				zero_fill_bio(bio);
 			break;
 		}
 		pos += bvec.bv_len;
@@ -414,17 +417,17 @@ lo_receive(struct loop_device *lo, struct bio *bio, int bsize, loff_t pos)
 	return 0;
 }
 
-static int do_bio_filebacked(struct loop_device *lo, struct bio *bio)
+static int do_req_filebacked(struct loop_device *lo, struct request *rq)
 {
 	loff_t pos;
 	int ret;
 
-	pos = ((loff_t) bio->bi_iter.bi_sector << 9) + lo->lo_offset;
+	pos = ((loff_t) blk_rq_pos(rq) << 9) + lo->lo_offset;
 
-	if (bio_rw(bio) == WRITE) {
+	if (rq->cmd_flags & REQ_WRITE) {
 		struct file *file = lo->lo_backing_file;
 
-		if (bio->bi_rw & REQ_FLUSH) {
+		if (rq->cmd_flags & REQ_FLUSH) {
 			ret = vfs_fsync(file, 0);
 			if (unlikely(ret && ret != -EINVAL)) {
 				ret = -EIO;
@@ -438,7 +441,7 @@ static int do_bio_filebacked(struct loop_device *lo, struct bio *bio)
 		 * encryption is enabled, because it may give an attacker
 		 * useful information.
 		 */
-		if (bio->bi_rw & REQ_DISCARD) {
+		if (rq->cmd_flags & REQ_DISCARD) {
 			struct file *file = lo->lo_backing_file;
 			int mode = FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE;
 
@@ -448,22 +451,22 @@ static int do_bio_filebacked(struct loop_device *lo, struct bio *bio)
 				goto out;
 			}
 			ret = file->f_op->fallocate(file, mode, pos,
-						    bio->bi_iter.bi_size);
+						    blk_rq_bytes(rq));
 			if (unlikely(ret && ret != -EINVAL &&
 				     ret != -EOPNOTSUPP))
 				ret = -EIO;
 			goto out;
 		}
 
-		ret = lo_send(lo, bio, pos);
+		ret = lo_send(lo, rq, pos);
 
-		if ((bio->bi_rw & REQ_FUA) && !ret) {
+		if ((rq->cmd_flags & REQ_FUA) && !ret) {
 			ret = vfs_fsync(file, 0);
 			if (unlikely(ret && ret != -EINVAL))
 				ret = -EIO;
 		}
 	} else
-		ret = lo_receive(lo, bio, lo->lo_blocksize, pos);
+		ret = lo_receive(lo, rq, lo->lo_blocksize, pos);
 
 out:
 	return ret;
@@ -474,12 +477,6 @@ struct switch_request {
 	struct completion wait;
 };
 
-static inline int loop_handle_bio(struct loop_device *lo, struct bio *bio)
-{
-	int ret = do_bio_filebacked(lo, bio);
-	return ret;
-}
-
 /*
  * Do the actual switch; called from the BIO completion routine
  */
@@ -1507,7 +1504,6 @@ static void loop_queue_work(struct work_struct *work)
 	const bool write = cmd->rq->cmd_flags & REQ_WRITE;
 	struct loop_device *lo = cmd->lo;
 	int ret = -EIO;
-	struct bio *bio;
 
 	if (lo->lo_state != Lo_bound)
 		goto failed;
@@ -1515,9 +1511,7 @@ static void loop_queue_work(struct work_struct *work)
 	if (write && (lo->lo_flags & LO_FLAGS_READ_ONLY))
 		goto failed;
 
-	ret = 0;
-	__rq_for_each_bio(bio, cmd->rq)
-		ret |= loop_handle_bio(lo, bio);
+	ret = do_req_filebacked(lo, cmd->rq);
 
  failed:
 	if (ret)
-- 
1.7.9.5


^ permalink raw reply related	[flat|nested] 74+ messages in thread

* [PATCH v1 7/9] block: loop: introduce lo_discard() and lo_req_flush()
  2014-08-14 15:50 ` Ming Lei
@ 2014-08-14 15:50   ` Ming Lei
  -1 siblings, 0 replies; 74+ messages in thread
From: Ming Lei @ 2014-08-14 15:50 UTC (permalink / raw)
  To: Jens Axboe, linux-kernel, Andrew Morton, Dave Kleikamp
  Cc: Zach Brown, Benjamin LaHaise, Christoph Hellwig, Kent Overstreet,
	linux-aio, linux-fsdevel, Dave Chinner, Ming Lei

No behaviour change; just move the handling of REQ_DISCARD
and REQ_FLUSH into these two functions.

Signed-off-by: Ming Lei <ming.lei@canonical.com>
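
As background for the punch-hole comment being moved, here is a
hypothetical userspace demo (not part of the patch) of the same
fallocate() call that lo_discard() issues against the backing file;
"backing.img" and the offsets are made-up values:

	#define _GNU_SOURCE	/* exposes fallocate() and FALLOC_FL_* */
	#include <fcntl.h>
	#include <stdio.h>
	#include <unistd.h>

	int main(void)
	{
		int fd = open("backing.img", O_RDWR);

		if (fd < 0)
			return 1;
		/* reclaim 1MiB at offset 4MiB; the file size is unchanged */
		if (fallocate(fd, FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE,
			      4 << 20, 1 << 20))
			perror("fallocate");
		close(fd);
		return 0;
	}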
---
 drivers/block/loop.c |   73 +++++++++++++++++++++++++++-----------------------
 1 file changed, 40 insertions(+), 33 deletions(-)

diff --git a/drivers/block/loop.c b/drivers/block/loop.c
index ad43d4b..3836dcc 100644
--- a/drivers/block/loop.c
+++ b/drivers/block/loop.c
@@ -417,6 +417,40 @@ lo_receive(struct loop_device *lo, struct request *rq, int bsize, loff_t pos)
 	return 0;
 }
 
+static int lo_discard(struct loop_device *lo, struct request *rq, loff_t pos)
+{
+	/*
+	 * We use punch hole to reclaim the free space used by the
+	 * image a.k.a. discard. However we do not support discard if
+	 * encryption is enabled, because it may give an attacker
+	 * useful information.
+	 */
+	struct file *file = lo->lo_backing_file;
+	int mode = FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE;
+	int ret;
+
+	if ((!file->f_op->fallocate) || lo->lo_encrypt_key_size) {
+		ret = -EOPNOTSUPP;
+		goto out;
+	}
+
+	ret = file->f_op->fallocate(file, mode, pos, blk_rq_bytes(rq));
+	if (unlikely(ret && ret != -EINVAL && ret != -EOPNOTSUPP))
+		ret = -EIO;
+ out:
+	return ret;
+}
+
+static int lo_req_flush(struct loop_device *lo, struct request *rq)
+{
+	struct file *file = lo->lo_backing_file;
+	int ret = vfs_fsync(file, 0);
+	if (unlikely(ret && ret != -EINVAL))
+		ret = -EIO;
+
+	return ret;
+}
+
 static int do_req_filebacked(struct loop_device *lo, struct request *rq)
 {
 	loff_t pos;
@@ -425,46 +459,19 @@ static int do_req_filebacked(struct loop_device *lo, struct request *rq)
 	pos = ((loff_t) blk_rq_pos(rq) << 9) + lo->lo_offset;
 
 	if (rq->cmd_flags & REQ_WRITE) {
-		struct file *file = lo->lo_backing_file;
-
-		if (rq->cmd_flags & REQ_FLUSH) {
-			ret = vfs_fsync(file, 0);
-			if (unlikely(ret && ret != -EINVAL)) {
-				ret = -EIO;
-				goto out;
-			}
-		}
 
-		/*
-		 * We use punch hole to reclaim the free space used by the
-		 * image a.k.a. discard. However we do not support discard if
-		 * encryption is enabled, because it may give an attacker
-		 * useful information.
-		 */
+		if (rq->cmd_flags & REQ_FLUSH)
+			ret = lo_req_flush(lo, rq);
+
 		if (rq->cmd_flags & REQ_DISCARD) {
-			struct file *file = lo->lo_backing_file;
-			int mode = FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE;
-
-			if ((!file->f_op->fallocate) ||
-			    lo->lo_encrypt_key_size) {
-				ret = -EOPNOTSUPP;
-				goto out;
-			}
-			ret = file->f_op->fallocate(file, mode, pos,
-						    blk_rq_bytes(rq));
-			if (unlikely(ret && ret != -EINVAL &&
-				     ret != -EOPNOTSUPP))
-				ret = -EIO;
+			ret = lo_discard(lo, rq, pos);
 			goto out;
 		}
 
 		ret = lo_send(lo, rq, pos);
 
-		if ((rq->cmd_flags & REQ_FUA) && !ret) {
-			ret = vfs_fsync(file, 0);
-			if (unlikely(ret && ret != -EINVAL))
-				ret = -EIO;
-		}
+		if ((rq->cmd_flags & REQ_FUA) && !ret)
+			ret = lo_req_flush(lo, rq);
 	} else
 		ret = lo_receive(lo, rq, lo->lo_blocksize, pos);
 
-- 
1.7.9.5


^ permalink raw reply related	[flat|nested] 74+ messages in thread

* [PATCH v1 8/9] block: loop: don't handle REQ_FUA explicitly
  2014-08-14 15:50 ` Ming Lei
@ 2014-08-14 15:50   ` Ming Lei
  -1 siblings, 0 replies; 74+ messages in thread
From: Ming Lei @ 2014-08-14 15:50 UTC (permalink / raw)
  To: Jens Axboe, linux-kernel, Andrew Morton, Dave Kleikamp
  Cc: Zach Brown, Benjamin LaHaise, Christoph Hellwig, Kent Overstreet,
	linux-aio, linux-fsdevel, Dave Chinner, Ming Lei

The block core handles REQ_FUA via its flush state machine, so
loop doesn't need to handle it explicitly.

Signed-off-by: Ming Lei <ming.lei@canonical.com>
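
For reference, the queue setup from earlier in the series that makes this
safe (quoted, not new code): loop advertises only REQ_FLUSH, so the block
core's flush machinery sequences a REQ_FUA write as write-then-flush on
the driver's behalf.

	if (!(lo_flags & LO_FLAGS_READ_ONLY) && file->f_op->fsync)
		blk_queue_flush(lo->lo_queue, REQ_FLUSH);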
---
 drivers/block/loop.c |   14 +++-----------
 1 file changed, 3 insertions(+), 11 deletions(-)

diff --git a/drivers/block/loop.c b/drivers/block/loop.c
index 3836dcc..0ce51ee 100644
--- a/drivers/block/loop.c
+++ b/drivers/block/loop.c
@@ -459,23 +459,15 @@ static int do_req_filebacked(struct loop_device *lo, struct request *rq)
 	pos = ((loff_t) blk_rq_pos(rq) << 9) + lo->lo_offset;
 
 	if (rq->cmd_flags & REQ_WRITE) {
-
 		if (rq->cmd_flags & REQ_FLUSH)
 			ret = lo_req_flush(lo, rq);
-
-		if (rq->cmd_flags & REQ_DISCARD) {
+		else if (rq->cmd_flags & REQ_DISCARD)
 			ret = lo_discard(lo, rq, pos);
-			goto out;
-		}
-
-		ret = lo_send(lo, rq, pos);
-
-		if ((rq->cmd_flags & REQ_FUA) && !ret)
-			ret = lo_req_flush(lo, rq);
+		else
+			ret = lo_send(lo, rq, pos);
 	} else
 		ret = lo_receive(lo, rq, lo->lo_blocksize, pos);
 
-out:
 	return ret;
 }
 
-- 
1.7.9.5


^ permalink raw reply related	[flat|nested] 74+ messages in thread

* [PATCH v1 9/9] block: loop: support submitting I/O via kernel aio
  2014-08-14 15:50 ` Ming Lei
@ 2014-08-14 15:50   ` Ming Lei
  -1 siblings, 0 replies; 74+ messages in thread
From: Ming Lei @ 2014-08-14 15:50 UTC (permalink / raw)
  To: Jens Axboe, linux-kernel, Andrew Morton, Dave Kleikamp
  Cc: Zach Brown, Benjamin LaHaise, Christoph Hellwig, Kent Overstreet,
	linux-aio, linux-fsdevel, Dave Chinner, Ming Lei

Part of the patch is based on Dave's previous post.

It is easy to observe that loop block device throughput
can be increased by > 100% in a single-job randread,
libaio-engine, direct I/O fio test.

Cc: Zach Brown <zab@zabbo.net>
Cc: Dave Kleikamp <dave.kleikamp@oracle.com>
Cc: Benjamin LaHaise <bcrl@kvack.org>
Signed-off-by: Ming Lei <ming.lei@canonical.com>
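
For context, this is roughly how a kernel caller drives the
aio_kernel_*() API from patches 1-2. It is a sketch mirroring the usage
below; the my_ctx/my_done names are made up and error handling is
trimmed:

	/* needs <linux/aio.h>, <linux/fs.h> and <linux/uio.h> */
	struct my_ctx {
		struct kiocb *iocb;
	};

	/* completion runs once the I/O finishes; res is bytes
	 * transferred or a negative errno */
	static void my_done(u64 data, long res)
	{
		struct my_ctx *ctx = (struct my_ctx *)(uintptr_t)data;

		aio_kernel_free(ctx->iocb);
	}

	static int my_submit_read(struct file *file, struct iov_iter *iter,
				  loff_t pos, struct my_ctx *ctx)
	{
		struct kiocb *iocb = aio_kernel_alloc(GFP_NOIO, 0);

		if (!iocb)
			return -ENOMEM;
		ctx->iocb = iocb;
		aio_kernel_init_rw(iocb, file, iov_iter_count(iter), pos,
				   my_done, (u64)(uintptr_t)ctx);
		/* returns -EIOCBQUEUED on async submission; on other
		 * returns the completion has already been called */
		return aio_kernel_submit(iocb, false, iter);
	}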
---
 drivers/block/loop.c      |  121 ++++++++++++++++++++++++++++++++++++++++-----
 drivers/block/loop.h      |    1 +
 include/uapi/linux/loop.h |    1 +
 3 files changed, 112 insertions(+), 11 deletions(-)

diff --git a/drivers/block/loop.c b/drivers/block/loop.c
index 0ce51ee..b57f603 100644
--- a/drivers/block/loop.c
+++ b/drivers/block/loop.c
@@ -76,6 +76,7 @@
 #include <linux/miscdevice.h>
 #include <linux/falloc.h>
 #include <linux/blk-mq.h>
+#include <linux/aio.h>
 #include "loop.h"
 
 #include <asm/uaccess.h>
@@ -451,22 +452,99 @@ static int lo_req_flush(struct loop_device *lo, struct request *rq)
 	return ret;
 }
 
-static int do_req_filebacked(struct loop_device *lo, struct request *rq)
+#ifdef CONFIG_AIO
+static void lo_rw_aio_complete(u64 data, long res)
+{
+	struct loop_cmd *cmd = (struct loop_cmd *)(uintptr_t)data;
+	struct request *rq = cmd->rq;
+
+	if (res > 0)
+		res = 0;
+	else if (res < 0)
+		res = -EIO;
+
+	rq->errors = res;
+	aio_kernel_free(cmd->iocb);
+	blk_mq_complete_request(rq);
+}
+
+static int lo_rw_aio(struct loop_device *lo, struct loop_cmd *cmd,
+		     bool write, loff_t pos)
+{
+	struct file *file = lo->lo_backing_file;
+	struct request *rq = cmd->rq;
+	struct kiocb *iocb;
+	unsigned int i = 0;
+	struct iov_iter iter;
+	struct bio_vec *bvec, bv;
+	size_t nr_segs = 0;
+	struct req_iterator r_iter;
+	int ret = -EIO;
+
+	/* how many segments */
+	rq_for_each_segment(bv, rq, r_iter)
+		nr_segs++;
+
+	iocb = aio_kernel_alloc(GFP_NOIO, nr_segs * sizeof(*bvec));
+	if (!iocb) {
+		ret = -ENOMEM;
+		goto out;
+	}
+
+	cmd->iocb = iocb;
+	bvec = (struct bio_vec *)(iocb + 1);
+	rq_for_each_segment(bv, rq, r_iter)
+		bvec[i++] = bv;
+
+	iter.type = ITER_BVEC | (write ? WRITE : 0);
+	iter.bvec = bvec;
+	iter.nr_segs = nr_segs;
+	iter.count = blk_rq_bytes(rq);
+	iter.iov_offset = 0;
+
+	aio_kernel_init_rw(iocb, file, iov_iter_count(&iter), pos,
+			   lo_rw_aio_complete, (u64)(uintptr_t)cmd);
+	ret = aio_kernel_submit(iocb, write, &iter);
+ out:
+	return ret;
+}
+#endif /* CONFIG_AIO */
+
+static int lo_io_rw(struct loop_device *lo, struct loop_cmd *cmd,
+		    bool write, loff_t pos)
+{
+#ifdef CONFIG_AIO
+	if (lo->lo_flags & LO_FLAGS_USE_AIO)
+		return lo_rw_aio(lo, cmd, write, pos);
+#endif
+	if (write)
+		return lo_send(lo, cmd->rq, pos);
+	else
+		return lo_receive(lo, cmd->rq, lo->lo_blocksize, pos);
+}
+
+static int do_req_filebacked(struct loop_device *lo,
+			     struct loop_cmd *cmd, bool *sync)
 {
 	loff_t pos;
 	int ret;
+	struct request *rq = cmd->rq;
 
+	*sync = false;
 	pos = ((loff_t) blk_rq_pos(rq) << 9) + lo->lo_offset;
 
 	if (rq->cmd_flags & REQ_WRITE) {
-		if (rq->cmd_flags & REQ_FLUSH)
+		if (rq->cmd_flags & REQ_FLUSH) {
 			ret = lo_req_flush(lo, rq);
-		else if (rq->cmd_flags & REQ_DISCARD)
+			*sync = true;
+		} else if (rq->cmd_flags & REQ_DISCARD) {
 			ret = lo_discard(lo, rq, pos);
-		else
-			ret = lo_send(lo, rq, pos);
+			*sync = true;
+		} else {
+			ret = lo_io_rw(lo, cmd, true, pos);
+		}
 	} else
-		ret = lo_receive(lo, rq, lo->lo_blocksize, pos);
+		ret = lo_io_rw(lo, cmd, false, pos);
 
 	return ret;
 }
@@ -771,6 +849,14 @@ static int loop_set_fd(struct loop_device *lo, fmode_t mode,
 	    !file->f_op->write)
 		lo_flags |= LO_FLAGS_READ_ONLY;
 
+#ifdef CONFIG_AIO
+	if (file->f_op->write_iter && file->f_op->read_iter &&
+	    mapping->a_ops->direct_IO) {
+		file->f_flags |= O_DIRECT;
+		lo_flags |= LO_FLAGS_USE_AIO;
+	}
+#endif
+
 	lo_blocksize = S_ISBLK(inode->i_mode) ?
 		inode->i_bdev->bd_block_size : PAGE_SIZE;
 
@@ -804,6 +890,17 @@ static int loop_set_fd(struct loop_device *lo, fmode_t mode,
 
 	set_blocksize(bdev, lo_blocksize);
 
+#ifdef CONFIG_AIO
+	/*
+	 * We must not send too-small direct-io requests, so we reflect
+	 * the minimum io size to the loop device's logical block size
+	 */
+	if ((lo_flags & LO_FLAGS_USE_AIO) && inode->i_sb->s_bdev)
+		blk_queue_logical_block_size(lo->lo_queue,
+					     bdev_io_min(inode->i_sb->s_bdev));
+#endif
+
+
 	lo->lo_state = Lo_bound;
 	if (part_shift)
 		lo->lo_flags |= LO_FLAGS_PARTSCAN;
@@ -1503,19 +1600,21 @@ static void loop_queue_work(struct work_struct *work)
 	const bool write = cmd->rq->cmd_flags & REQ_WRITE;
 	struct loop_device *lo = cmd->lo;
 	int ret = -EIO;
+	bool sync = true;
 
 	if (lo->lo_state != Lo_bound)
-		goto failed;
+		goto out;
 
 	if (write && (lo->lo_flags & LO_FLAGS_READ_ONLY))
-		goto failed;
+		goto out;
 
-	ret = do_req_filebacked(lo, cmd->rq);
+	ret = do_req_filebacked(lo, cmd, &sync);
 
- failed:
+ out:
 	if (ret)
 		cmd->rq->errors = -EIO;
-	blk_mq_complete_request(cmd->rq);
+	if (!(lo->lo_flags & LO_FLAGS_USE_AIO) || sync || ret)
+		blk_mq_complete_request(cmd->rq);
 }
 
 static int loop_init_request(void *data, struct request *rq,
diff --git a/drivers/block/loop.h b/drivers/block/loop.h
index be796c7..4004af5 100644
--- a/drivers/block/loop.h
+++ b/drivers/block/loop.h
@@ -65,6 +65,7 @@ struct loop_cmd {
 	struct work_struct work;
 	struct request *rq;
 	struct loop_device *lo;
+	struct kiocb *iocb;
 };
 
 /* Support for loadable transfer modules */
diff --git a/include/uapi/linux/loop.h b/include/uapi/linux/loop.h
index e0cecd2..6edc6b6 100644
--- a/include/uapi/linux/loop.h
+++ b/include/uapi/linux/loop.h
@@ -21,6 +21,7 @@ enum {
 	LO_FLAGS_READ_ONLY	= 1,
 	LO_FLAGS_AUTOCLEAR	= 4,
 	LO_FLAGS_PARTSCAN	= 8,
+	LO_FLAGS_USE_AIO	= 16,
 };
 
 #include <asm/posix_types.h>	/* for __kernel_old_dev_t */
-- 
1.7.9.5


^ permalink raw reply related	[flat|nested] 74+ messages in thread

* Re: [PATCH v1 0/9] block & aio: kernel aio and loop mq conversion
  2014-08-14 15:50 ` Ming Lei
@ 2014-08-14 16:53   ` Jens Axboe
  -1 siblings, 0 replies; 74+ messages in thread
From: Jens Axboe @ 2014-08-14 16:53 UTC (permalink / raw)
  To: Ming Lei, linux-kernel, Andrew Morton, Dave Kleikamp
  Cc: Zach Brown, Benjamin LaHaise, Christoph Hellwig, Kent Overstreet,
	linux-aio, linux-fsdevel, Dave Chinner

On 08/14/2014 09:50 AM, Ming Lei wrote:
> Hi,
> 
> The 1st two patches introduce kernel AIO support, most of
> is borrowed from Dave's work last year, and thanks to ITER_BVEC,
> it is much simper to implement kernel AIO now. AIO model
> is quite suitable for implementing kernel stuff, and can
> help improve both throughput and CPU utilization. Lots of
> kernel components should benefit from it, such as:
> 	- loop driver,
> 	- all kinds of kernel I/O target driver(SCSI, USB storage
> 	or UAS, ...)
> 	- kernel socket users might benefit from it too if socket
> 	AIO is mature
> 
> The following 6 patches convert current loop driver into blk-mq:
> 	- loop's scalability gets improved much
> 	- loop driver gets quite simplified, and the conversion can
> 	be throught as cleanup too
> 
> The 9th patch uses kernel AIO with O_DIRECT to improve loop's
> performance in single job situation, and avoid double cache
> issue for loop driver too.
> 
> With the change, loop block's performance can be doubled in my
> fio test(randread, single job, libaio). If more fio jobs are used,
> the throughput can be improved much more because of blk-mq.
> 
> Given loop is used quite widely, especially in VM environment,
> also the change is quite small, hope it can be merged finally.

Looks pretty damn good to me; even the kernel aio bits are nice and
simple. Do you have any numbers on loop performance before and after the
blk-mq conversion? Just curious.

-- 
Jens Axboe


^ permalink raw reply	[flat|nested] 74+ messages in thread

* Re: [PATCH v1 1/9] aio: add aio_kernel_() interface
  2014-08-14 15:50   ` Ming Lei
@ 2014-08-14 18:07     ` Zach Brown
  -1 siblings, 0 replies; 74+ messages in thread
From: Zach Brown @ 2014-08-14 18:07 UTC (permalink / raw)
  To: Ming Lei
  Cc: Jens Axboe, linux-kernel, Andrew Morton, Dave Kleikamp,
	Benjamin LaHaise, Christoph Hellwig, Kent Overstreet, linux-aio,
	linux-fsdevel, Dave Chinner, Alexander Viro

On Thu, Aug 14, 2014 at 11:50:32PM +0800, Ming Lei wrote:
> From: Dave Kleikamp <dave.kleikamp@oracle.com>
> 
> This adds an interface that lets kernel callers submit aio iocbs without
> going through the user space syscalls.  This lets kernel callers avoid
> the management limits and overhead of the context.  It will also let us
> integrate aio operations with other kernel apis that the user space
> interface doesn't have access to.
> 
> This patch is based on Dave's posts in below links:
> 
> 	https://lkml.org/lkml/2013/10/16/365
> 	https://groups.google.com/forum/#!topic/linux.kernel/l7mogGJZoKQ

(And some other weirdo's posts, almost 5 entire earth years ago:
 http://permalink.gmane.org/gmane.linux.file-systems/36246)

> +struct kiocb *aio_kernel_alloc(gfp_t gfp, unsigned extra)
> +{
> +	return kzalloc(sizeof(struct kiocb) + extra, gfp);

Is kzalloc really necessary?  It's insane, but in the past we've had
people whine about the cycle costs of zeroing fields that are to be
initialized:

	commit 23aee091d804efa8cc732a31c1ae5d625e1ec886
	Author: Jeff Moyer <jmoyer@redhat.com>
	Date:   Tue Dec 15 16:47:49 2009 -0800

	    dio: don't zero out the pages array inside struct dio

Maybe add a guard value to the ctx and have submission freak out if it's
called without being initialized?  If callers really want to zero they
can pass in __GFP_ZERO.

The extra allocation at the end that's freed is nice, but the callers
having a clumsy manual cast to access it isn't nice at all.  Can you add
a little helper to get a pointer to the extra allocation?  That'd let
the aio bits allocate the iocbs however they like (slab, per-cpu,
whatever) and keep extra allocations separate if that ends up making
sense.
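
Something like the following would do, say (hypothetical, just to
illustrate the shape):

	/* hide the "(struct bio_vec *)(iocb + 1)" cast callers do by hand */
	static inline void *aio_kernel_extra(struct kiocb *iocb)
	{
		return iocb + 1;
	}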

> +	iocb->ki_ctx = (void *)-1;

The magic -1 is gross.  Use a constant?  (bonus points for having it use
ERR_PTR() :))

> +	/*
> +	 * use same policy with userspace aio, req may have been
> +	 * completed already, so release it by aio completion.
> +	 */
> +	if (ret != -EIOCBQUEUED)
> +		iocb->ki_obj.complete(iocb->ki_user_data, ret);

I wonder if this needs to handle the restarting error codes like
aio_complete() does.

	commit a0c42bac79731276c9b2f28d54f9e658fcf843a2
	Author: Jan Kara <jack@suse.cz>
	Date:   Wed Sep 22 13:05:03 2010 -0700

	    aio: do not return ERESTARTSYS as a result of AIO

I like how this has evolved to get rid of the magic key and commands..
just the ki_ctx and calling iter methods, nice stuff.

- z

^ permalink raw reply	[flat|nested] 74+ messages in thread

* Re: [PATCH v1 0/9] block & aio: kernel aio and loop mq conversion
  2014-08-14 16:53   ` Jens Axboe
@ 2014-08-15 12:59     ` Ming Lei
  -1 siblings, 0 replies; 74+ messages in thread
From: Ming Lei @ 2014-08-15 12:59 UTC (permalink / raw)
  To: Jens Axboe
  Cc: Linux Kernel Mailing List, Andrew Morton, Dave Kleikamp,
	Zach Brown, Benjamin LaHaise, Christoph Hellwig, Kent Overstreet,
	open list:AIO, Linux FS Devel, Dave Chinner

On Fri, Aug 15, 2014 at 12:53 AM, Jens Axboe <axboe@kernel.dk> wrote:
> On 08/14/2014 09:50 AM, Ming Lei wrote:
>> Hi,
>>
>> The 1st two patches introduce kernel AIO support, most of
>> is borrowed from Dave's work last year, and thanks to ITER_BVEC,
>> it is much simper to implement kernel AIO now. AIO model
>> is quite suitable for implementing kernel stuff, and can
>> help improve both throughput and CPU utilization. Lots of
>> kernel components should benefit from it, such as:
>>       - loop driver,
>>       - all kinds of kernel I/O target driver(SCSI, USB storage
>>       or UAS, ...)
>>       - kernel socket users might benefit from it too if socket
>>       AIO is mature
>>
>> The following 6 patches convert current loop driver into blk-mq:
>>       - loop's scalability gets improved much
>>       - loop driver gets quite simplified, and the conversion can
>>       be throught as cleanup too
>>
>> The 9th patch uses kernel AIO with O_DIRECT to improve loop's
>> performance in single job situation, and avoid double cache
>> issue for loop driver too.
>>
>> With the change, loop block's performance can be doubled in my
>> fio test(randread, single job, libaio). If more fio jobs are used,
>> the throughput can be improved much more because of blk-mq.
>>
>> Given loop is used quite widely, especially in VM environment,
>> also the change is quite small, hope it can be merged finally.
>
> Looks pretty damn good to me, even the kernel aio bits are nice and
> simple. Do you have any numbers on loop performance before and after the
> blk-mq conversion? Just curious.

Your concern is right: previously I thought the mq conversion wouldn't improve
throughput, but I had ignored the workqueue's concurrency management. It
turns out the blk-mq conversion can improve throughput close to 10 times in
my test (loop over virtio-blk, backed by one image on SSD). After the mq
conversion it behaves like POSIX-style AIO thanks to the workqueue, and I
need to update the performance data in V2.

Actually, kernel AIO doesn't need such high concurrency.
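
For reference, this is the dispatch path from patch 5/9 that enables it
(quoted from the series, not new code): each request becomes its own work
item, so the workqueue's concurrency management runs many file I/Os in
parallel instead of one kthread draining a bio list.

	static int loop_queue_rq(struct blk_mq_hw_ctx *hctx, struct request *rq)
	{
		struct loop_cmd *cmd = blk_mq_rq_to_pdu(rq);

		queue_work(loop_wq, &cmd->work);
		return BLK_MQ_RQ_QUEUE_OK;
	}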

Thanks,
--
Ming Lei

^ permalink raw reply	[flat|nested] 74+ messages in thread

* Re: [PATCH v1 0/9] block & aio: kernel aio and loop mq conversion
  2014-08-15 12:59     ` Ming Lei
@ 2014-08-15 13:11       ` Christoph Hellwig
  -1 siblings, 0 replies; 74+ messages in thread
From: Christoph Hellwig @ 2014-08-15 13:11 UTC (permalink / raw)
  To: Ming Lei
  Cc: Jens Axboe, Linux Kernel Mailing List, Andrew Morton,
	Dave Kleikamp, Zach Brown, Benjamin LaHaise, Christoph Hellwig,
	Kent Overstreet, open list:AIO, Linux FS Devel, Dave Chinner

On Fri, Aug 15, 2014 at 08:59:56PM +0800, Ming Lei wrote:
> Your concern is right: previously I thought the mq conversion wouldn't improve
> throughput, but I had ignored the workqueue's concurrency management. It
> turns out the blk-mq conversion can improve throughput close to 10 times in
> my test (loop over virtio-blk, backed by one image on SSD). After the mq
> conversion it behaves like POSIX-style AIO thanks to the workqueue, and I
> need to update the performance data in V2.
> 
> Actually, kernel AIO doesn't need such high concurrency.

Can you just send a loop blk-mq conversion for now?  I think that's
a bit less controversial than the new kernel aio APIs, and keeping the
two separate is a good idea in general.


^ permalink raw reply	[flat|nested] 74+ messages in thread

* Re: [PATCH v1 1/9] aio: add aio_kernel_() interface
  2014-08-14 18:07     ` Zach Brown
@ 2014-08-15 13:20       ` Ming Lei
  -1 siblings, 0 replies; 74+ messages in thread
From: Ming Lei @ 2014-08-15 13:20 UTC (permalink / raw)
  To: Zach Brown
  Cc: Jens Axboe, linux-kernel, Andrew Morton, Dave Kleikamp,
	Benjamin LaHaise, Christoph Hellwig, Kent Overstreet, linux-aio,
	linux-fsdevel, Dave Chinner, Alexander Viro

On 8/15/14, Zach Brown <zab@zabbo.net> wrote:
> On Thu, Aug 14, 2014 at 11:50:32PM +0800, Ming Lei wrote:
>> From: Dave Kleikamp <dave.kleikamp@oracle.com>
>>
>> This adds an interface that lets kernel callers submit aio iocbs without
>> going through the user space syscalls.  This lets kernel callers avoid
>> the management limits and overhead of the context.  It will also let us
>> integrate aio operations with other kernel apis that the user space
>> interface doesn't have access to.
>>
>> This patch is based on Dave's posts in the links below:
>>
>> 	https://lkml.org/lkml/2013/10/16/365
>> 	https://groups.google.com/forum/#!topic/linux.kernel/l7mogGJZoKQ
>
> (And some other weirdo's posts, almost 5 entire earth years ago:
>  http://permalink.gmane.org/gmane.linux.file-systems/36246)

Wow, thank you guys for proposing the idea so early.

Care to add your Signed-off-by?  And Dave's too?

>
>> +struct kiocb *aio_kernel_alloc(gfp_t gfp, unsigned extra)
>> +{
>> +	return kzalloc(sizeof(struct kiocb) + extra, gfp);
>
> Is kzalloc really necessary?  It's insane, but in the past we've had

You are right, and kmalloc should be enough.

> people whine about the cycle costs of zeroing fields that are to be
> initialized:
>
> 	commit 23aee091d804efa8cc732a31c1ae5d625e1ec886
> 	Author: Jeff Moyer <jmoyer@redhat.com>
> 	Date:   Tue Dec 15 16:47:49 2009 -0800
>
> 	    dio: don't zero out the pages array inside struct dio
>
> Maybe add a guard value to the ctx and have submission freak out if it's
> called without being initialized?  If callers really want to zero they
> can pass in __GFP_ZERO.

At least for now, other fields won't be touched in the kernel AIO path, and
they can be handled in the future if needed.

>
> The extra allocation at the end that's freed is nice, but the callers
> having a clumsy manual cast to access it isn't nice at all.  Can you add
> a little helper to get a pointer to the extra allocation?    That'd let

OK.

> the aio bits allocate the iocbs however they like (slab, per-cpu,
> whatever) and have the extra allocations separate if that ends up making
> sense.

Maybe that will be needed in the future for the sake of performance; for
now it is OK not to introduce the extra complexity.

>
>> +	iocb->ki_ctx = (void *)-1;
>
> The magic -1 is gross.  Use a constant?  (bonus points for having it use
> ERR_PTR() :))

OK.
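
Something like below is what I have in mind for V2 (just a sketch; the
KIOCB_KERNEL_CTX and aio_kernel_extra() names are made up and may change):

	/* illustrative: a sentinel that can never be a valid ctx pointer */
	#define KIOCB_KERNEL_CTX	((struct kioctx *)ERR_PTR(-EINVAL))

	struct kiocb *aio_kernel_alloc(gfp_t gfp, unsigned extra)
	{
		/* callers initialize every field they use, so plain kmalloc */
		struct kiocb *iocb = kmalloc(sizeof(*iocb) + extra, gfp);

		if (iocb)
			iocb->ki_ctx = KIOCB_KERNEL_CTX;
		return iocb;
	}

	/* illustrative helper: reach the extra area without a manual cast */
	static inline void *aio_kernel_extra(struct kiocb *iocb)
	{
		return iocb + 1;
	}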

>
>> +	/*
>> +	 * use same policy with userspace aio, req may have been
>> +	 * completed already, so release it by aio completion.
>> +	 */
>> +	if (ret != -EIOCBQUEUED)
>> +		iocb->ki_obj.complete(iocb->ki_user_data, ret);
>
> I wonder if this needs to handle the restarting error codes like
> aio_complete() does.

For the same reason, the kernel path needn't restart either, and the caller
won't see such error codes.

>
> 	commit a0c42bac79731276c9b2f28d54f9e658fcf843a2
> 	Author: Jan Kara <jack@suse.cz>
> 	Date:   Wed Sep 22 13:05:03 2010 -0700
>
> 	    aio: do not return ERESTARTSYS as a result of AIO
>
> I like how this has evolved to get rid of the magic key and commands..
> just the ki_ctx and calling iter methods, nice stuff.


Thanks,
-- 
Ming Lei

^ permalink raw reply	[flat|nested] 74+ messages in thread

* Re: [PATCH v1 0/9] block & aio: kernel aio and loop mq conversion
  2014-08-15 13:11       ` Christoph Hellwig
@ 2014-08-15 14:32         ` Ming Lei
  -1 siblings, 0 replies; 74+ messages in thread
From: Ming Lei @ 2014-08-15 14:32 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Jens Axboe, Linux Kernel Mailing List, Andrew Morton,
	Dave Kleikamp, Zach Brown, Benjamin LaHaise, Kent Overstreet,
	open list:AIO, Linux FS Devel, Dave Chinner

On 8/15/14, Christoph Hellwig <hch@infradead.org> wrote:
> On Fri, Aug 15, 2014 at 08:59:56PM +0800, Ming Lei wrote:
>> Your concern is right: previously I thought that the mq conversion
>> wouldn't improve throughput, but I had ignored workqueue's concurrency
>> management.  It turns out the blk-mq conversion can improve throughput
>> close to 10 times in my test (loop over virtio-blk which is backed by
>> one image on SSD).  It is like POSIX-style AIO after the mq conversion,
>> thanks to workqueue, and I need to update the performance data in V2.
>>
>> Actually, kernel AIO doesn't need such high concurrency.
>
> Can you just send a loop blk-mq conversion for now?  I think that's

OK.

> a bit less controversial than the new kernel aio APIs, and keeping the
> two separate is a good idea in general.

Exactly, the two should make a perfect pair for the loop block driver, IMO.

Thanks,
-- 
Ming Lei

^ permalink raw reply	[flat|nested] 74+ messages in thread

* Re: [PATCH v1 4/9] blk-mq: introduce init_flush_rq_fn callback in 'blk_mq_ops'
  2014-08-14 15:50   ` Ming Lei
@ 2014-08-15 16:19     ` Jens Axboe
  -1 siblings, 0 replies; 74+ messages in thread
From: Jens Axboe @ 2014-08-15 16:19 UTC (permalink / raw)
  To: Ming Lei, linux-kernel, Andrew Morton, Dave Kleikamp
  Cc: Zach Brown, Benjamin LaHaise, Christoph Hellwig, Kent Overstreet,
	linux-aio, linux-fsdevel, Dave Chinner

On 08/14/2014 09:50 AM, Ming Lei wrote:
> Currently the pdu of the flush rq is simply copied from another rq;
> that isn't enough to initialize pointer fields properly, so introduce
> a callback for the driver to handle the case easily.

This is the only patch I don't really like. Can't we make do with
calling ->init_request() for this instead of having to add another
(weird) hook?

-- 
Jens Axboe


^ permalink raw reply	[flat|nested] 74+ messages in thread

* Re: [PATCH v1 5/9] block: loop: convert to blk-mq
  2014-08-14 15:50   ` Ming Lei
@ 2014-08-15 16:31     ` Christoph Hellwig
  -1 siblings, 0 replies; 74+ messages in thread
From: Christoph Hellwig @ 2014-08-15 16:31 UTC (permalink / raw)
  To: Ming Lei
  Cc: Jens Axboe, linux-kernel, Andrew Morton, Dave Kleikamp,
	Zach Brown, Benjamin LaHaise, Christoph Hellwig, Kent Overstreet,
	linux-aio, linux-fsdevel, Dave Chinner

> +
> +static int loop_init_hctx(struct blk_mq_hw_ctx *hctx, void *data,
> +			  unsigned int index)
> +{
> +	struct loop_device *lo = data;
> +
> +	hctx->driver_data = lo;

I don't think there is much of a point to store this in the hctx
instead of relying on the queue.
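
I.e. set queuedata once at setup time and take it from the request,
something like (sketch only):

	/* at device setup, instead of a loop_init_hctx hook: */
	lo->lo_queue->queuedata = lo;

	/* and wherever the request is handled: */
	struct loop_device *lo = rq->q->queuedata;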

> +static void loop_softirq_done_fn(struct request *rq)
> +{
> +	blk_mq_end_io(rq, rq->errors);
> +}

No need for a no-op softirq done function.

> +static void loop_queue_work(struct work_struct *work)

Offloading work straight to a workqueue doesn't make much sense
in the blk-mq model as we'll usually be called from one.  If you
need to avoid the cases where we are called directly, a flag for
the blk-mq code to always schedule a workqueue sounds like a much
better plan.


^ permalink raw reply	[flat|nested] 74+ messages in thread

* Re: [PATCH v1 5/9] block: loop: convert to blk-mq
  2014-08-15 16:31     ` Christoph Hellwig
  (?)
@ 2014-08-15 16:36     ` Jens Axboe
  2014-08-15 16:46       ` Jens Axboe
  -1 siblings, 1 reply; 74+ messages in thread
From: Jens Axboe @ 2014-08-15 16:36 UTC (permalink / raw)
  To: Christoph Hellwig, Ming Lei
  Cc: linux-kernel, Andrew Morton, Dave Kleikamp, Zach Brown,
	Benjamin LaHaise, Kent Overstreet, linux-aio, linux-fsdevel,
	Dave Chinner

On 08/15/2014 10:31 AM, Christoph Hellwig wrote:
>> +static void loop_queue_work(struct work_struct *work)
> 
> Offloading work straight to a workqueue doesn't make much sense
> in the blk-mq model as we'll usually be called from one.  If you
> need to avoid the cases where we are called directly, a flag for
> the blk-mq code to always schedule a workqueue sounds like a much
> better plan.

That's a good point - would clean up this bit, and be pretty close to a
one-liner to support in blk-mq for the drivers that always need blocking
context.

-- 
Jens Axboe


^ permalink raw reply	[flat|nested] 74+ messages in thread

* Re: [PATCH v1 5/9] block: loop: convert to blk-mq
  2014-08-15 16:36     ` Jens Axboe
@ 2014-08-15 16:46       ` Jens Axboe
  2014-08-16  8:06           ` Ming Lei
  0 siblings, 1 reply; 74+ messages in thread
From: Jens Axboe @ 2014-08-15 16:46 UTC (permalink / raw)
  To: Christoph Hellwig, Ming Lei
  Cc: linux-kernel, Andrew Morton, Dave Kleikamp, Zach Brown,
	Benjamin LaHaise, Kent Overstreet, linux-aio, linux-fsdevel,
	Dave Chinner

[-- Attachment #1: Type: text/plain, Size: 825 bytes --]

On 08/15/2014 10:36 AM, Jens Axboe wrote:
> On 08/15/2014 10:31 AM, Christoph Hellwig wrote:
>>> +static void loop_queue_work(struct work_struct *work)
>>
>> Offloading work straight to a workqueue doesn't make much sense
>> in the blk-mq model as we'll usually be called from one.  If you
>> need to avoid the cases where we are called directly, a flag for
>> the blk-mq code to always schedule a workqueue sounds like a much
>> better plan.
> 
> That's a good point - would clean up this bit, and be pretty close to a
> one-liner to support in blk-mq for the drivers that always need blocking
> context.

Something like this should do the trick - totally untested. But with
that, loop would just need to add BLK_MQ_F_WQ_CONTEXT to its tag set
flags and it could always do the work inline from ->queue_rq().
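
On the loop side the hookup would be roughly (sketch; tag_set is the
field the conversion patch adds to struct loop_device):

	/* opt into always-workqueue dispatch */
	lo->tag_set.flags = BLK_MQ_F_SHOULD_MERGE | BLK_MQ_F_WQ_CONTEXT;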

-- 
Jens Axboe


[-- Attachment #2: blk-mq-wq.patch --]
[-- Type: text/x-patch, Size: 1230 bytes --]

diff --git a/block/blk-mq.c b/block/blk-mq.c
index 5189cb1e478a..a97eb9a4af60 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -803,6 +803,9 @@ void blk_mq_run_hw_queue(struct blk_mq_hw_ctx *hctx, bool async)
 	if (unlikely(test_bit(BLK_MQ_S_STOPPED, &hctx->state)))
 		return;
 
+	if (hctx->flags & BLK_MQ_F_WQ_CONTEXT)
+		async = true;
+
 	if (!async && cpumask_test_cpu(smp_processor_id(), hctx->cpumask))
 		__blk_mq_run_hw_queue(hctx);
 	else if (hctx->queue->nr_hw_queues == 1)
@@ -1173,7 +1176,7 @@ static void blk_mq_make_request(struct request_queue *q, struct bio *bio)
 		goto run_queue;
 	}
 
-	if (is_sync) {
+	if (is_sync && !(data.hctx->flags & BLK_MQ_F_WQ_CONTEXT)) {
 		int ret;
 
 		blk_mq_bio_to_request(rq, bio);
diff --git a/include/linux/blk-mq.h b/include/linux/blk-mq.h
index eb726b9c5762..c7a8c5fdd380 100644
--- a/include/linux/blk-mq.h
+++ b/include/linux/blk-mq.h
@@ -127,7 +127,8 @@ enum {
 	BLK_MQ_RQ_QUEUE_ERROR	= 2,	/* end IO with error */
 
 	BLK_MQ_F_SHOULD_MERGE	= 1 << 0,
-	BLK_MQ_F_SHOULD_SORT	= 1 << 1,
+	BLK_MQ_F_WQ_CONTEXT	= 1 << 1,	/* ->queue_rq() must run from
+						 * a blocking context */
 	BLK_MQ_F_TAG_SHARED	= 1 << 2,
 	BLK_MQ_F_SG_MERGE	= 1 << 3,
 	BLK_MQ_F_SYSFS_UP	= 1 << 4,

^ permalink raw reply related	[flat|nested] 74+ messages in thread

* Re: [PATCH v1 4/9] blk-mq: introduce init_flush_rq_fn callback in 'blk_mq_ops'
  2014-08-15 16:19     ` Jens Axboe
@ 2014-08-16  7:49       ` Ming Lei
  -1 siblings, 0 replies; 74+ messages in thread
From: Ming Lei @ 2014-08-16  7:49 UTC (permalink / raw)
  To: Jens Axboe
  Cc: linux-kernel, Andrew Morton, Dave Kleikamp, Zach Brown,
	Benjamin LaHaise, Christoph Hellwig, Kent Overstreet, linux-aio,
	linux-fsdevel, Dave Chinner

On 8/16/14, Jens Axboe <axboe@kernel.dk> wrote:
> On 08/14/2014 09:50 AM, Ming Lei wrote:
>> Currently the pdu of the flush rq is simply copied from another rq;
>> that isn't enough to initialize pointer fields properly, so introduce
>> a callback for the driver to handle the case easily.
>
> This is the only patch I don't really like. Can't we make do with
> calling ->init_request() for this instead of having to add another
> (weird) hook?

I considered ->init_request() before, but it looks like there are some problems:

- from an API view, the 'hctx_idx' and 'request_idx' parameters don't make
sense for the flush rq, since it belongs to the request queue instead of
any one hctx

- init_request()/exit_request() are weird too, since they would be called
when queuing every flush req, though they should only be called once

- it is called before queuing each flush req, and might introduce a bit of
cost unnecessarily if init_request() does lots of stuff

- using init_request() may break some current drivers (like scsi)

Now I feel ->init_flush_rq() isn't good either; how about introducing
prepare_flush_rq() and unprepare_flush_rq()?  They can be lightweight
and at least have a documentation benefit.
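
Something like below (just a sketch of the proposed hooks; the exact
signature is TBD):

	struct blk_mq_ops {
		...
		/*
		 * Called once before/after the flush rq is used, so the
		 * driver can set up / tear down pointer fields in the pdu
		 * that a plain copy from another rq can't initialize.
		 */
		void	(*prepare_flush_rq)(struct request_queue *q,
					    struct request *flush_rq);
		void	(*unprepare_flush_rq)(struct request_queue *q,
					      struct request *flush_rq);
	};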

Thanks,
-- 
Ming Lei

^ permalink raw reply	[flat|nested] 74+ messages in thread

* Re: [PATCH v1 5/9] block: loop: convert to blk-mq
  2014-08-15 16:46       ` Jens Axboe
@ 2014-08-16  8:06           ` Ming Lei
  0 siblings, 0 replies; 74+ messages in thread
From: Ming Lei @ 2014-08-16  8:06 UTC (permalink / raw)
  To: Jens Axboe
  Cc: Christoph Hellwig, linux-kernel, Andrew Morton, Dave Kleikamp,
	Zach Brown, Benjamin LaHaise, Kent Overstreet, linux-aio,
	linux-fsdevel, Dave Chinner

On 8/16/14, Jens Axboe <axboe@kernel.dk> wrote:
> On 08/15/2014 10:36 AM, Jens Axboe wrote:
>> On 08/15/2014 10:31 AM, Christoph Hellwig wrote:
>>>> +static void loop_queue_work(struct work_struct *work)
>>>
>>> Offloading work straight to a workqueue doesn't make much sense
>>> in the blk-mq model as we'll usually be called from one.  If you
>>> need to avoid the cases where we are called directly, a flag for
>>> the blk-mq code to always schedule a workqueue sounds like a much
>>> better plan.
>>
>> That's a good point - would clean up this bit, and be pretty close to a
>> one-liner to support in blk-mq for the drivers that always need blocking
>> context.
>
> Something like this should do the trick - totally untested. But with
> that, loop would just need to add BLK_MQ_F_WQ_CONTEXT to its tag set
> flags and it could always do the work inline from ->queue_rq().

I think it is a good idea.

But for loop, there may be two problems:

- the default max_active for a bound workqueue is 256, which means several
slow loop devices might slow down the whole block system.  With kernel AIO
it won't be a big deal, but some block devices/filesystems may not support
direct I/O and will still fall back to the workqueue

- from section 6 (Guidelines) of Documentation/workqueue.txt:
if there is dependency among multiple work items used during memory
reclaim, they should be queued to separate wqs, each with WQ_MEM_RECLAIM
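
If we do go the workqueue way, I guess it would have to be a per-device
queue created at attach time, something like (sketch; the wq field is
made up):

	/*
	 * WQ_MEM_RECLAIM gives the queue its own rescuer thread, and
	 * max_active of 1 keeps one slow loop device from eating the
	 * shared worker pool.
	 */
	lo->wq = alloc_workqueue("loop%d", WQ_MEM_RECLAIM, 1, lo->lo_number);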


Thanks,
-- 
Ming Lei

^ permalink raw reply	[flat|nested] 74+ messages in thread

* Re: [PATCH v1 5/9] block: loop: convert to blk-mq
  2014-08-16  8:06           ` Ming Lei
@ 2014-08-17 17:48             ` Jens Axboe
  -1 siblings, 0 replies; 74+ messages in thread
From: Jens Axboe @ 2014-08-17 17:48 UTC (permalink / raw)
  To: Ming Lei
  Cc: Christoph Hellwig, linux-kernel, Andrew Morton, Dave Kleikamp,
	Zach Brown, Benjamin LaHaise, Kent Overstreet, linux-aio,
	linux-fsdevel, Dave Chinner

On 2014-08-16 02:06, Ming Lei wrote:
> On 8/16/14, Jens Axboe <axboe@kernel.dk> wrote:
>> On 08/15/2014 10:36 AM, Jens Axboe wrote:
>>> On 08/15/2014 10:31 AM, Christoph Hellwig wrote:
>>>>> +static void loop_queue_work(struct work_struct *work)
>>>>
>>>> Offloading work straight to a workqueue doesn't make much sense
>>>> in the blk-mq model as we'll usually be called from one.  If you
>>>> need to avoid the cases where we are called directly, a flag for
>>>> the blk-mq code to always schedule a workqueue sounds like a much
>>>> better plan.
>>>
>>> That's a good point - would clean up this bit, and be pretty close to a
>>> one-liner to support in blk-mq for the drivers that always need blocking
>>> context.
>>
>> Something like this should do the trick - totally untested. But with
>> that, loop would just need to add BLK_MQ_F_WQ_CONTEXT to its tag set
>> flags and it could always do the work inline from ->queue_rq().
>
> I think it is a good idea.
>
> But for loop, there may be two problems:
>
> - the default max_active for a bound workqueue is 256, which means several
> slow loop devices might slow down the whole block system.  With kernel AIO
> it won't be a big deal, but some block devices/filesystems may not support
> direct I/O and will still fall back to the workqueue
>
> - from section 6 (Guidelines) of Documentation/workqueue.txt:
> if there is dependency among multiple work items used during memory
> reclaim, they should be queued to separate wqs, each with WQ_MEM_RECLAIM

Both are good points. But I think this mainly means that we should 
support this through a potentially per-dispatch queue workqueue, 
separate from kblockd. There's no reason blk-mq can't support this with 
a per-hctx workqueue, for drivers that need it.

-- 
Jens Axboe


^ permalink raw reply	[flat|nested] 74+ messages in thread

* Re: [PATCH v1 4/9] blk-mq: introduce init_flush_rq_fn callback in 'blk_mq_ops'
  2014-08-16  7:49       ` Ming Lei
@ 2014-08-17 18:39         ` Jens Axboe
  -1 siblings, 0 replies; 74+ messages in thread
From: Jens Axboe @ 2014-08-17 18:39 UTC (permalink / raw)
  To: Ming Lei
  Cc: linux-kernel, Andrew Morton, Dave Kleikamp, Zach Brown,
	Benjamin LaHaise, Christoph Hellwig, Kent Overstreet, linux-aio,
	linux-fsdevel, Dave Chinner

On 2014-08-16 01:49, Ming Lei wrote:
> On 8/16/14, Jens Axboe <axboe@kernel.dk> wrote:
>> On 08/14/2014 09:50 AM, Ming Lei wrote:
>>> Currently the pdu of the flush rq is simply copied from another rq;
>>> that isn't enough to initialize pointer fields properly, so introduce
>>> a callback for the driver to handle the case easily.
>>
>> This is the only patch I don't really like. Can't we make do with
>> calling ->init_request() for this instead of having to add another
>> (weird) hook?
>
> I considered ->init_request() before, but it looks like there are some
> problems:
>
> - from an API view, the 'hctx_idx' and 'request_idx' parameters don't make
> sense for the flush rq, since it belongs to the request queue instead of
> any one hctx
>
> - init_request()/exit_request() are weird too, since they would be called
> when queuing every flush req, though they should only be called once
>
> - it is called before queuing each flush req, and might introduce a bit of
> cost unnecessarily if init_request() does lots of stuff
>
> - using init_request() may break some current drivers (like scsi)
>
> Now I feel ->init_flush_rq() isn't good either; how about introducing
> prepare_flush_rq() and unprepare_flush_rq()?  They can be lightweight
> and at least have a documentation benefit.

I think that's a better approach, let's do that.

-- 
Jens Axboe


^ permalink raw reply	[flat|nested] 74+ messages in thread

* Re: [PATCH v1 5/9] block: loop: convert to blk-mq
  2014-08-17 17:48             ` Jens Axboe
@ 2014-08-18  1:22               ` Ming Lei
  -1 siblings, 0 replies; 74+ messages in thread
From: Ming Lei @ 2014-08-18  1:22 UTC (permalink / raw)
  To: Jens Axboe
  Cc: Christoph Hellwig, Linux Kernel Mailing List, Andrew Morton,
	Dave Kleikamp, Zach Brown, Benjamin LaHaise, Kent Overstreet,
	open list:AIO, Linux FS Devel, Dave Chinner

On Mon, Aug 18, 2014 at 1:48 AM, Jens Axboe <axboe@kernel.dk> wrote:
> On 2014-08-16 02:06, Ming Lei wrote:
>>
>> On 8/16/14, Jens Axboe <axboe@kernel.dk> wrote:
>>>
>>> On 08/15/2014 10:36 AM, Jens Axboe wrote:
>>>>
>>>> On 08/15/2014 10:31 AM, Christoph Hellwig wrote:
>>>>>>
>>>>>> +static void loop_queue_work(struct work_struct *work)
>>>>>
>>>>>
>>>>> Offloading work straight to a workqueue doesn't make much sense
>>>>> in the blk-mq model as we'll usually be called from one.  If you
>>>>> need to avoid the cases where we are called directly, a flag for
>>>>> the blk-mq code to always schedule a workqueue sounds like a much
>>>>> better plan.
>>>>
>>>>
>>>> That's a good point - would clean up this bit, and be pretty close to a
>>>> one-liner to support in blk-mq for the drivers that always need blocking
>>>> context.
>>>
>>>
>>> Something like this should do the trick - totally untested. But with
>>> that, loop would just need to add BLK_MQ_F_WQ_CONTEXT to its tag set
>>> flags and it could always do the work inline from ->queue_rq().
>>
>>
>> I think it is a good idea.
>>
>> But for loop, there may be two problems:
>>
>> - the default max_active for a bound workqueue is 256, which means several
>> slow loop devices might slow down the whole block system.  With kernel AIO
>> it won't be a big deal, but some block devices/filesystems may not support
>> direct I/O and will still fall back to the workqueue
>>
>> - from section 6 (Guidelines) of Documentation/workqueue.txt:
>> if there is dependency among multiple work items used during memory
>> reclaim, they should be queued to separate wqs, each with WQ_MEM_RECLAIM
>
>
> Both are good points. But I think this mainly means that we should support
> this through a potentially per-dispatch queue workqueue, separate from
> kblockd. There's no reason blk-mq can't support this with a per-hctx
> workqueue, for drivers that need it.

Good idea, and a per-device workqueue should be enough if the
BLK_MQ_F_WQ_CONTEXT flag is set.

Thanks,

^ permalink raw reply	[flat|nested] 74+ messages in thread

* Re: [PATCH v1 5/9] block: loop: convert to blk-mq
  2014-08-18  1:22               ` Ming Lei
@ 2014-08-18 11:53                 ` Ming Lei
  -1 siblings, 0 replies; 74+ messages in thread
From: Ming Lei @ 2014-08-18 11:53 UTC (permalink / raw)
  To: Jens Axboe
  Cc: Christoph Hellwig, Linux Kernel Mailing List, Andrew Morton,
	Dave Kleikamp, Zach Brown, Benjamin LaHaise, Kent Overstreet,
	open list:AIO, Linux FS Devel, Dave Chinner

On Mon, Aug 18, 2014 at 9:22 AM, Ming Lei <ming.lei@canonical.com> wrote:
> On Mon, Aug 18, 2014 at 1:48 AM, Jens Axboe <axboe@kernel.dk> wrote:
>> On 2014-08-16 02:06, Ming Lei wrote:
>>>
>>> On 8/16/14, Jens Axboe <axboe@kernel.dk> wrote:
>>>>
>>>> On 08/15/2014 10:36 AM, Jens Axboe wrote:
>>>>>
>>>>> On 08/15/2014 10:31 AM, Christoph Hellwig wrote:
>>>>>>>
>>>>>>> +static void loop_queue_work(struct work_struct *work)
>>>>>>
>>>>>>
>>>>>> Offloading work straight to a workqueue doesn't make much sense
>>>>>> in the blk-mq model as we'll usually be called from one.  If you
>>>>>> need to avoid the cases where we are called directly, a flag for
>>>>>> the blk-mq code to always schedule a workqueue sounds like a much
>>>>>> better plan.
>>>>>
>>>>>
>>>>> That's a good point - would clean up this bit, and be pretty close to a
>>>>> one-liner to support in blk-mq for the drivers that always need blocking
>>>>> context.
>>>>
>>>>
>>>> Something like this should do the trick - totally untested. But with
>>>> that, loop would just need to add BLK_MQ_F_WQ_CONTEXT to its tag set
>>>> flags and it could always do the work inline from ->queue_rq().
>>>
>>>
>>> I think it is a good idea.
>>>
>>> But for loop, there may be two problems:
>>>
>>> - the default max_active for a bound workqueue is 256, which means several
>>> slow loop devices might slow down the whole block system.  With kernel AIO
>>> it won't be a big deal, but some block devices/filesystems may not support
>>> direct I/O and will still fall back to the workqueue
>>>
>>> - from section 6 (Guidelines) of Documentation/workqueue.txt:
>>> if there is dependency among multiple work items used during memory
>>> reclaim, they should be queued to separate wqs, each with WQ_MEM_RECLAIM
>>
>>
>> Both are good points. But I think this mainly means that we should support
>> this through a potentially per-dispatch queue workqueue, separate from
>> kblockd. There's no reason blk-mq can't support this with a per-hctx
>> workqueue, for drivers that need it.
>
> Good idea, and a per-device workqueue should be enough if the
> BLK_MQ_F_WQ_CONTEXT flag is set.

Maybe for most cases a per-device-class (per-driver) workqueue would be
enough, since dependency between devices driven by the same driver isn't
common; for example, loop over loop is absolutely insane.

I will keep the workqueue in loop-mq V2, and it should be easy to switch
to the new mechanism once it is ready.

Thanks,

^ permalink raw reply	[flat|nested] 74+ messages in thread

* Re: [PATCH v1 5/9] block: loop: convert to blk-mq
  2014-08-18 11:53                 ` Ming Lei
  (?)
@ 2014-08-19 20:50                 ` Jens Axboe
  2014-08-20  1:23                   ` Ming Lei
  2014-08-21  5:44                   ` Ming Lei
  -1 siblings, 2 replies; 74+ messages in thread
From: Jens Axboe @ 2014-08-19 20:50 UTC (permalink / raw)
  To: Ming Lei
  Cc: Christoph Hellwig, Linux Kernel Mailing List, Andrew Morton,
	Dave Kleikamp, Zach Brown, Benjamin LaHaise, Kent Overstreet,
	open, list

On 2014-08-18 06:53, Ming Lei wrote:
> On Mon, Aug 18, 2014 at 9:22 AM, Ming Lei <ming.lei@canonical.com> wrote:
>> On Mon, Aug 18, 2014 at 1:48 AM, Jens Axboe <axboe@kernel.dk> wrote:
>>> On 2014-08-16 02:06, Ming Lei wrote:
>>>>
>>>> On 8/16/14, Jens Axboe <axboe@kernel.dk> wrote:
>>>>>
>>>>> On 08/15/2014 10:36 AM, Jens Axboe wrote:
>>>>>>
>>>>>> On 08/15/2014 10:31 AM, Christoph Hellwig wrote:
>>>>>>>>
>>>>>>>> +static void loop_queue_work(struct work_struct *work)
>>>>>>>
>>>>>>>
>>>>>>> Offloading work straight to a workqueue doesn't make much sense
>>>>>>> in the blk-mq model as we'll usually be called from one.  If you
>>>>>>> need to avoid the cases where we are called directly, a flag for
>>>>>>> the blk-mq code to always schedule a workqueue sounds like a much
>>>>>>> better plan.
>>>>>>
>>>>>>
>>>>>> That's a good point - would clean up this bit, and be pretty close to a
>>>>>> one-liner to support in blk-mq for the drivers that always need blocking
>>>>>> context.
>>>>>
>>>>>
>>>>> Something like this should do the trick - totally untested. But with
>>>>> that, loop would just need to add BLK_MQ_F_WQ_CONTEXT to its tag set
>>>>> flags and it could always do the work inline from ->queue_rq().
>>>>
>>>>
>>>> I think it is a good idea.
>>>>
>>>> But for loop, there may be two problems:
>>>>
>>>> - the default max_active for a bound workqueue is 256, which means several
>>>> slow loop devices might slow down the whole block system.  With kernel AIO
>>>> it won't be a big deal, but some block devices/filesystems may not support
>>>> direct I/O and will still fall back to the workqueue
>>>>
>>>> - from section 6 (Guidelines) of Documentation/workqueue.txt:
>>>> if there is dependency among multiple work items used during memory
>>>> reclaim, they should be queued to separate wqs, each with WQ_MEM_RECLAIM
>>>
>>>
>>> Both are good points. But I think this mainly means that we should support
>>> this through a potentially per-dispatch queue workqueue, separate from
>>> kblockd. There's no reason blk-mq can't support this with a per-hctx
>>> workqueue, for drivers that need it.
>>
>> Good idea, and a per-device workqueue should be enough if the
>> BLK_MQ_F_WQ_CONTEXT flag is set.
>
> Maybe for most cases a per-device-class (per-driver) workqueue would be
> enough, since dependency between devices driven by the same driver isn't
> common; for example, loop over loop is absolutely insane.

It's insane, but it can happen. And given how cheap it is to do a 
workqueue, I don't see a reason why we should not. Loop over loop might 
seem nutty, but it's not that far out into the realm of nutty things 
that people end up doing.

> I will keep the workqueue in loop-mq V2, and it should be easy to switch
> to the new mechanism once it is ready.

Reworked a bit more:

http://git.kernel.dk/?p=linux-block.git;a=commit;h=a323185a761b9a54dc340d383695b4205ea258b6

Let's base loop-mq on the blk-mq workqueues; it would simplify it quite a
bit, and I don't think there's much point in doing v1 and then ripping it
out for v2.  Especially since it isn't queued up for 3.18 yet.

-- 
Jens Axboe


^ permalink raw reply	[flat|nested] 74+ messages in thread

* Re: [PATCH v1 5/9] block: loop: convert to blk-mq
  2014-08-19 20:50                 ` Jens Axboe
@ 2014-08-20  1:23                   ` Ming Lei
  2014-08-20 16:09                     ` Jens Axboe
  2014-08-21  5:44                   ` Ming Lei
  1 sibling, 1 reply; 74+ messages in thread
From: Ming Lei @ 2014-08-20  1:23 UTC (permalink / raw)
  To: Jens Axboe
  Cc: Christoph Hellwig, Linux Kernel Mailing List, Andrew Morton,
	Dave Kleikamp, Zach Brown, Benjamin LaHaise, Kent Overstreet,
	open list:AIO <linux-aio@kvack.org>,
	Linux FS Devel <linux-fsdevel@vger.kernel.org>,
	Dave Chinner, Tejun Heo

On Wed, Aug 20, 2014 at 4:50 AM, Jens Axboe <axboe@kernel.dk> wrote:
> On 2014-08-18 06:53, Ming Lei wrote:
>>
>> On Mon, Aug 18, 2014 at 9:22 AM, Ming Lei <ming.lei@canonical.com> wrote:
>>>
>>> On Mon, Aug 18, 2014 at 1:48 AM, Jens Axboe <axboe@kernel.dk> wrote:
>>>>
>>>> On 2014-08-16 02:06, Ming Lei wrote:
>>>>>
>>>>>
>>>>> On 8/16/14, Jens Axboe <axboe@kernel.dk> wrote:
>>>>>>
>>>>>>
>>>>>> On 08/15/2014 10:36 AM, Jens Axboe wrote:
>>>>>>>
>>>>>>>
>>>>>>> On 08/15/2014 10:31 AM, Christoph Hellwig wrote:
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> +static void loop_queue_work(struct work_struct *work)
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> Offloading work straight to a workqueue doesn't make much sense
>>>>>>>> in the blk-mq model as we'll usually be called from one.  If you
>>>>>>>> need to avoid the cases where we are called directly, a flag for
>>>>>>>> the blk-mq code to always schedule a workqueue sounds like a much
>>>>>>>> better plan.
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> That's a good point - would clean up this bit, and be pretty close
>>>>>>> to a one-liner to support in blk-mq for the drivers that always need
>>>>>>> blocking context.
>>>>>>
>>>>>>
>>>>>>
>>>>>> Something like this should do the trick - totally untested. But with
>>>>>> that, loop would just need to add BLK_MQ_F_WQ_CONTEXT to its tag set
>>>>>> flags and it could always do the work inline from ->queue_rq().
>>>>>
>>>>>
>>>>>
>>>>> I think it is a good idea.
>>>>>
>>>>> But for loop, there may be two problems:
>>>>>
>>>>> - the default max_active for a bound workqueue is 256, which means
>>>>> several slow loop devices might slow down the whole block system.
>>>>> With kernel AIO it won't be a big deal, but some block
>>>>> devices/filesystems may not support direct I/O and will still fall
>>>>> back to the workqueue
>>>>>
>>>>> - from section 6 (Guidelines) of Documentation/workqueue.txt:
>>>>> if there is dependency among multiple work items used during memory
>>>>> reclaim, they should be queued to separate wqs, each with WQ_MEM_RECLAIM
>>>>
>>>>
>>>>
>>>> Both are good points. But I think this mainly means that we should
>>>> support
>>>> this through a potentially per-dispatch queue workqueue, separate from
>>>> kblockd. There's no reason blk-mq can't support this with a per-hctx
>>>> workqueue, for drivers that need it.
>>>
>>>
>>> Good idea, and a per-device workqueue should be enough if the
>>> BLK_MQ_F_WQ_CONTEXT flag is set.
>>
>>
>>> Maybe for most cases a per-device-class (per-driver) workqueue would be
>>> enough, since dependency between devices driven by the same driver isn't
>>> common; for example, loop over loop is absolutely insane.
>
>
> It's insane, but it can happen. And given how cheap it is to do a workqueue,

A workqueue with WQ_MEM_RECLAIM needs to create a standalone kthread
(the rescuer) for the queue, so by default there would be 8 kthreads
created even if no one uses loop at all.  In the current implementation
the per-device thread is created only when a file or block device is
attached to the loop device, which may not be possible once blk-mq
supports per-device workqueues.

> I don't see a reason why we should not. Loop over loop might seem nutty, but
> it's not that far out into the realm of nutty things that people end up
> doing.

There is another reason I am still not sure workqueue is good for loop,
though I really do like workqueue for the sake of simplicity. :-)

- sequential read becomes a bit slow with workqueue, especially for some
fast block devices (such as null_blk)

- random read becomes a bit slow too for some fast devices (such as
null_blk) in some environments (it reproduces on my server, but not on my
laptop), even though it can improve throughput a lot for common devices
(HDD, SSD, ...)

From my investigation, context switches increase by almost 50% with
workqueue compared with a kthread for loop in a quad-core VM.  With a
kthread, requests can be handled as a batch in cases which won't block
in read()/write() (like null_blk, tmpfs, ...), but that isn't possible with
workqueue any more.  Also, block plug & unplug should be used with the
kthread to optimize the case, especially when kernel AIO is applied; that
is impossible with workqueue too.

So it looks like kthread plus kernel AIO is still not bad for the blk-mq
conversion, and it can improve throughput a lot too.  Or are there other
ideas?
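
The batching I have in mind with the kthread is roughly (sketch;
loop_next_request()/loop_handle_request() are made-up names):

	struct blk_plug plug;
	struct request *rq;

	/*
	 * Drain the whole queue under one plug so the I/O submitted
	 * against the backing file gets merged and dispatched in batch.
	 */
	blk_start_plug(&plug);
	while ((rq = loop_next_request(lo)) != NULL)
		loop_handle_request(lo, rq);
	blk_finish_plug(&plug);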


Thanks

>
>
>> I will keep the workqueue in loop-mq V2, and it should be easy to switch
>> to the new mechanism once it is ready.
>
>
> Reworked a bit more:
>
> http://git.kernel.dk/?p=linux-block.git;a=commit;h=a323185a761b9a54dc340d383695b4205ea258b6
>
> Let's base loop-mq on the blk-mq workqueues; it would simplify it quite a bit,
> and I don't think there's much point in doing v1 and then ripping it out for
> v2.  Especially since it isn't queued up for 3.18 yet.

^ permalink raw reply	[flat|nested] 74+ messages in thread

* Re: [PATCH v1 5/9] block: loop: convert to blk-mq
  2014-08-20  1:23                   ` Ming Lei
@ 2014-08-20 16:09                     ` Jens Axboe
  2014-08-21  2:54                       ` Ming Lei
  0 siblings, 1 reply; 74+ messages in thread
From: Jens Axboe @ 2014-08-20 16:09 UTC (permalink / raw)
  To: Ming Lei
  Cc: Christoph Hellwig, Linux Kernel Mailing List, Andrew Morton,
	Dave Kleikamp, Zach Brown, Benjamin LaHaise, Kent Overstreet,
	open, list

On 2014-08-19 20:23, Ming Lei wrote:
> On Wed, Aug 20, 2014 at 4:50 AM, Jens Axboe <axboe@kernel.dk> wrote:
>> On 2014-08-18 06:53, Ming Lei wrote:
>>>
>>> On Mon, Aug 18, 2014 at 9:22 AM, Ming Lei <ming.lei@canonical.com> wrote:
>>>>
>>>> On Mon, Aug 18, 2014 at 1:48 AM, Jens Axboe <axboe@kernel.dk> wrote:
>>>>>
>>>>> On 2014-08-16 02:06, Ming Lei wrote:
>>>>>>
>>>>>>
>>>>>> On 8/16/14, Jens Axboe <axboe@kernel.dk> wrote:
>>>>>>>
>>>>>>>
>>>>>>> On 08/15/2014 10:36 AM, Jens Axboe wrote:
>>>>>>>>
>>>>>>>>
>>>>>>>> On 08/15/2014 10:31 AM, Christoph Hellwig wrote:
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> +static void loop_queue_work(struct work_struct *work)
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> Offloading work straight to a workqueue doesn't make much sense
>>>>>>>>> in the blk-mq model as we'll usually be called from one.  If you
>>>>>>>>> need to avoid the cases where we are called directly, a flag for
>>>>>>>>> the blk-mq code to always schedule a workqueue sounds like a much
>>>>>>>>> better plan.
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> That's a good point - would clean up this bit, and be pretty close to
>>>>>>>> a
>>>>>>>> one-liner to support in blk-mq for the drivers that always need
>>>>>>>> blocking
>>>>>>>> context.
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> Something like this should do the trick - totally untested. But with
>>>>>>> that, loop would just need to add BLK_MQ_F_WQ_CONTEXT to it's tag set
>>>>>>> flags and it could always do the work inline from ->queue_rq().
>>>>>>
>>>>>>
>>>>>>
>>>>>> I think it is a good idea.
>>>>>>
>>>>>> But for loop, there may be two problems:
>>>>>>
>>>>>> - default max_active for bound workqueue is 256, which means several
>>>>>> slow
>>>>>> loop devices might slow down whole block system. With kernel AIO, it
>>>>>> won't
>>>>>> be a big deal, but some block/fs may not support direct I/O and still
>>>>>> fallback to
>>>>>> workqueue
>>>>>>
>>>>>> - 6. Guidelines of Documentation/workqueue.txt
>>>>>> If there is dependency among multiple work items used during memory
>>>>>> reclaim, they should be queued to separate wq each with WQ_MEM_RECLAIM.
>>>>>
>>>>>
>>>>>
>>>>> Both are good points. But I think this mainly means that we should
>>>>> support
>>>>> this through a potentially per-dispatch queue workqueue, separate from
>>>>> kblockd. There's no reason blk-mq can't support this with a per-hctx
>>>>> workqueue, for drivers that need it.
>>>>
>>>>
>>>> Good idea, and per-device workqueue should be enough if
>>>> BLK_MQ_F_WQ_CONTEXT flag is set.
>>>
>>>
>>> Maybe for most of cases per-device class(driver) workqueue should be
>>> enough since dependency between devices driven by same driver
>>> isn't common, for example, loop over loop is absolutely insane.
>>
>>
>> It's insane, but it can happen. And given how cheap it is to do a workqueue,
>
> Workqueue with WQ_MEM_RECLAIM need to create a standalone kthread
> for the queue, so at default there will be 8 kthreads created even no one
> uses loop at all.  From current implementation the per-device thread is
> created only when one file or blk device is attached to the loop device, which
> may not be possible when blk-mq supports per-device workqueue.

That is true, but I don't see this as a huge problem. An idle kthread 
is pretty much free...

>> I don't see a reason why we should not. Loop over loop might seem nutty, but
>> it's not that far out into the realm of nutty things that people end up
>> doing.
>
> Another reason I am still not sure if workqueue is good for loop, though I
> do really like workqueue for sake of simplicity, :-)
>
> - sequential read becomes a bit slow with workqueue, especially for some
> fast block(such as null_blk)
>
> - random read becomes a bit slow too for some fast devices(such as null_blk)
> in some environment(It is reproduced in my server, but can't in my laptop) even
> it can improve throughout quite much for common devices(HDD., SSD,..)

Thread offloading will always slow down some use cases, like sync(ish) 
IO. I'm not sure this is an argument in the kthread-vs-workqueue question, 
though; performance and behavior should be identical here?

>  From my investigation, context switch increases almost 50% with
> workqueue compared with kthread in loop in a quad-core VM. With
> kthread, requests may be handled as batch in cases which won't be
> blocked in read()/write()(like null_blk, tmpfs, ...), but it is impossible with
> workqueue any more.  Also block plug&unplug should have been used
> with kthread to optimize the case, especially when kernel AIO is applied,
> still impossible with work queue too.

OK, that one is actually a good point, since one need not do per-item 
queueing. We could handle different units, though. And we should have 
proper marking of the last item in a chain of stuff, so we might even be 
able to offload based on that instead of doing single items. It won't 
help the sync case, but for that, workqueue and kthread would be identical.

Or we could just provide a better alternative in blk-mq. Doing 
workqueues is just so damn easy, I'd be reluctant to add a kthread pool 
instead. It'd be much better to augment or fix workqueues to work well 
for this case as well.

-- 
Jens Axboe

^ permalink raw reply	[flat|nested] 74+ messages in thread

* Re: [PATCH v1 5/9] block: loop: convert to blk-mq
  2014-08-20 16:09                     ` Jens Axboe
@ 2014-08-21  2:54                       ` Ming Lei
  2014-08-21  2:58                         ` Jens Axboe
  0 siblings, 1 reply; 74+ messages in thread
From: Ming Lei @ 2014-08-21  2:54 UTC (permalink / raw)
  To: Jens Axboe
  Cc: Christoph Hellwig, Linux Kernel Mailing List, Andrew Morton,
	Dave Kleikamp, Zach Brown, Benjamin LaHaise, Kent Overstreet,
	open list:AIO <linux-aio@kvack.org>,
	Linux FS Devel <linux-fsdevel@vger.kernel.org>,
	Dave Chinner <david@fromorbit.com>,
	Tejun Heo

On Thu, Aug 21, 2014 at 12:09 AM, Jens Axboe <axboe@kernel.dk> wrote:
> On 2014-08-19 20:23, Ming Lei wrote:
>>
>> On Wed, Aug 20, 2014 at 4:50 AM, Jens Axboe <axboe@kernel.dk> wrote:
>>>
>>> On 2014-08-18 06:53, Ming Lei wrote:
>>>>
>>>>
>>>> On Mon, Aug 18, 2014 at 9:22 AM, Ming Lei <ming.lei@canonical.com>
>>>> wrote:
>>>>>
>>>>>
>>>>> On Mon, Aug 18, 2014 at 1:48 AM, Jens Axboe <axboe@kernel.dk> wrote:
>>>>>>
>>>>>>
>>>>>> On 2014-08-16 02:06, Ming Lei wrote:
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> On 8/16/14, Jens Axboe <axboe@kernel.dk> wrote:
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> On 08/15/2014 10:36 AM, Jens Axboe wrote:
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> On 08/15/2014 10:31 AM, Christoph Hellwig wrote:
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> +static void loop_queue_work(struct work_struct *work)
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> Offloading work straight to a workqueue dosn't make much sense
>>>>>>>>>> in the blk-mq model as we'll usually be called from one.  If you
>>>>>>>>>> need to avoid the cases where we are called directly a flag for
>>>>>>>>>> the blk-mq code to always schedule a workqueue sounds like a much
>>>>>>>>>> better plan.
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> That's a good point - would clean up this bit, and be pretty close
>>>>>>>>> to
>>>>>>>>> a
>>>>>>>>> one-liner to support in blk-mq for the drivers that always need
>>>>>>>>> blocking
>>>>>>>>> context.
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> Something like this should do the trick - totally untested. But with
>>>>>>>> that, loop would just need to add BLK_MQ_F_WQ_CONTEXT to it's tag
>>>>>>>> set
>>>>>>>> flags and it could always do the work inline from ->queue_rq().
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> I think it is a good idea.
>>>>>>>
>>>>>>> But for loop, there may be two problems:
>>>>>>>
>>>>>>> - default max_active for bound workqueue is 256, which means several
>>>>>>> slow
>>>>>>> loop devices might slow down whole block system. With kernel AIO, it
>>>>>>> won't
>>>>>>> be a big deal, but some block/fs may not support direct I/O and still
>>>>>>> fallback to
>>>>>>> workqueue
>>>>>>>
>>>>>>> - 6. Guidelines of Documentation/workqueue.txt
>>>>>>> If there is dependency among multiple work items used during memory
>>>>>>> reclaim, they should be queued to separate wq each with
>>>>>>> WQ_MEM_RECLAIM.
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>> Both are good points. But I think this mainly means that we should
>>>>>> support
>>>>>> this through a potentially per-dispatch queue workqueue, separate from
>>>>>> kblockd. There's no reason blk-mq can't support this with a per-hctx
>>>>>> workqueue, for drivers that need it.
>>>>>
>>>>>
>>>>>
>>>>> Good idea, and per-device workqueue should be enough if
>>>>> BLK_MQ_F_WQ_CONTEXT flag is set.
>>>>
>>>>
>>>>
>>>> Maybe for most of cases per-device class(driver) workqueue should be
>>>> enough since dependency between devices driven by same driver
>>>> isn't common, for example, loop over loop is absolutely insane.
>>>
>>>
>>>
>>> It's insane, but it can happen. And given how cheap it is to do a
>>> workqueue,
>>
>>
>> Workqueue with WQ_MEM_RECLAIM need to create a standalone kthread
>> for the queue, so at default there will be 8 kthreads created even no one
>> uses loop at all.  From current implementation the per-device thread is
>> created only when one file or blk device is attached to the loop device,
>> which
>> may not be possible when blk-mq supports per-device workqueue.
>
>
> That is true, but I don't see this as a huge problem. An idle kthread is
> pretty much free...

OK, I am fine with that too if no one complains about it, :-)

BTW, loop over loop won't be a problem since the loop driver can cut the
dependency and just use the original backing file, so one workqueue should
be enough for all loop devices.
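
Something along these lines should do it (untested sketch, modeled on the
existing recursion check in loop_set_fd()):

/*
 * Walk down through stacked loop devices and use the bottom-most
 * backing file, so I/O never goes loop over loop.
 */
static struct file *loop_cut_dependency(struct file *f)
{
	while (is_loop_device(f)) {
		struct loop_device *l;

		l = f->f_mapping->host->i_bdev->bd_disk->private_data;
		f = l->lo_backing_file;
	}
	return f;
}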

>
>
>>> I don't see a reason why we should not. Loop over loop might seem nutty,
>>> but
>>> it's not that far out into the realm of nutty things that people end up
>>> doing.
>>
>>
>> Another reason I am still not sure if workqueue is good for loop, though I
>> do really like workqueue for sake of simplicity, :-)
>>
>> - sequential read becomes a bit slow with workqueue, especially for some
>> fast block(such as null_blk)
>>
>> - random read becomes a bit slow too for some fast devices(such as
>> null_blk)
>> in some environment(It is reproduced in my server, but can't in my laptop)
>> even
>> it can improve throughout quite much for common devices(HDD., SSD,..)
>
>
> Thread offloading will always slow down some use cases, like sync(ish) IO.
> Not sure this is a case against kthread vs workqueue, performance and
> behavior should be identical here?

It looks like no sync is involved, because I just tested randread with fio,
and the cause should be the same as below.

>
>
>>  From my investigation, context switch increases almost 50% with
>> workqueue compared with kthread in loop in a quad-core VM. With
>> kthread, requests may be handled as batch in cases which won't be
>> blocked in read()/write()(like null_blk, tmpfs, ...), but it is impossible
>> with
>> workqueue any more.  Also block plug&unplug should have been used
>> with kthread to optimize the case, especially when kernel AIO is applied,
>> still impossible with work queue too.
>
>
> OK, that one is actually a good point, since one need not do per-item
> queueing. We could handle different units, though. And we should have proper
> marking of the last item in a chain of stuff, so we might even be able to
> offload based on that instead of doing single items. It wont help the sync
> case, but for that, workqueue and kthread would be identical.

We may do that by introducing a queue_rq_list callback in blk_mq_ops,
and I will put together a patch today to see if it can help this case.
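
Roughly like the following (purely hypothetical sketch: no queue_rq_list
callback exists in blk_mq_ops today, and loop_handle_rq() is made up; the
point is only that the driver would see a whole batch per call):

static int loop_queue_rq_list(struct blk_mq_hw_ctx *hctx,
			      struct list_head *rq_list)
{
	struct blk_plug plug;
	struct request *rq, *next;

	/* plug around the whole batch instead of one request per call */
	blk_start_plug(&plug);
	list_for_each_entry_safe(rq, next, rq_list, queuelist) {
		list_del_init(&rq->queuelist);
		loop_handle_rq(rq->q->queuedata, rq);
	}
	blk_finish_plug(&plug);
	return 0;
}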

> Or we could just provide a better alternative in blk-mq. Doing workqueues is
> just so damn easy, I'd be reluctant to add a kthread pool instead. It'd be
> much better to augment or fix workqueues to work well for this case as well.



Thanks,

^ permalink raw reply	[flat|nested] 74+ messages in thread

* Re: [PATCH v1 5/9] block: loop: convert to blk-mq
  2014-08-21  2:54                       ` Ming Lei
@ 2014-08-21  2:58                         ` Jens Axboe
  2014-08-21  3:13                           ` Ming Lei
  2014-08-21  3:34                           ` Ming Lei
  0 siblings, 2 replies; 74+ messages in thread
From: Jens Axboe @ 2014-08-21  2:58 UTC (permalink / raw)
  To: Ming Lei
  Cc: Christoph Hellwig, Linux Kernel Mailing List, Andrew Morton,
	Dave Kleikamp, Zach Brown, Benjamin LaHaise, Kent Overstreet,
	open list:AIO <linux-aio@kvack.org>,
	Linux FS Devel <linux-fsdevel@vger.kernel.org>

On 2014-08-20 21:54, Ming Lei wrote:
>>>   From my investigation, context switch increases almost 50% with
>>> workqueue compared with kthread in loop in a quad-core VM. With
>>> kthread, requests may be handled as batch in cases which won't be
>>> blocked in read()/write()(like null_blk, tmpfs, ...), but it is impossible
>>> with
>>> workqueue any more.  Also block plug&unplug should have been used
>>> with kthread to optimize the case, especially when kernel AIO is applied,
>>> still impossible with work queue too.
>>
>>
>> OK, that one is actually a good point, since one need not do per-item
>> queueing. We could handle different units, though. And we should have proper
>> marking of the last item in a chain of stuff, so we might even be able to
>> offload based on that instead of doing single items. It wont help the sync
>> case, but for that, workqueue and kthread would be identical.
>
> We may do that by introducing callback of queue_rq_list in blk_mq_ops,
> and I will figure out one patch today to see if it can help the case.

I don't think we should add to the interface; I prefer keeping it clean 
like it is right now. At least not if we can get around it. My point is 
that the driver already knows when the chain is complete, when REQ_LAST 
is set. So before that event triggers, it need not kick off IO, or at 
least it could do it in batches before that. That may not be fully 
reliable in case of queueing errors, but if REQ_LAST or 'error return' 
is used as the way to kick off pending IO, then that should be good 
enough. I haven't audited this in a while, but at least that is the 
intent of REQ_LAST.
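
Driver side, that would look roughly like this (untested sketch;
loop_add_to_batch() and loop_kick_batch() are made up, and the flag check
just follows the REQ_LAST naming above):

static int loop_queue_rq(struct blk_mq_hw_ctx *hctx, struct request *rq)
{
	struct loop_device *lo = rq->q->queuedata;

	loop_add_to_batch(lo, rq);

	/* only kick the accumulated batch once the chain is complete */
	if (rq->cmd_flags & REQ_LAST)
		loop_kick_batch(lo);

	return BLK_MQ_RQ_QUEUE_OK;
}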

-- 
Jens Axboe

^ permalink raw reply	[flat|nested] 74+ messages in thread

* Re: [PATCH v1 5/9] block: loop: convert to blk-mq
  2014-08-21  2:58                         ` Jens Axboe
@ 2014-08-21  3:13                           ` Ming Lei
  2014-08-21  3:15                             ` Ming Lei
  2014-08-21  3:16                             ` Jens Axboe
  2014-08-21  3:34                           ` Ming Lei
  1 sibling, 2 replies; 74+ messages in thread
From: Ming Lei @ 2014-08-21  3:13 UTC (permalink / raw)
  To: Jens Axboe
  Cc: Christoph Hellwig, Linux Kernel Mailing List, Andrew Morton,
	Dave Kleikamp, Zach Brown, Benjamin LaHaise, Kent Overstreet,
	open list:AIO <linux-aio@kvack.org>,
	Linux FS Devel <linux-fsdevel@vger.kernel.org>,
	Dave Chinner <david@fromorbit.com>,
	Tejun Heo

On Thu, Aug 21, 2014 at 10:58 AM, Jens Axboe <axboe@kernel.dk> wrote:
> On 2014-08-20 21:54, Ming Lei wrote:
>>>>
>>>>   From my investigation, context switch increases almost 50% with
>>>> workqueue compared with kthread in loop in a quad-core VM. With
>>>> kthread, requests may be handled as batch in cases which won't be
>>>> blocked in read()/write()(like null_blk, tmpfs, ...), but it is
>>>> impossible
>>>> with
>>>> workqueue any more.  Also block plug&unplug should have been used
>>>> with kthread to optimize the case, especially when kernel AIO is
>>>> applied,
>>>> still impossible with work queue too.
>>>
>>>
>>>
>>> OK, that one is actually a good point, since one need not do per-item
>>> queueing. We could handle different units, though. And we should have
>>> proper
>>> marking of the last item in a chain of stuff, so we might even be able to
>>> offload based on that instead of doing single items. It wont help the
>>> sync
>>> case, but for that, workqueue and kthread would be identical.
>>
>>
>> We may do that by introducing callback of queue_rq_list in blk_mq_ops,
>> and I will figure out one patch today to see if it can help the case.
>
>
> I don't think we should add to the interface, I prefer keeping it clean like
> it is right now. At least not if we can get around it. My point is that the
> driver already knows when the chain is complete, when REQ_LAST is set. So
> before that event triggers, it need not kick off IO, or at least i could do
> it in batches before that. That may not be fully reliable in case of
> queueing errors, but if REQ_LAST or 'error return' is used as the way to
> kick off pending IO, then that should be good enough. Haven't audited this
> in a while, but at least that is the intent of REQ_LAST.

Yes, I thought of that too, but the driver needs another context for handling
it, either a workqueue or a kthread, which may render the introduced
per-device workqueue useless.

thanks,

^ permalink raw reply	[flat|nested] 74+ messages in thread

* Re: [PATCH v1 5/9] block: loop: convert to blk-mq
  2014-08-21  3:13                           ` Ming Lei
@ 2014-08-21  3:15                             ` Ming Lei
  2014-08-21  3:16                             ` Jens Axboe
  1 sibling, 0 replies; 74+ messages in thread
From: Ming Lei @ 2014-08-21  3:15 UTC (permalink / raw)
  To: Jens Axboe
  Cc: Christoph Hellwig, Linux Kernel Mailing List, Andrew Morton,
	Dave Kleikamp, Zach Brown, Benjamin LaHaise, Kent Overstreet,
	open list:AIO <linux-aio@kvack.org>,
	Linux FS Devel <linux-fsdevel@vger.kernel.org>,
	Dave Chinner <david@fromorbit.com>,
	Tejun Heo

On Thu, Aug 21, 2014 at 11:13 AM, Ming Lei <ming.lei@canonical.com> wrote:
> On Thu, Aug 21, 2014 at 10:58 AM, Jens Axboe <axboe@kernel.dk> wrote:
>> On 2014-08-20 21:54, Ming Lei wrote:
>>>>>
>>>>>   From my investigation, context switch increases almost 50% with
>>>>> workqueue compared with kthread in loop in a quad-core VM. With
>>>>> kthread, requests may be handled as batch in cases which won't be
>>>>> blocked in read()/write()(like null_blk, tmpfs, ...), but it is
>>>>> impossible
>>>>> with
>>>>> workqueue any more.  Also block plug&unplug should have been used
>>>>> with kthread to optimize the case, especially when kernel AIO is
>>>>> applied,
>>>>> still impossible with work queue too.
>>>>
>>>>
>>>>
>>>> OK, that one is actually a good point, since one need not do per-item
>>>> queueing. We could handle different units, though. And we should have
>>>> proper
>>>> marking of the last item in a chain of stuff, so we might even be able to
>>>> offload based on that instead of doing single items. It wont help the
>>>> sync
>>>> case, but for that, workqueue and kthread would be identical.
>>>
>>>
>>> We may do that by introducing callback of queue_rq_list in blk_mq_ops,
>>> and I will figure out one patch today to see if it can help the case.
>>
>>
>> I don't think we should add to the interface, I prefer keeping it clean like
>> it is right now. At least not if we can get around it. My point is that the
>> driver already knows when the chain is complete, when REQ_LAST is set. So
>> before that event triggers, it need not kick off IO, or at least i could do
>> it in batches before that. That may not be fully reliable in case of
>> queueing errors, but if REQ_LAST or 'error return' is used as the way to
>> kick off pending IO, then that should be good enough. Haven't audited this
>> in a while, but at least that is the intent of REQ_LAST.
>
> Yes, I thought of too, but driver need another context for handling that,
> either workqueue or kthread, which may cause the introduced per-device
> workqueue useless.

Hmmm, a list should be enough, will do that.

Thanks,

^ permalink raw reply	[flat|nested] 74+ messages in thread

* Re: [PATCH v1 5/9] block: loop: convert to blk-mq
  2014-08-21  3:13                           ` Ming Lei
  2014-08-21  3:15                             ` Ming Lei
@ 2014-08-21  3:16                             ` Jens Axboe
  1 sibling, 0 replies; 74+ messages in thread
From: Jens Axboe @ 2014-08-21  3:16 UTC (permalink / raw)
  To: Ming Lei
  Cc: Christoph Hellwig, Linux Kernel Mailing List, Andrew Morton,
	Dave Kleikamp, Zach Brown, Benjamin LaHaise, Kent Overstreet,
	open list:AIO <linux-aio@kvack.org>,
	Linux FS Devel <linux-fsdevel@vger.kernel.org>

On 2014-08-20 22:13, Ming Lei wrote:
> On Thu, Aug 21, 2014 at 10:58 AM, Jens Axboe <axboe@kernel.dk> wrote:
>> On 2014-08-20 21:54, Ming Lei wrote:
>>>>>
>>>>>    From my investigation, context switch increases almost 50% with
>>>>> workqueue compared with kthread in loop in a quad-core VM. With
>>>>> kthread, requests may be handled as batch in cases which won't be
>>>>> blocked in read()/write()(like null_blk, tmpfs, ...), but it is
>>>>> impossible
>>>>> with
>>>>> workqueue any more.  Also block plug&unplug should have been used
>>>>> with kthread to optimize the case, especially when kernel AIO is
>>>>> applied,
>>>>> still impossible with work queue too.
>>>>
>>>>
>>>>
>>>> OK, that one is actually a good point, since one need not do per-item
>>>> queueing. We could handle different units, though. And we should have
>>>> proper
>>>> marking of the last item in a chain of stuff, so we might even be able to
>>>> offload based on that instead of doing single items. It wont help the
>>>> sync
>>>> case, but for that, workqueue and kthread would be identical.
>>>
>>>
>>> We may do that by introducing callback of queue_rq_list in blk_mq_ops,
>>> and I will figure out one patch today to see if it can help the case.
>>
>>
>> I don't think we should add to the interface, I prefer keeping it clean like
>> it is right now. At least not if we can get around it. My point is that the
>> driver already knows when the chain is complete, when REQ_LAST is set. So
>> before that event triggers, it need not kick off IO, or at least i could do
>> it in batches before that. That may not be fully reliable in case of
>> queueing errors, but if REQ_LAST or 'error return' is used as the way to
>> kick off pending IO, then that should be good enough. Haven't audited this
>> in a while, but at least that is the intent of REQ_LAST.
>
> Yes, I thought of too, but driver need another context for handling that,
> either workqueue or kthread, which may cause the introduced per-device
> workqueue useless.

Yeah, if we are going to go this route instead, then it may make more 
sense to have a ->commit() hook to complement ->queue_rq(). ->queue_rq() 
would then continue to run like it is now, and ->commit() would offload 
the pieces to a workqueue. Or we'd just do this on REQ_LAST. Same sort 
of thing, just handled a bit differently.
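
In sketch form (hypothetical: no ->commit() hook exists in blk_mq_ops, and
loop_add_to_batch() and lo->batch_work are made up):

static int loop_queue_rq(struct blk_mq_hw_ctx *hctx, struct request *rq)
{
	/* cheap: just accumulate, no offload yet */
	loop_add_to_batch(rq->q->queuedata, rq);
	return BLK_MQ_RQ_QUEUE_OK;
}

static void loop_commit(struct blk_mq_hw_ctx *hctx)
{
	struct loop_device *lo = hctx->queue->queuedata;

	/* one work item per batch instead of one per request */
	queue_work(lo->wq, &lo->batch_work);
}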

-- 
Jens Axboe

^ permalink raw reply	[flat|nested] 74+ messages in thread

* Re: [PATCH v1 5/9] block: loop: convert to blk-mq
  2014-08-21  2:58                         ` Jens Axboe
  2014-08-21  3:13                           ` Ming Lei
@ 2014-08-21  3:34                           ` Ming Lei
  1 sibling, 0 replies; 74+ messages in thread
From: Ming Lei @ 2014-08-21  3:34 UTC (permalink / raw)
  To: Jens Axboe
  Cc: Christoph Hellwig, Linux Kernel Mailing List, Andrew Morton,
	Dave Kleikamp, Zach Brown, Benjamin LaHaise, Kent Overstreet,
	open list:AIO <linux-aio@kvack.org>,
	Linux FS Devel <linux-fsdevel@vger.kernel.org>,
	Dave Chinner <david@fromorbit.com>,
	Tejun Heo

On Thu, Aug 21, 2014 at 10:58 AM, Jens Axboe <axboe@kernel.dk> wrote:
> On 2014-08-20 21:54, Ming Lei wrote:
>>>>
>>>>   From my investigation, context switch increases almost 50% with
>>>> workqueue compared with kthread in loop in a quad-core VM. With
>>>> kthread, requests may be handled as batch in cases which won't be
>>>> blocked in read()/write()(like null_blk, tmpfs, ...), but it is
>>>> impossible
>>>> with
>>>> workqueue any more.  Also block plug&unplug should have been used
>>>> with kthread to optimize the case, especially when kernel AIO is
>>>> applied,
>>>> still impossible with work queue too.
>>>
>>>
>>>
>>> OK, that one is actually a good point, since one need not do per-item
>>> queueing. We could handle different units, though. And we should have
>>> proper
>>> marking of the last item in a chain of stuff, so we might even be able to
>>> offload based on that instead of doing single items. It wont help the
>>> sync
>>> case, but for that, workqueue and kthread would be identical.
>>
>>
>> We may do that by introducing callback of queue_rq_list in blk_mq_ops,
>> and I will figure out one patch today to see if it can help the case.
>
>
> I don't think we should add to the interface, I prefer keeping it clean like
> it is right now. At least not if we can get around it. My point is that the
> driver already knows when the chain is complete, when REQ_LAST is set. So
> before that event triggers, it need not kick off IO, or at least i could do
> it in batches before that. That may not be fully reliable in case of
> queueing errors, but if REQ_LAST or 'error return' is used as the way to
> kick off pending IO, then that should be good enough. Haven't audited this
> in a while, but at least that is the intent of REQ_LAST.

Another point is that calling queue_work() N times (once per request) may
cost more than calling queue_work() once for N requests, since the context
may still switch back and forth while executing queue_work().

Anyway, I need to run the test first to see whether handling requests as a
batch can bring back the sequential-read throughput.


Thanks,

^ permalink raw reply	[flat|nested] 74+ messages in thread

* Re: [PATCH v1 5/9] block: loop: convert to blk-mq
  2014-08-19 20:50                 ` Jens Axboe
  2014-08-20  1:23                   ` Ming Lei
@ 2014-08-21  5:44                   ` Ming Lei
  2014-08-27 16:08                       ` Maxim Patlasov
  1 sibling, 1 reply; 74+ messages in thread
From: Ming Lei @ 2014-08-21  5:44 UTC (permalink / raw)
  To: Jens Axboe
  Cc: Christoph Hellwig, Linux Kernel Mailing List, Andrew Morton,
	Dave Kleikamp, Zach Brown, Benjamin LaHaise, Kent Overstreet,
	open list:AIO <linux-aio@kvack.org>,
	Linux FS Devel <linux-fsdevel@vger.kernel.org>,
	Dave Chinner

On Wed, Aug 20, 2014 at 4:50 AM, Jens Axboe <axboe@kernel.dk> wrote:

>
>
> Reworked a bit more:
>
> http://git.kernel.dk/?p=linux-block.git;a=commit;h=a323185a761b9a54dc340d383695b4205ea258b6

One big problem with the commit is that it is basically a serialized
workqueue because of the single &hctx->run_work, so a per-request work_struct
has to be used for a concurrent implementation.  So it looks like the
approach isn't flexible enough compared with doing that in the driver; any
idea about how to fix that?
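
A concurrent implementation would have to look roughly like this (sketch;
loop_wq is an assumed driver-wide workqueue), embedding the work item in
the per-request pdu instead of relying on the single hctx->run_work:

struct loop_cmd {
	struct work_struct work;	/* one work item per request */
	struct request *rq;
};

static int loop_queue_rq(struct blk_mq_hw_ctx *hctx, struct request *rq)
{
	struct loop_cmd *cmd = blk_mq_rq_to_pdu(rq);

	/* each request gets its own work item, so they run concurrently */
	queue_work(loop_wq, &cmd->work);
	return BLK_MQ_RQ_QUEUE_OK;
}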


Thanks

^ permalink raw reply	[flat|nested] 74+ messages in thread

* Re: [PATCH v1 5/9] block: loop: convert to blk-mq
  2014-08-21  5:44                   ` Ming Lei
@ 2014-08-27 16:08                       ` Maxim Patlasov
  0 siblings, 0 replies; 74+ messages in thread
From: Maxim Patlasov @ 2014-08-27 16:08 UTC (permalink / raw)
  To: Ming Lei, Jens Axboe
  Cc: Christoph Hellwig, Linux Kernel Mailing List, Andrew Morton,
	Dave Kleikamp, Zach Brown, Benjamin LaHaise, Kent Overstreet,
	AIO, Linux FS Devel, Dave Chinner

On 08/21/2014 09:44 AM, Ming Lei wrote:
> On Wed, Aug 20, 2014 at 4:50 AM, Jens Axboe <axboe@kernel.dk> wrote:
>
>>
>> Reworked a bit more:
>>
>> http://git.kernel.dk/?p=linux-block.git;a=commit;h=a323185a761b9a54dc340d383695b4205ea258b6
> One big problem of the commit is that it is basically a serialized workqueue
> because of single &hctx->run_work, and per-req work_struct has to be
> used for concurrent implementation.  So looks the approach isn't flexible
> enough compared with doing that in driver, or any idea about how to fix
> that?
>

I'm interested in what the price of handling requests in a separate 
thread is in general. I used the following fio script:

     [global]
     direct=1
     bsrange=512-512
     timeout=10
     numjobs=1
     ioengine=sync

     filename=/dev/loop0 # or /dev/nullb0

     [f1]
     rw=randwrite

to compare the performance of:

1) /dev/loop0 of 3.17.0-rc1 with Ming's patches applied -- 11K iops
2) the same as above, but call loop_queue_work() directly from 
loop_queue_rq() -- 270K iops
3) /dev/nullb0 of 3.17.0-rc1 -- 380K iops

Taking into account so big a difference (11K vs. 270K), would it be 
worthwhile to implement a purely non-blocking version of aio_kernel_submit() 
that returns an error if blocking is needed? Then the loop driver (or any 
other in-kernel user) might first try that non-blocking submit as a fast 
path and, only if it fails, fall back to queueing.
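
In sketch form, that fast path could look like this (hypothetical:
aio_kernel_submit_nonblock() does not exist, and the cmd/lo fields are
made up):

static void loop_submit(struct loop_device *lo, struct loop_cmd *cmd)
{
	/* try to submit inline without blocking ... */
	if (aio_kernel_submit_nonblock(cmd->iocb) != -EAGAIN)
		return;		/* done, no extra context switch */

	/* ... and fall back to the offload path only if it would block */
	queue_work(lo->wq, &cmd->work);
}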

Thanks,
Maxim

^ permalink raw reply	[flat|nested] 74+ messages in thread

* Re: [PATCH v1 5/9] block: loop: convert to blk-mq
  2014-08-27 16:08                       ` Maxim Patlasov
@ 2014-08-27 16:29                         ` Benjamin LaHaise
  -1 siblings, 0 replies; 74+ messages in thread
From: Benjamin LaHaise @ 2014-08-27 16:29 UTC (permalink / raw)
  To: Maxim Patlasov
  Cc: Ming Lei, Jens Axboe, Christoph Hellwig,
	Linux Kernel Mailing List, Andrew Morton, Dave Kleikamp,
	Zach Brown, Kent Overstreet, AIO, Linux FS Devel, Dave Chinner,

On Wed, Aug 27, 2014 at 08:08:59PM +0400, Maxim Patlasov wrote:
...
> 1) /dev/loop0 of 3.17.0-rc1 with Ming's patches applied -- 11K iops
> 2) the same as above, but call loop_queue_work() directly from 
> loop_queue_rq() -- 270K iops
> 3) /dev/nullb0 of 3.17.0-rc1 -- 380K iops
> 
> Taking into account so big difference (11K vs. 270K), would it be worthy 
> to implement pure non-blocking version of aio_kernel_submit() returning 
> error if blocking needed? Then loop driver (or any other in-kernel user) 
> might firstly try that non-blocking submit as fast-path, and, only if 
> it's failed, fall back to queueing.

What filesystem is the backing file for loop0 on?  O_DIRECT access as 
used by Ming's patches should be non-blocking, and if it isn't, that's 
something to fix.

		-ben
-- 
"Thought is the essence of where you are now."

^ permalink raw reply	[flat|nested] 74+ messages in thread

* Re: [PATCH v1 5/9] block: loop: convert to blk-mq
  2014-08-27 16:29                         ` Benjamin LaHaise
@ 2014-08-27 17:19                           ` Maxim Patlasov
  -1 siblings, 0 replies; 74+ messages in thread
From: Maxim Patlasov @ 2014-08-27 17:19 UTC (permalink / raw)
  To: Benjamin LaHaise
  Cc: Ming Lei, Jens Axboe, Christoph Hellwig,
	Linux Kernel Mailing List, Andrew Morton, Dave Kleikamp,
	Zach Brown, Kent Overstreet, AIO, Linux FS Devel, Dave Chinner

On 08/27/2014 08:29 PM, Benjamin LaHaise wrote:
> On Wed, Aug 27, 2014 at 08:08:59PM +0400, Maxim Patlasov wrote:
> ...
>> 1) /dev/loop0 of 3.17.0-rc1 with Ming's patches applied -- 11K iops
>> 2) the same as above, but call loop_queue_work() directly from
>> loop_queue_rq() -- 270K iops
>> 3) /dev/nullb0 of 3.17.0-rc1 -- 380K iops
>>
>> Taking into account so big difference (11K vs. 270K), would it be worthy
>> to implement pure non-blocking version of aio_kernel_submit() returning
>> error if blocking needed? Then loop driver (or any other in-kernel user)
>> might firstly try that non-blocking submit as fast-path, and, only if
>> it's failed, fall back to queueing.
> What filesystem is the backing file for loop0 on?  O_DIRECT access as
> Ming's patches use should be non-blocking, and if not, that's something
> to fix.

I used loop0 directly on top of null_blk driver (because my goal was to 
measure the overhead of processing requests in a separate thread).

In the case of real-life filesystems, e.g. ext4, aio_kernel_submit() may 
easily block on something like bh_submit_read(), when the fs reads file 
metadata to map a position in the file to an offset on the block device.

Thanks,
Maxim

^ permalink raw reply	[flat|nested] 74+ messages in thread

* Re: [PATCH v1 5/9] block: loop: convert to blk-mq
  2014-08-27 17:19                           ` Maxim Patlasov
@ 2014-08-27 17:56                             ` Zach Brown
  -1 siblings, 0 replies; 74+ messages in thread
From: Zach Brown @ 2014-08-27 17:56 UTC (permalink / raw)
  To: Maxim Patlasov
  Cc: Benjamin LaHaise, Ming Lei, Jens Axboe, Christoph Hellwig,
	Linux Kernel Mailing List, Andrew Morton, Dave Kleikamp,
	Kent Overstreet, AIO, Linux FS Devel, Dave Chinner

On Wed, Aug 27, 2014 at 09:19:36PM +0400, Maxim Patlasov wrote:
> On 08/27/2014 08:29 PM, Benjamin LaHaise wrote:
> >On Wed, Aug 27, 2014 at 08:08:59PM +0400, Maxim Patlasov wrote:
> >...
> >>1) /dev/loop0 of 3.17.0-rc1 with Ming's patches applied -- 11K iops
> >>2) the same as above, but call loop_queue_work() directly from
> >>loop_queue_rq() -- 270K iops
> >>3) /dev/nullb0 of 3.17.0-rc1 -- 380K iops
> >>
> >>Taking into account so big difference (11K vs. 270K), would it be worthy
> >>to implement pure non-blocking version of aio_kernel_submit() returning
> >>error if blocking needed? Then loop driver (or any other in-kernel user)
> >>might firstly try that non-blocking submit as fast-path, and, only if
> >>it's failed, fall back to queueing.
> >What filesystem is the backing file for loop0 on?  O_DIRECT access as
> >Ming's patches use should be non-blocking, and if not, that's something
> >to fix.
> 
> I used loop0 directly on top of null_blk driver (because my goal was to
> measure the overhead of processing requests in a separate thread).

The relative overhead while doing nothing else.  While zooming way down 
into micro-benchmarks is fun and all, testing on an fs on brd might be 
more representative and so more compelling.

(And you might start to stumble into the terrifying territory of
stacking fs write paths under fs write paths.. turn on lockdep! :))

> In case of real-life filesystems, e.g. ext4, aio_kernel_submit() may easily
> block on something like bh_submit_read(), when fs reads file metadata to
> calculate the offset on block device by position in the file.

Yeah, there are a lot of rare potential blocking points throughout the
fs aio submission paths.   In practice (aio+dio+block fs), I think
submission tends to block waiting for congested block queues most often.

- z

^ permalink raw reply	[flat|nested] 74+ messages in thread

* Re: [PATCH v1 5/9] block: loop: convert to blk-mq
  2014-08-27 16:08                       ` Maxim Patlasov
@ 2014-08-28  2:06                         ` Ming Lei
  -1 siblings, 0 replies; 74+ messages in thread
From: Ming Lei @ 2014-08-28  2:06 UTC (permalink / raw)
  To: Maxim Patlasov
  Cc: Jens Axboe, Christoph Hellwig, Linux Kernel Mailing List,
	Andrew Morton, Dave Kleikamp, Zach Brown, Benjamin LaHaise,
	Kent Overstreet, AIO, Linux FS Devel, Dave Chinner

On 8/28/14, Maxim Patlasov <mpatlasov@parallels.com> wrote:
> On 08/21/2014 09:44 AM, Ming Lei wrote:
>> On Wed, Aug 20, 2014 at 4:50 AM, Jens Axboe <axboe@kernel.dk> wrote:
>>
>>>
>>> Reworked a bit more:
>>>
>>> http://git.kernel.dk/?p=linux-block.git;a=commit;h=a323185a761b9a54dc340d383695b4205ea258b6
>> One big problem of the commit is that it is basically a serialized
>> workqueue
>> because of single &hctx->run_work, and per-req work_struct has to be
>> used for concurrent implementation.  So looks the approach isn't flexible
>> enough compared with doing that in driver, or any idea about how to fix
>> that?
>>
>
> I'm interested what's the price of handling requests in a separate
> thread at large. I used the following fio script:
>
>      [global]
>      direct=1
>      bsrange=512-512
>      timeout=10
>      numjobs=1
>      ioengine=sync
>
>      filename=/dev/loop0 # or /dev/nullb0
>
>      [f1]
>      rw=randwrite
>
> to compare the performance of:
>
> 1) /dev/loop0 of 3.17.0-rc1 with Ming's patches applied -- 11K iops

If you enable BLK_MQ_F_WQ_CONTEXT, it isn't surprising to see this
result, since blk-mq implements a serialized workqueue.

> 2) the same as above, but call loop_queue_work() directly from
> loop_queue_rq() -- 270K iops
> 3) /dev/nullb0 of 3.17.0-rc1 -- 380K iops

In my recent investigation and discussion with Jens, using workqueue
may introduce some regression for cases like loop over null_blk or tmpfs.

And 270K vs. 380K is quite similar to my result, where it was observed that
context switches increased by more than 50% after introducing workqueue.

I will post a V3 which will keep the previous kthread, together with blk-mq
and kernel AIO; that should make full use of blk-mq and kernel AIO, and
won't introduce regressions for cases like the above.

> Taking into account so big difference (11K vs. 270K), would it be worthy
> to implement pure non-blocking version of aio_kernel_submit() returning
> error if blocking needed? Then loop driver (or any other in-kernel user)

The kernel AIO submit path is very similar to the user-space implementation,
except for the block plug & unplug usage in the user-space AIO submit path.

If it blocks in aio_kernel_submit(), you should observe a similar thing
with io_submit() too.

Thanks,
-- 
Ming Lei

^ permalink raw reply	[flat|nested] 74+ messages in thread

* Re: [PATCH v1 5/9] block: loop: convert to blk-mq
  2014-08-27 17:56                             ` Zach Brown
@ 2014-08-28  2:10                               ` Ming Lei
  -1 siblings, 0 replies; 74+ messages in thread
From: Ming Lei @ 2014-08-28  2:10 UTC (permalink / raw)
  To: Zach Brown
  Cc: Maxim Patlasov, Benjamin LaHaise, Jens Axboe, Christoph Hellwig,
	Linux Kernel Mailing List, Andrew Morton, Dave Kleikamp,
	Kent Overstreet, AIO, Linux FS Devel, Dave Chinner

On 8/28/14, Zach Brown <zab@zabbo.net> wrote:
> On Wed, Aug 27, 2014 at 09:19:36PM +0400, Maxim Patlasov wrote:
>> On 08/27/2014 08:29 PM, Benjamin LaHaise wrote:
>> >On Wed, Aug 27, 2014 at 08:08:59PM +0400, Maxim Patlasov wrote:
>> >...
>> >>1) /dev/loop0 of 3.17.0-rc1 with Ming's patches applied -- 11K iops
>> >>2) the same as above, but call loop_queue_work() directly from
>> >>loop_queue_rq() -- 270K iops
>> >>3) /dev/nullb0 of 3.17.0-rc1 -- 380K iops
>> >>
>> >>Taking into account so big difference (11K vs. 270K), would it be
>> >> worthy
>> >>to implement pure non-blocking version of aio_kernel_submit() returning
>> >>error if blocking needed? Then loop driver (or any other in-kernel
>> >> user)
>> >>might firstly try that non-blocking submit as fast-path, and, only if
>> >>it's failed, fall back to queueing.
>> >What filesystem is the backing file for loop0 on?  O_DIRECT access as
>> >Ming's patches use should be non-blocking, and if not, that's something
>> >to fix.
>>
>> I used loop0 directly on top of null_blk driver (because my goal was to
>> measure the overhead of processing requests in a separate thread).
>
> The relative overhead while doing nothing else.  While zooming way down
> in to micro benchmarks is fun and all, testing on an fs on brd might be
> more representitive and so more compelling.
>
> (And you might start to stumble into the terrifying territory of
> stacking fs write paths under fs write paths.. turn on lockdep! :))
>
>> In case of real-life filesystems, e.g. ext4, aio_kernel_submit() may
>> easily
>> block on something like bh_submit_read(), when fs reads file metadata to
>> calculate the offset on block device by position in the file.
>
> Yeah, there are a lot of rare potential blocking points throughout the
> fs aio submission paths.   In practice (aio+dio+block fs), I think
> submission tends to block waiting for congested block queues most often.

In the case of null_blk, it shouldn't have blocked here, since the default
tag size is enough for the single-job test if Maxim didn't change the
default parameters of null_blk.


Thanks,
-- 
Ming Lei

^ permalink raw reply	[flat|nested] 74+ messages in thread

* Re: [PATCH v1 5/9] block: loop: convert to blk-mq
  2014-08-28  2:06                         ` Ming Lei
@ 2014-08-29 11:14                           ` Maxim Patlasov
  -1 siblings, 0 replies; 74+ messages in thread
From: Maxim Patlasov @ 2014-08-29 11:14 UTC (permalink / raw)
  To: Ming Lei
  Cc: Jens Axboe, Christoph Hellwig, Linux Kernel Mailing List,
	Andrew Morton, Dave Kleikamp, Zach Brown, Benjamin LaHaise,
	Kent Overstreet, AIO, Linux FS Devel, Dave Chinner

On 08/28/2014 06:06 AM, Ming Lei wrote:
> On 8/28/14, Maxim Patlasov <mpatlasov@parallels.com> wrote:
>> On 08/21/2014 09:44 AM, Ming Lei wrote:
>>> On Wed, Aug 20, 2014 at 4:50 AM, Jens Axboe <axboe@kernel.dk> wrote:
>>>
>>>> Reworked a bit more:
>>>>
>>>> http://git.kernel.dk/?p=linux-block.git;a=commit;h=a323185a761b9a54dc340d383695b4205ea258b6
>>> One big problem of the commit is that it is basically a serialized
>>> workqueue
>>> because of single &hctx->run_work, and per-req work_struct has to be
>>> used for concurrent implementation.  So looks the approach isn't flexible
>>> enough compared with doing that in driver, or any idea about how to fix
>>> that?
>>>
>> I'm interested what's the price of handling requests in a separate
>> thread at large. I used the following fio script:
>>
>>       [global]
>>       direct=1
>>       bsrange=512-512
>>       timeout=10
>>       numjobs=1
>>       ioengine=sync
>>
>>       filename=/dev/loop0 # or /dev/nullb0
>>
>>       [f1]
>>       rw=randwrite
>>
>> to compare the performance of:
>>
>> 1) /dev/loop0 of 3.17.0-rc1 with Ming's patches applied -- 11K iops
> If you enable BLK_MQ_F_WQ_CONTEXT, it isn't strange to see this
> result since blk-mq implements a serialized workqueue.

BLK_MQ_F_WQ_CONTEXT is not in 3.17.0-rc1, so I couldn't enable it.

>
>> 2) the same as above, but call loop_queue_work() directly from
>> loop_queue_rq() -- 270K iops
>> 3) /dev/nullb0 of 3.17.0-rc1 -- 380K iops
> In my recent investigation and discussion with Jens, using workqueue
> may introduce some regression for cases like loop over null_blk, tmpfs.
>
> And 270K vs. 380K is a bit similar with my result, and it was observed that
> context switch is increased by more than 50% with introducing workqueue.

The figures are similar, but the comparison is not. Both 270K and 380K 
refer to configurations where no extra context switch is involved.

>
> I will post V3 which will use previous kthread, with blk-mq & kernel aio, which
> should make full use of blk-mq and kernel aio, and won't introduce regression
> for cases like above.

That would be great!

>
>> Taking into account so big difference (11K vs. 270K), would it be worthy
>> to implement pure non-blocking version of aio_kernel_submit() returning
>> error if blocking needed? Then loop driver (or any other in-kernel user)
> The kernel aio submit is very similar with user space's implementation,
> except for block plug&unplug usage in user space aio submit path.
>
> If it is blocked in aio_kernel_submit(), you should observe similar thing
> with io_submit() too.

Yes, I agree. My point was that there is room for optimization, as my 
experiments demonstrate. The question is whether it's worth complicating 
kernel AIO (and fs-specific code too) for the sake of that optimization.

In fact, in a simple case like a block fs on top of a loopback device on 
top of a file on another block fs, what kernel AIO does for the loopback 
driver is a subtle way of converting incoming bio-s into outgoing bio-s. 
If you know where the image file is placed (e.g. via fiemap), such a 
conversion may be done with zero overhead, and anything that makes the 
overhead noticeable is suspicious. And it is easy to imagine other use 
cases where that extra context switch is avoidable.
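
For illustration only, such a remap could look like the sketch below,
assuming the image file's extents were resolved up front (e.g. via fiemap)
into a lookup table; every name here is made up:

static void loop_remap_bio(struct loop_device *lo, struct bio *bio)
{
	sector_t pos = bio->bi_iter.bi_sector;
	struct loop_extent *ext = loop_extent_lookup(lo, pos);

	/*
	 * Redirect the bio at the underlying device: no separate thread,
	 * no extra context switch, just sector arithmetic.
	 */
	bio->bi_bdev = lo->lo_backing_bdev;
	bio->bi_iter.bi_sector = ext->disk_start + (pos - ext->file_start);
	generic_make_request(bio);
}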

Thanks,
Maxim

^ permalink raw reply	[flat|nested] 74+ messages in thread

end of thread, other threads:[~2014-08-29 11:14 UTC | newest]

Thread overview: 74+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2014-08-14 15:50 [PATCH v1 0/9] block & aio: kernel aio and loop mq conversion Ming Lei
2014-08-14 15:50 ` Ming Lei
2014-08-14 15:50 ` [PATCH v1 1/9] aio: add aio_kernel_() interface Ming Lei
2014-08-14 15:50   ` Ming Lei
2014-08-14 18:07   ` Zach Brown
2014-08-14 18:07     ` Zach Brown
2014-08-15 13:20     ` Ming Lei
2014-08-15 13:20       ` Ming Lei
2014-08-14 15:50 ` [PATCH v1 2/9] fd/direct-io: introduce should_dirty for kernel aio Ming Lei
2014-08-14 15:50   ` Ming Lei
2014-08-14 15:50 ` [PATCH v1 3/9] blk-mq: export blk_mq_freeze_queue and blk_mq_unfreeze_queue Ming Lei
2014-08-14 15:50   ` Ming Lei
2014-08-14 15:50 ` [PATCH v1 4/9] blk-mq: introduce init_flush_rq_fn callback in 'blk_mq_ops' Ming Lei
2014-08-14 15:50   ` Ming Lei
2014-08-15 16:19   ` Jens Axboe
2014-08-15 16:19     ` Jens Axboe
2014-08-16  7:49     ` Ming Lei
2014-08-16  7:49       ` Ming Lei
2014-08-17 18:39       ` Jens Axboe
2014-08-17 18:39         ` Jens Axboe
2014-08-14 15:50 ` [PATCH v1 5/9] block: loop: convert to blk-mq Ming Lei
2014-08-14 15:50   ` Ming Lei
2014-08-15 16:31   ` Christoph Hellwig
2014-08-15 16:31     ` Christoph Hellwig
2014-08-15 16:36     ` Jens Axboe
2014-08-15 16:46       ` Jens Axboe
2014-08-16  8:06         ` Ming Lei
2014-08-16  8:06           ` Ming Lei
2014-08-17 17:48           ` Jens Axboe
2014-08-17 17:48             ` Jens Axboe
2014-08-18  1:22             ` Ming Lei
2014-08-18  1:22               ` Ming Lei
2014-08-18 11:53               ` Ming Lei
2014-08-18 11:53                 ` Ming Lei
2014-08-19 20:50                 ` Jens Axboe
2014-08-20  1:23                   ` Ming Lei
2014-08-20 16:09                     ` Jens Axboe
2014-08-21  2:54                       ` Ming Lei
2014-08-21  2:58                         ` Jens Axboe
2014-08-21  3:13                           ` Ming Lei
2014-08-21  3:15                             ` Ming Lei
2014-08-21  3:16                             ` Jens Axboe
2014-08-21  3:34                           ` Ming Lei
2014-08-21  5:44                   ` Ming Lei
2014-08-27 16:08                     ` Maxim Patlasov
2014-08-27 16:08                       ` Maxim Patlasov
2014-08-27 16:29                       ` Benjamin LaHaise
2014-08-27 16:29                         ` Benjamin LaHaise
2014-08-27 17:19                         ` Maxim Patlasov
2014-08-27 17:19                           ` Maxim Patlasov
2014-08-27 17:56                           ` Zach Brown
2014-08-27 17:56                             ` Zach Brown
2014-08-28  2:10                             ` Ming Lei
2014-08-28  2:10                               ` Ming Lei
2014-08-28  2:06                       ` Ming Lei
2014-08-28  2:06                         ` Ming Lei
2014-08-29 11:14                         ` Maxim Patlasov
2014-08-29 11:14                           ` Maxim Patlasov
2014-08-14 15:50 ` [PATCH v1 6/9] block: loop: say goodby to bio Ming Lei
2014-08-14 15:50   ` Ming Lei
2014-08-14 15:50 ` [PATCH v1 7/9] block: loop: introduce lo_discard() and lo_req_flush() Ming Lei
2014-08-14 15:50   ` Ming Lei
2014-08-14 15:50 ` [PATCH v1 8/9] block: loop: don't handle REQ_FUA explicitly Ming Lei
2014-08-14 15:50   ` Ming Lei
2014-08-14 15:50 ` [PATCH v1 9/9] block: loop: support to submit I/O via kernel aio based Ming Lei
2014-08-14 15:50   ` Ming Lei
2014-08-14 16:53 ` [PATCH v1 0/9] block & aio: kernel aio and loop mq conversion Jens Axboe
2014-08-14 16:53   ` Jens Axboe
2014-08-15 12:59   ` Ming Lei
2014-08-15 12:59     ` Ming Lei
2014-08-15 13:11     ` Christoph Hellwig
2014-08-15 13:11       ` Christoph Hellwig
2014-08-15 14:32       ` Ming Lei
2014-08-15 14:32         ` Ming Lei
