* [PATCH v2 0/4] block & aio: improve loop with kernel aio
@ 2015-01-13 15:44 Ming Lei
  2015-01-13 15:44   ` Ming Lei
                   ` (4 more replies)
  0 siblings, 5 replies; 25+ messages in thread
From: Ming Lei @ 2015-01-13 15:44 UTC (permalink / raw)
  To: linux-kernel, Dave Kleikamp
  Cc: Jens Axboe, Zach Brown, Christoph Hellwig, Maxim Patlasov,
	Andrew Morton, Alexander Viro, Benjamin LaHaise

Hi Guys,

The first two patches introduce kernel AIO support; most of the
code is borrowed from Dave's previous work, and thanks to ITER_BVEC
it is much simpler to implement kernel AIO now.

The last two patches apply kernel aio to loop-mq.

Using kernel aio in loop brings the following benefits:
	- double caching is avoided, and memory usage decreases a lot
	- system load decreases a lot

The commit log of patch 4 provides detailed performance data
and system resource monitoring information about using kernel
aio for the loop block driver.

V2:
	- remove 'extra' parameter to aio_kernel_alloc()
	- try to avoid memory allocation inside the queue_rq callback
	- introduce 'use_aio' sysfs file for enabling or disabling kernel aio
V1:
	- link:
		http://marc.info/?t=140803157700004&r=1&w=2
	- improve failure path in aio_kernel_submit()

 drivers/block/loop.c |  173 ++++++++++++++++++++++++++++++++++++++++++++++++--
 drivers/block/loop.h |   11 ++++
 fs/aio.c             |  121 +++++++++++++++++++++++++++++++++++
 fs/direct-io.c       |    9 ++-
 include/linux/aio.h  |   17 ++++-
 5 files changed, 323 insertions(+), 8 deletions(-)


Thanks,
Ming Lei


^ permalink raw reply	[flat|nested] 25+ messages in thread

* [PATCH v2 1/4] aio: add aio_kernel_() interface
  2015-01-13 15:44 [PATCH v2 0/4] block & aio: improve loop with kernel aio Ming Lei
@ 2015-01-13 15:44   ` Ming Lei
  2015-01-13 15:44 ` [PATCH v2 2/4] fd/direct-io: introduce should_dirty for kernel aio Ming Lei
                     ` (3 subsequent siblings)
  4 siblings, 0 replies; 25+ messages in thread
From: Ming Lei @ 2015-01-13 15:44 UTC (permalink / raw)
  To: linux-kernel, Dave Kleikamp
  Cc: Jens Axboe, Zach Brown, Christoph Hellwig, Maxim Patlasov,
	Andrew Morton, Alexander Viro, Benjamin LaHaise, linux-fsdevel,
	open list:AIO, Ming Lei

From: Dave Kleikamp <dave.kleikamp@oracle.com>

This adds an interface that lets kernel callers submit aio iocbs without
going through the user space syscalls.  This lets kernel callers avoid
the management limits and overhead of the context.  It will also let us
integrate aio operations with other kernel APIs that the user space
interface doesn't have access to.

This patch is based on Dave's posts at the links below:

	https://lkml.org/lkml/2013/10/16/365
	https://groups.google.com/forum/#!topic/linux.kernel/l7mogGJZoKQ

Most of the patch comes directly from Dave's work; this version tries
not to couple kernel aio with the linux-aio implementation.

Potential users of these APIs include:

	- the loop block driver, for avoiding double caching and improving
	throughput
	- all kinds of kernel targets (SCSI, USB, ...) which need to access
	files efficiently and have the double cache problem too
	- socket users, who may benefit from the APIs too
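
For illustration, a minimal sketch of a kernel caller driving these
APIs (demo_complete() and demo_read_page() are hypothetical; the bvec
setup and the synchronous wait are only for demonstration):

	/* assumes <linux/aio.h>, <linux/completion.h> and <linux/fs.h> */
	static void demo_complete(u64 user_data, long res)
	{
		/* res is the byte count on success or a negative errno */
		complete((struct completion *)(uintptr_t)user_data);
	}

	static int demo_read_page(struct file *filp, struct page *page, loff_t pos)
	{
		DECLARE_COMPLETION_ONSTACK(done);
		struct bio_vec bv = { .bv_page = page, .bv_len = PAGE_SIZE };
		struct iov_iter iter = {
			.type		= ITER_BVEC,
			.bvec		= &bv,
			.nr_segs	= 1,
			.count		= PAGE_SIZE,
		};
		struct kiocb *iocb = aio_kernel_alloc(GFP_KERNEL);
		int ret;

		if (!iocb)
			return -ENOMEM;
		aio_kernel_init_rw(iocb, filp, iov_iter_count(&iter), pos,
				   demo_complete, (u64)(uintptr_t)&done);
		ret = aio_kernel_submit(iocb, false, &iter);
		if (ret) {
			aio_kernel_free(iocb);	/* never submitted */
			return ret;
		}
		/* the completion callback may fire before submit returns */
		wait_for_completion(&done);
		aio_kernel_free(iocb);
		return 0;
	}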

Cc: Maxim Patlasov <mpatlasov@parallels.com>
Cc: Zach Brown <zab@zabbo.net>
Cc: Dave Kleikamp <dave.kleikamp@oracle.com>
Cc: Benjamin LaHaise <bcrl@kvack.org>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: linux-fsdevel@vger.kernel.org
Cc: linux-aio@kvack.org (open list:AIO)
Signed-off-by: Ming Lei <ming.lei@canonical.com>
---
 fs/aio.c            |  121 +++++++++++++++++++++++++++++++++++++++++++++++++++
 include/linux/aio.h |   17 +++++++-
 2 files changed, 137 insertions(+), 1 deletion(-)

diff --git a/fs/aio.c b/fs/aio.c
index 1b7893e..d044387 100644
--- a/fs/aio.c
+++ b/fs/aio.c
@@ -1044,6 +1044,9 @@ void aio_complete(struct kiocb *iocb, long res, long res2)
 		iocb->ki_ctx = ERR_PTR(-EXDEV);
 		wake_up_process(iocb->ki_obj.tsk);
 		return;
+	} else if (is_kernel_kiocb(iocb)) {
+		iocb->ki_obj.complete(iocb->ki_user_data, res);
+		return;
 	}
 
 	if (iocb->ki_list.next) {
@@ -1503,6 +1506,124 @@ rw_common:
 	return 0;
 }
 
+/*
+ * This allocates an iocb that will be used to submit and track completion of
+ * an IO that is issued from kernel space.
+ *
+ * The caller is expected to call the appropriate aio_kernel_init_() functions
+ * and then call aio_kernel_submit().  From that point forward progress is
+ * guaranteed by the file system aio method.  Eventually the caller's
+ * completion callback will be called.
+ *
+ * These iocbs are special.  They don't have a context, we don't limit the
+ * number pending, and they can't be canceled.
+ */
+struct kiocb *aio_kernel_alloc(gfp_t gfp)
+{
+	return kzalloc(sizeof(struct kiocb), gfp);
+}
+EXPORT_SYMBOL_GPL(aio_kernel_alloc);
+
+void aio_kernel_free(struct kiocb *iocb)
+{
+	kfree(iocb);
+}
+EXPORT_SYMBOL_GPL(aio_kernel_free);
+
+/*
+ * nr is the number of bytes to transfer and off the starting file offset.
+ */
+void aio_kernel_init_rw(struct kiocb *iocb, struct file *filp,
+			size_t nr, loff_t off,
+			void (*complete)(u64 user_data, long res),
+			u64 user_data)
+{
+	iocb->ki_filp = filp;
+	iocb->ki_nbytes = nr;
+	iocb->ki_pos = off;
+	iocb->ki_ctx = KERNEL_AIO_CTX;
+
+	iocb->ki_obj.complete = complete;
+	iocb->ki_user_data = user_data;
+}
+EXPORT_SYMBOL_GPL(aio_kernel_init_rw);
+
+static ssize_t aio_read_iter(struct kiocb *iocb, struct iov_iter *iter)
+{
+	struct file *file = iocb->ki_filp;
+	ssize_t ret;
+
+	if (unlikely(!(file->f_mode & FMODE_READ)))
+		return -EBADF;
+
+	ret = security_file_permission(file, MAY_READ);
+	if (unlikely(ret))
+		return ret;
+
+	if (!file->f_op->read_iter)
+		return -EINVAL;
+
+	return file->f_op->read_iter(iocb, iter);
+}
+
+static ssize_t aio_write_iter(struct kiocb *iocb, struct iov_iter *iter)
+{
+	struct file *file = iocb->ki_filp;
+	ssize_t ret;
+
+	if (unlikely(!(file->f_mode & FMODE_WRITE)))
+		return -EBADF;
+
+	ret = security_file_permission(file, MAY_WRITE);
+	if (unlikely(ret))
+		return ret;
+
+	if (!file->f_op->write_iter)
+		return -EINVAL;
+
+	file_start_write(file);
+	ret = file->f_op->write_iter(iocb, iter);
+	file_end_write(file);
+
+	return ret;
+}
+
+/*
+ * The iocb is our responsibility once this is called.  The caller must not
+ * reference it.
+ *
+ * Callers must be prepared for their iocb completion callback to be called the
+ * moment they enter this function.  The completion callback may be called from
+ * any context.
+ *
+ * Returns: 0: the iocb completion callback will be called with the op result
+ * negative errno: the operation was not submitted and the iocb was freed
+ */
+int aio_kernel_submit(struct kiocb *iocb, bool is_write,
+		      struct iov_iter *iter)
+{
+	int ret = -EINVAL;
+
+	if (WARN_ON(!is_kernel_kiocb(iocb) || !iocb->ki_obj.complete
+			|| !iocb->ki_filp || !(iter->type & ITER_BVEC)))
+		return ret;
+
+	if (is_write)
+		ret = aio_write_iter(iocb, iter);
+	else
+		ret = aio_read_iter(iocb, iter);
+
+	/*
+	 * Use the same policy as userspace aio: the req may have been
+	 * completed already, so it is released via the aio completion path.
+	 */
+	if (ret != -EIOCBQUEUED)
+		iocb->ki_obj.complete(iocb->ki_user_data, ret);
+
+	return 0;
+}
+EXPORT_SYMBOL_GPL(aio_kernel_submit);
+
 static int io_submit_one(struct kioctx *ctx, struct iocb __user *user_iocb,
 			 struct iocb *iocb, bool compat)
 {
diff --git a/include/linux/aio.h b/include/linux/aio.h
index d9c92da..7b4764a 100644
--- a/include/linux/aio.h
+++ b/include/linux/aio.h
@@ -27,17 +27,21 @@ struct kiocb;
  */
 #define KIOCB_CANCELLED		((void *) (~0ULL))
 
+#define KERNEL_AIO_CTX		((void *) -1)
+
 typedef int (kiocb_cancel_fn)(struct kiocb *);
 
 struct kiocb {
 	struct file		*ki_filp;
-	struct kioctx		*ki_ctx;	/* NULL for sync ops */
+	struct kioctx		*ki_ctx;	/* NULL for sync ops,
+						 * -1 for kernel caller */
 	kiocb_cancel_fn		*ki_cancel;
 	void			*private;
 
 	union {
 		void __user		*user;
 		struct task_struct	*tsk;
+		void			(*complete)(u64 user_data, long res);
 	} ki_obj;
 
 	__u64			ki_user_data;	/* user's data for completion */
@@ -59,6 +63,11 @@ static inline bool is_sync_kiocb(struct kiocb *kiocb)
 	return kiocb->ki_ctx == NULL;
 }
 
+static inline bool is_kernel_kiocb(struct kiocb *kiocb)
+{
+	return kiocb->ki_ctx == KERNEL_AIO_CTX;
+}
+
 static inline void init_sync_kiocb(struct kiocb *kiocb, struct file *filp)
 {
 	*kiocb = (struct kiocb) {
@@ -77,6 +86,12 @@ extern void exit_aio(struct mm_struct *mm);
 extern long do_io_submit(aio_context_t ctx_id, long nr,
 			 struct iocb __user *__user *iocbpp, bool compat);
 void kiocb_set_cancel_fn(struct kiocb *req, kiocb_cancel_fn *cancel);
+struct kiocb *aio_kernel_alloc(gfp_t gfp);
+void aio_kernel_free(struct kiocb *iocb);
+void aio_kernel_init_rw(struct kiocb *iocb, struct file *filp, size_t nr,
+			loff_t off, void (*complete)(u64 user_data, long res),
+			u64 user_data);
+int aio_kernel_submit(struct kiocb *iocb, bool is_write, struct iov_iter *iter);
 #else
 static inline ssize_t wait_on_sync_kiocb(struct kiocb *iocb) { return 0; }
 static inline void aio_complete(struct kiocb *iocb, long res, long res2) { }
-- 
1.7.9.5


^ permalink raw reply related	[flat|nested] 25+ messages in thread


* [PATCH v2 2/4] fd/direct-io: introduce should_dirty for kernel aio
  2015-01-13 15:44 [PATCH v2 0/4] block & aio: improve loop with kernel aio Ming Lei
  2015-01-13 15:44   ` Ming Lei
@ 2015-01-13 15:44 ` Ming Lei
  2015-01-25 13:34   ` Christoph Hellwig
  2015-01-13 15:44 ` [PATCH v2 3/4] block: loop: introduce 'use_aio' sysfs file Ming Lei
                   ` (2 subsequent siblings)
  4 siblings, 1 reply; 25+ messages in thread
From: Ming Lei @ 2015-01-13 15:44 UTC (permalink / raw)
  To: linux-kernel, Dave Kleikamp
  Cc: Jens Axboe, Zach Brown, Christoph Hellwig, Maxim Patlasov,
	Andrew Morton, Alexander Viro, Benjamin LaHaise, Ming Lei,
	Omar Sandoval, linux-fsdevel

For pages coming from kernel AIO, it is required to dirty
them before the direct IO.

The idea is borrowed from Dave's previous post.

Also set ->should_dirty only for ITER_IOVEC pages, based on
the recent discussion in the following link:

	http://marc.info/?t=141904556800003&r=1&w=2

Cc: Maxim Patlasov <mpatlasov@parallels.com>
Cc: Omar Sandoval <osandov@osandov.com>
Cc: Zach Brown <zab@zabbo.net>
Cc: Dave Kleikamp <dave.kleikamp@oracle.com>
Cc: Benjamin LaHaise <bcrl@kvack.org>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: linux-fsdevel@vger.kernel.org
Signed-off-by: Ming Lei <ming.lei@canonical.com>
---
 fs/direct-io.c |    9 ++++++---
 1 file changed, 6 insertions(+), 3 deletions(-)

diff --git a/fs/direct-io.c b/fs/direct-io.c
index e181b6b..4dd6d14 100644
--- a/fs/direct-io.c
+++ b/fs/direct-io.c
@@ -121,6 +121,7 @@ struct dio {
 	int page_errors;		/* errno from get_user_pages() */
 	int is_async;			/* is IO async ? */
 	bool defer_completion;		/* defer AIO completion to workqueue? */
+	bool should_dirty;		/* if page should be dirty */
 	int io_error;			/* IO error in completion path */
 	unsigned long refcount;		/* direct_io_worker() and bios */
 	struct bio *bio_list;		/* singly linked via bi_private */
@@ -392,7 +393,7 @@ static inline void dio_bio_submit(struct dio *dio, struct dio_submit *sdio)
 	dio->refcount++;
 	spin_unlock_irqrestore(&dio->bio_lock, flags);
 
-	if (dio->is_async && dio->rw == READ)
+	if (dio->is_async && dio->rw == READ && dio->should_dirty)
 		bio_set_pages_dirty(bio);
 
 	if (sdio->submit_io)
@@ -463,13 +464,14 @@ static int dio_bio_complete(struct dio *dio, struct bio *bio)
 	if (!uptodate)
 		dio->io_error = -EIO;
 
-	if (dio->is_async && dio->rw == READ) {
+	if (dio->is_async && dio->rw == READ && dio->should_dirty) {
 		bio_check_pages_dirty(bio);	/* transfers ownership */
 	} else {
 		bio_for_each_segment_all(bvec, bio, i) {
 			struct page *page = bvec->bv_page;
 
-			if (dio->rw == READ && !PageCompound(page))
+			if (dio->rw == READ && !PageCompound(page) &&
+					dio->should_dirty)
 				set_page_dirty_lock(page);
 			page_cache_release(page);
 		}
@@ -1217,6 +1219,7 @@ do_blockdev_direct_IO(int rw, struct kiocb *iocb, struct inode *inode,
 
 	spin_lock_init(&dio->bio_lock);
 	dio->refcount = 1;
+	dio->should_dirty = iter_is_iovec(iter);
 
 	sdio.iter = iter;
 	sdio.final_block_in_request =
-- 
1.7.9.5


^ permalink raw reply related	[flat|nested] 25+ messages in thread

* [PATCH v2 3/4] block: loop: introduce 'use_aio' sysfs file
  2015-01-13 15:44 [PATCH v2 0/4] block & aio: improve loop with kernel aio Ming Lei
  2015-01-13 15:44   ` Ming Lei
  2015-01-13 15:44 ` [PATCH v2 2/4] fd/direct-io: introduce should_dirty for kernel aio Ming Lei
@ 2015-01-13 15:44 ` Ming Lei
  2015-01-25 13:35   ` Christoph Hellwig
  2015-01-26 17:57   ` Jeff Moyer
  2015-01-13 15:44 ` [PATCH v2 4/4] block: loop: support to submit I/O via kernel aio based Ming Lei
  2015-01-13 16:23 ` [PATCH v2 0/4] block & aio: improve loop with kernel aio Christoph Hellwig
  4 siblings, 2 replies; 25+ messages in thread
From: Ming Lei @ 2015-01-13 15:44 UTC (permalink / raw)
  To: linux-kernel, Dave Kleikamp
  Cc: Jens Axboe, Zach Brown, Christoph Hellwig, Maxim Patlasov,
	Andrew Morton, Alexander Viro, Benjamin LaHaise, Ming Lei

So that users can control if kernel aio is used to submit I/O.
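
For example, toggling the attribute from userspace (a hypothetical
helper; the path assumes the attribute lands in the existing "loop"
sysfs group, i.e. /sys/block/loopN/loop/use_aio):

	#include <fcntl.h>
	#include <stdio.h>
	#include <unistd.h>

	int main(int argc, char **argv)
	{
		/* write "0" to disable kernel aio, anything else enables it */
		const char *val = (argc > 1 && argv[1][0] == '0') ? "0" : "1";
		int fd = open("/sys/block/loop0/loop/use_aio", O_WRONLY);

		if (fd < 0) {
			perror("open");
			return 1;
		}
		if (write(fd, val, 1) != 1)
			perror("write");
		close(fd);
		return 0;
	}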

Signed-off-by: Ming Lei <ming.lei@canonical.com>
---
 drivers/block/loop.c |   33 +++++++++++++++++++++++++++++++++
 drivers/block/loop.h |    1 +
 2 files changed, 34 insertions(+)

diff --git a/drivers/block/loop.c b/drivers/block/loop.c
index d1f168b..47af456 100644
--- a/drivers/block/loop.c
+++ b/drivers/block/loop.c
@@ -659,6 +659,38 @@ static ssize_t loop_attr_partscan_show(struct loop_device *lo, char *buf)
 	return sprintf(buf, "%s\n", partscan ? "1" : "0");
 }
 
+static ssize_t loop_attr_do_show_use_aio(struct device *dev,
+				struct device_attribute *attr, char *buf)
+{
+	struct gendisk *disk = dev_to_disk(dev);
+	struct loop_device *lo = disk->private_data;
+
+	return sprintf(buf, "%s\n", lo->use_aio ? "1" : "0");
+}
+
+static ssize_t loop_attr_do_store_use_aio(struct device *dev,
+		struct device_attribute *attr, const char *buf,
+		size_t count)
+{
+	struct gendisk *disk = dev_to_disk(dev);
+	struct loop_device *lo = disk->private_data;
+	int err;
+	unsigned long v;
+
+	err = kstrtoul(buf, 10, &v);
+	if (err < 0)
+		return err;
+	if (v)
+		lo->use_aio = true;
+	else
+		lo->use_aio = false;
+	return count;
+}
+
+static struct device_attribute loop_attr_use_aio =
+	__ATTR(use_aio, S_IRUGO | S_IWUSR, loop_attr_do_show_use_aio,
+			loop_attr_do_store_use_aio);
+
 LOOP_ATTR_RO(backing_file);
 LOOP_ATTR_RO(offset);
 LOOP_ATTR_RO(sizelimit);
@@ -671,6 +703,7 @@ static struct attribute *loop_attrs[] = {
 	&loop_attr_sizelimit.attr,
 	&loop_attr_autoclear.attr,
 	&loop_attr_partscan.attr,
+	&loop_attr_use_aio.attr,
 	NULL,
 };
 
diff --git a/drivers/block/loop.h b/drivers/block/loop.h
index 301c27f..15049e9 100644
--- a/drivers/block/loop.h
+++ b/drivers/block/loop.h
@@ -57,6 +57,7 @@ struct loop_device {
 	struct list_head	write_cmd_head;
 	struct work_struct	write_work;
 	bool			write_started;
+	bool			use_aio;
 	int			lo_state;
 	struct mutex		lo_ctl_mutex;
 
-- 
1.7.9.5


^ permalink raw reply related	[flat|nested] 25+ messages in thread

* [PATCH v2 4/4] block: loop: support to submit I/O via kernel aio based
  2015-01-13 15:44 [PATCH v2 0/4] block & aio: improve loop with kernel aio Ming Lei
                   ` (2 preceding siblings ...)
  2015-01-13 15:44 ` [PATCH v2 3/4] block: loop: introduce 'use_aio' sysfs file Ming Lei
@ 2015-01-13 15:44 ` Ming Lei
  2015-01-25 13:40   ` Christoph Hellwig
  2015-03-18 18:28   ` Maxim Patlasov
  2015-01-13 16:23 ` [PATCH v2 0/4] block & aio: improve loop with kernel aio Christoph Hellwig
  4 siblings, 2 replies; 25+ messages in thread
From: Ming Lei @ 2015-01-13 15:44 UTC (permalink / raw)
  To: linux-kernel, Dave Kleikamp
  Cc: Jens Axboe, Zach Brown, Christoph Hellwig, Maxim Patlasov,
	Andrew Morton, Alexander Viro, Benjamin LaHaise, Ming Lei

Part of the patch is based on Dave's previous post.

This patch submits I/O to the filesystem via kernel aio, which
brings the following benefits:

	- double caching in both the loop filesystem and the backing
	file is avoided
	- context switches decrease a lot, so CPU utilization
	drops as well
	- cached memory usage decreases a lot

One main side effect is that throughput decreases when the raw
loop block device is accessed directly (not through a filesystem)
with kernel aio.

This patch has passed xfstests (./check -g auto), with both test
and scratch devices on loop block devices and ext4 as the filesystem.

The results of two fio tests follow:

1. fio test inside an ext4 file system over a loop block device
1) How to run
	- linux kernel base: 3.19.0-rc3-next-20150108 (loop-mq merged)
	- loop over SSD image 1 in ext4
	- fio psync engine, 16 jobs, size 200M, ext4 over the loop block device
	- test result: IOPS from fio output

2) Throughput result (IOPS):
	-------------------------------------------------------------
	test cases          |randread   |read   |randwrite  |write  |
	-------------------------------------------------------------
	base                |16799      |59508  |31059      |58829  |
	-------------------------------------------------------------
	base+kernel aio     |15480      |64453  |30187      |57222  |
	-------------------------------------------------------------

3) CPU
	- context switches decreased to 1/3 ~ 1/2 with kernel aio,
	depending on load; see 'Contexts' of [1] and [2]
	- CPU utilization decreased to 1/2 ~ 2/3 with kernel aio,
	depending on load; see 'CPUs' of [1] and [2]
	- fewer processes created with kernel aio; see 'Processes' of
	[1] and [2]

4) memory (free, cached)
	- after these four tests with kernel aio: ~10% of memory is used
	- after these four tests without kernel aio: ~60% of memory is used
	- see 'Memory Usage' of [1] and [2]

2. fio test over the loop block device directly
1) How to run
	- linux kernel base: 3.19.0-rc3-next-20150108 (loop-mq merged)
	- loop over SSD image 2 in ext4
	- linux native aio, O_DIRECT, 4K block size, iodepth 64, one job
	over the loop block device
	- test result: IOPS from fio output

2) Throughput result (IOPS):
	-------------------------------------------------------------
	test cases          |randread   |read   |randwrite  |write  |
	-------------------------------------------------------------
	base                |24568      |55141  |34231      |43694  |
	-------------------------------------------------------------
	base+kernel aio     |25130      |22813  |24441      |40880  |
	-------------------------------------------------------------

3) CPU:
	- CPU utilization decreased to 1/2 ~ 2/3 with kernel aio during
	the randread, read and randwrite tests, but increased a bit in
	the write test; see 'CPUs' of [3] and [4]
	- context switches show a similar result; see 'Contexts'
	of [3] and [4]
	- fewer processes created in the randread test, a few more
	processes in the write test with kernel aio

4) Memory:
	- after these four tests with kernel aio: ~15% of memory is used
	- after these four tests without kernel aio: ~90% of memory is used
	- see 'Memory Usage' of [3] and [4]

3. sar monitoring results in graphical form
[1] linux kernel base: sar monitoring result for fio test 1
	http://kernel.ubuntu.com/~ming/block/loop-mq-aio/v2/vm-loop-mq-fio-ext4.pdf

[2] linux kernel base plus kernel aio patches: sar monitoring result for fio test 1
	http://kernel.ubuntu.com/~ming/block/loop-mq-aio/v2/vm-loop-mq-aio-fio-ext4.pdf

[3] linux kernel base: sar monitoring result for fio test 2
	http://kernel.ubuntu.com/~ming/block/loop-mq-aio/v2/vm-loop-mq-fio-disk.pdf

[4] linux kernel base plus kernel aio patches: sar monitoring result for fio test 2
	http://kernel.ubuntu.com/~ming/block/loop-mq-aio/v2/vm-loop-mq-aio-fio-disk.pdf

Cc: Maxim Patlasov <mpatlasov@parallels.com>
Cc: Zach Brown <zab@zabbo.net>
Cc: Dave Kleikamp <dave.kleikamp@oracle.com>
Cc: Benjamin LaHaise <bcrl@kvack.org>
Signed-off-by: Ming Lei <ming.lei@canonical.com>
---
 drivers/block/loop.c |  140 ++++++++++++++++++++++++++++++++++++++++++++++++--
 drivers/block/loop.h |   10 ++++
 2 files changed, 146 insertions(+), 4 deletions(-)

diff --git a/drivers/block/loop.c b/drivers/block/loop.c
index 47af456..bce06e7 100644
--- a/drivers/block/loop.c
+++ b/drivers/block/loop.c
@@ -450,10 +450,84 @@ static int lo_req_flush(struct loop_device *lo, struct request *rq)
 	return ret;
 }
 
+#ifdef CONFIG_AIO
+static void lo_rw_aio_complete(u64 data, long res)
+{
+	struct loop_cmd *cmd = (struct loop_cmd *)(uintptr_t)data;
+	struct request *rq = cmd->rq;
+
+	if (res > 0)
+		res = 0;
+	else if (res < 0)
+		res = -EIO;
+
+	kfree(cmd->alloc_bv);
+	rq->errors = res;
+	blk_mq_complete_request(rq);
+}
+
+static int lo_rw_aio(struct loop_device *lo, struct loop_cmd *cmd,
+		     bool write, loff_t pos)
+{
+	unsigned int i = 0;
+	struct iov_iter iter;
+	struct bio_vec *bvec, bv;
+	size_t nr_segs = 0;
+	struct req_iterator r_iter;
+
+	rq_for_each_segment(bv, cmd->rq, r_iter)
+		nr_segs++;
+
+	if (nr_segs > LOOP_CMD_BVEC_CNT) {
+		cmd->alloc_bv = kmalloc(nr_segs * sizeof(*cmd->alloc_bv),
+					GFP_NOIO);
+		if (!cmd->alloc_bv)
+			return -ENOMEM;
+		bvec = cmd->alloc_bv;
+	} else {
+		bvec = cmd->bv;
+		cmd->alloc_bv = NULL;
+	}
+
+	rq_for_each_segment(bv, cmd->rq, r_iter)
+		bvec[i++] = bv;
+
+	iter.type = ITER_BVEC | (write ? WRITE : 0);
+	iter.bvec = bvec;
+	iter.nr_segs = nr_segs;
+	iter.count = blk_rq_bytes(cmd->rq);
+	iter.iov_offset = 0;
+
+	aio_kernel_init_rw(&cmd->iocb, lo->lo_backing_file,
+			   iov_iter_count(&iter), pos,
+			   lo_rw_aio_complete, (u64)(uintptr_t)cmd);
+
+	return aio_kernel_submit(&cmd->iocb, write, &iter);
+}
+#else
+static int lo_rw_aio(struct loop_device *lo, struct loop_cmd *cmd,
+		     bool write, loff_t pos)
+{
+	return -EIO;
+}
+#endif /* CONFIG_AIO */
+
+static int lo_io_rw(struct loop_device *lo, struct loop_cmd *cmd,
+		    bool write, loff_t pos)
+{
+	if (cmd->use_aio)
+		return lo_rw_aio(lo, cmd, write, pos);
+	if (write)
+		return lo_send(lo, cmd->rq, pos);
+	else
+		return lo_receive(lo, cmd->rq, lo->lo_blocksize, pos);
+}
+
 static int do_req_filebacked(struct loop_device *lo, struct request *rq)
 {
 	loff_t pos;
 	int ret;
+	struct loop_cmd *cmd = blk_mq_rq_to_pdu(rq);
 
 	pos = ((loff_t) blk_rq_pos(rq) << 9) + lo->lo_offset;
 
@@ -463,9 +537,9 @@ static int do_req_filebacked(struct loop_device *lo, struct request *rq)
 		else if (rq->cmd_flags & REQ_DISCARD)
 			ret = lo_discard(lo, rq, pos);
 		else
-			ret = lo_send(lo, rq, pos);
+			ret = lo_io_rw(lo, cmd, true, pos);
 	} else
-		ret = lo_receive(lo, rq, lo->lo_blocksize, pos);
+		ret = lo_io_rw(lo, cmd, false, pos);
 
 	return ret;
 }
@@ -684,6 +758,15 @@ static ssize_t loop_attr_do_store_use_aio(struct device *dev,
 		lo->use_aio = true;
 	else
 		lo->use_aio = false;
+
+	if (lo->use_aio != lo->can_use_aio) {
+		if (lo->use_aio)
+			return -EPERM;
+
+		lo->lo_backing_file->f_flags &= ~O_DIRECT;
+		lo->can_use_aio = false;
+	}
+
 	return count;
 }
 
@@ -803,6 +886,14 @@ static int loop_set_fd(struct loop_device *lo, fmode_t mode,
 	    !file->f_op->write)
 		lo_flags |= LO_FLAGS_READ_ONLY;
 
+#ifdef CONFIG_AIO
+	if (file->f_op->write_iter && file->f_op->read_iter &&
+	    mapping->a_ops->direct_IO) {
+		file->f_flags |= O_DIRECT;
+		lo->can_use_aio = true;
+	}
+#endif
+
 	lo_blocksize = S_ISBLK(inode->i_mode) ?
 		inode->i_bdev->bd_block_size : PAGE_SIZE;
 
@@ -836,6 +927,14 @@ static int loop_set_fd(struct loop_device *lo, fmode_t mode,
 
 	set_blocksize(bdev, lo_blocksize);
 
+	/*
+	 * We must not send too-small direct-io requests, so we reflect
+	 * the minimum io size to the loop device's logical block size
+	 */
+	if (lo->can_use_aio && inode->i_sb->s_bdev)
+		blk_queue_logical_block_size(lo->lo_queue,
+					     bdev_io_min(inode->i_sb->s_bdev));
+
 	lo->lo_state = Lo_bound;
 	if (part_shift)
 		lo->lo_flags |= LO_FLAGS_PARTSCAN;
@@ -1506,14 +1605,33 @@ int loop_unregister_transfer(int number)
 EXPORT_SYMBOL(loop_register_transfer);
 EXPORT_SYMBOL(loop_unregister_transfer);
 
+/* return true for single queue schedule */
+static bool loop_prep_sched_rq(struct loop_cmd *cmd)
+{
+	struct loop_device *lo = cmd->rq->q->queuedata;
+	bool single_queue = false;
+
+	cmd->use_aio = false;
+	if (lo->can_use_aio && (lo->transfer == transfer_none)) {
+		if (!(cmd->rq->cmd_flags & (REQ_FLUSH | REQ_DISCARD)))
+			cmd->use_aio = true;
+	}
+
+	if ((cmd->rq->cmd_flags & REQ_WRITE) || cmd->use_aio)
+		single_queue = true;
+
+	return single_queue;
+}
+
 static int loop_queue_rq(struct blk_mq_hw_ctx *hctx,
 		const struct blk_mq_queue_data *bd)
 {
 	struct loop_cmd *cmd = blk_mq_rq_to_pdu(bd->rq);
+	bool single_queue = loop_prep_sched_rq(cmd);
 
 	blk_mq_start_request(bd->rq);
 
-	if (cmd->rq->cmd_flags & REQ_WRITE) {
+	if (single_queue) {
 		struct loop_device *lo = cmd->rq->q->queuedata;
 		bool need_sched = true;
 
@@ -1551,7 +1669,8 @@ static void loop_handle_cmd(struct loop_cmd *cmd)
  failed:
 	if (ret)
 		cmd->rq->errors = -EIO;
-	blk_mq_complete_request(cmd->rq);
+	if (!cmd->use_aio || ret)
+		blk_mq_complete_request(cmd->rq);
 }
 
 static void loop_queue_write_work(struct work_struct *work)
@@ -1653,6 +1772,19 @@ static int loop_add(struct loop_device **l, int i)
 	INIT_LIST_HEAD(&lo->write_cmd_head);
 	INIT_WORK(&lo->write_work, loop_queue_write_work);
 
+	blk_queue_max_segments(lo->lo_queue, LOOP_CMD_SEG_CNT);
+	blk_queue_max_hw_sectors(lo->lo_queue, -1U);
+	blk_queue_max_segment_size(lo->lo_queue, -1U);
+
+	/*
+	 * Kernel aio avoids double caching and decreases CPU load,
+	 * and it doesn't hurt throughput much if the I/O originates
+	 * from a file system.  However, consider disabling kernel aio
+	 * via sysfs to obtain better throughput when the loop block
+	 * device is accessed directly.
+	 */
+	lo->use_aio = true;
+
 	disk = lo->lo_disk = alloc_disk(1 << part_shift);
 	if (!disk)
 		goto out_free_queue;
diff --git a/drivers/block/loop.h b/drivers/block/loop.h
index 15049e9..c917633 100644
--- a/drivers/block/loop.h
+++ b/drivers/block/loop.h
@@ -16,6 +16,8 @@
 #include <linux/mutex.h>
 #include <linux/workqueue.h>
 #include <uapi/linux/loop.h>
+#include <linux/aio.h>
+#include <linux/scatterlist.h>
 
 /* Possible states of device */
 enum {
@@ -24,6 +26,9 @@ enum {
 	Lo_rundown,
 };
 
+#define LOOP_CMD_SEG_CNT    32
+#define LOOP_CMD_BVEC_CNT   (LOOP_CMD_SEG_CNT * 4)
+
 struct loop_func_table;
 
 struct loop_device {
@@ -58,6 +63,7 @@ struct loop_device {
 	struct work_struct	write_work;
 	bool			write_started;
 	bool			use_aio;
+	bool			can_use_aio;
 	int			lo_state;
 	struct mutex		lo_ctl_mutex;
 
@@ -70,6 +76,10 @@ struct loop_cmd {
 	struct work_struct read_work;
 	struct request *rq;
 	struct list_head list;
+	bool use_aio;
+	struct kiocb iocb;
+	struct bio_vec bv[LOOP_CMD_BVEC_CNT];
+	struct bio_vec *alloc_bv;
 };
 
 /* Support for loadable transfer modules */
-- 
1.7.9.5


^ permalink raw reply related	[flat|nested] 25+ messages in thread

* Re: [PATCH v2 0/4] block & aio: improve loop with kernel aio
  2015-01-13 15:44 [PATCH v2 0/4] block & aio: improve loop with kernel aio Ming Lei
                   ` (3 preceding siblings ...)
  2015-01-13 15:44 ` [PATCH v2 4/4] block: loop: support to submit I/O via kernel aio based Ming Lei
@ 2015-01-13 16:23 ` Christoph Hellwig
  2015-01-14 10:17   ` Ming Lei
  4 siblings, 1 reply; 25+ messages in thread
From: Christoph Hellwig @ 2015-01-13 16:23 UTC (permalink / raw)
  To: Ming Lei
  Cc: linux-kernel, Dave Kleikamp, Jens Axboe, Zach Brown,
	Christoph Hellwig, Maxim Patlasov, Andrew Morton, Alexander Viro,
	Benjamin LaHaise

On Tue, Jan 13, 2015 at 11:44:44PM +0800, Ming Lei wrote:
> Using kernel aio in loop brings the following benefits:
> 	- double caching is avoided, and memory usage decreases a lot
> 	- system load decreases a lot

This seems to conflate two changes:

 1) use direct I/O
 2) use aio (require 1)

what numbers do you get when just using direct I/O?

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH v2 0/4] block & aio: improve loop with kernel aio
  2015-01-13 16:23 ` [PATCH v2 0/4] block & aio: improve loop with kernel aio Christoph Hellwig
@ 2015-01-14 10:17   ` Ming Lei
  0 siblings, 0 replies; 25+ messages in thread
From: Ming Lei @ 2015-01-14 10:17 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Linux Kernel Mailing List, Dave Kleikamp, Jens Axboe, Zach Brown,
	Maxim Patlasov, Andrew Morton, Alexander Viro, Benjamin LaHaise

On Wed, Jan 14, 2015 at 12:23 AM, Christoph Hellwig <hch@infradead.org> wrote:
> On Tue, Jan 13, 2015 at 11:44:44PM +0800, Ming Lei wrote:
>> Using kernel aio in loop brings the following benefits:
>>       - double caching is avoided, and memory usage decreases a lot
>>       - system load decreases a lot
>
> This seems to conflate two changes:
>
>  1) use direct I/O
>  2) use aio (require 1)
>
> what numbers do you get when just using direct I/O?

I didn't run the direct-I/O-only test because it needs quite
a few changes to the current patches, such as making the kiocb
synchronous and allowing all dio requests to run concurrently
for the sake of throughput.

IMO, without aio, context switches will inevitably increase and
CPU will be wasted, so could you explain whether there is an
advantage to not using aio?

Thanks,
Ming Lei

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH v2 1/4] aio: add aio_kernel_() interface
  2015-01-13 15:44   ` Ming Lei
  (?)
@ 2015-01-25 13:31   ` Christoph Hellwig
  2015-01-26 16:18       ` Ming Lei
  -1 siblings, 1 reply; 25+ messages in thread
From: Christoph Hellwig @ 2015-01-25 13:31 UTC (permalink / raw)
  To: Ming Lei
  Cc: linux-kernel, Dave Kleikamp, Jens Axboe, Zach Brown,
	Christoph Hellwig, Maxim Patlasov, Andrew Morton, Alexander Viro,
	Benjamin LaHaise, linux-fsdevel, open list:AIO

> +struct kiocb *aio_kernel_alloc(gfp_t gfp)
> +{
> +	return kzalloc(sizeof(struct kiocb), gfp);
> +}
> +EXPORT_SYMBOL_GPL(aio_kernel_alloc);
> +
> +void aio_kernel_free(struct kiocb *iocb)
> +{
> +	kfree(iocb);
> +}
> +EXPORT_SYMBOL_GPL(aio_kernel_free);

Both functions don't actually seem to be used in this patch set.

> +void aio_kernel_init_rw(struct kiocb *iocb, struct file *filp,
> +			size_t nr, loff_t off,
> +			void (*complete)(u64 user_data, long res),
> +			u64 user_data)

> +int aio_kernel_submit(struct kiocb *iocb, bool is_write,
> +		      struct iov_iter *iter)

Why do we keep these two separate?  Especially having the iov passed
in the second, and the count in the first, seems rather confusing, as
we shouldn't even need both for a high-level API.  Also the private
data should really be a void pointer for the kernel, or simply be
left out, as we can assume the iocb is embedded into a caller
data structure and container_of can be used to find that structure.

Also it might make sense to just offer aio_kernel_read/write interfaces
instead of the common submit wrapper, as that's much closer to other
kernel APIs, e.g.

int aio_kernel_read(struct kiocb *iocb, struct file *file,
		struct iov_iter *iter, loff_t off,
		void (*complete)(struct kiocb *iocb, long res));
int aio_kernel_write(struct kiocb *iocb, struct file *file,
		struct iov_iter *iter, loff_t off,
		void (*complete)(struct kiocb *iocb, long res));

> +	if (WARN_ON(!is_kernel_kiocb(iocb) || !iocb->ki_obj.complete
> +			|| !iocb->ki_filp || !(iter->type & ITER_BVEC)))

Why do you want to limit what the iov_iter can contain?  iovec-based
ones seem very useful, and although I can come up with a use case
for vectors pointing to userspace addresses I can't see anything that
speaks against allowing them either.


^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH v2 2/4] fd/direct-io: introduce should_dirty for kernel aio
  2015-01-13 15:44 ` [PATCH v2 2/4] fd/direct-io: introduce should_dirty for kernel aio Ming Lei
@ 2015-01-25 13:34   ` Christoph Hellwig
  2015-01-27 16:05     ` Ming Lei
  0 siblings, 1 reply; 25+ messages in thread
From: Christoph Hellwig @ 2015-01-25 13:34 UTC (permalink / raw)
  To: Ming Lei
  Cc: linux-kernel, Dave Kleikamp, Jens Axboe, Zach Brown,
	Christoph Hellwig, Maxim Patlasov, Andrew Morton, Alexander Viro,
	Benjamin LaHaise, Omar Sandoval, linux-fsdevel

On Tue, Jan 13, 2015 at 11:44:46PM +0800, Ming Lei wrote:
>  
> -	if (dio->is_async && dio->rw == READ)
> +	if (dio->is_async && dio->rw == READ && dio->should_dirty)
>  		bio_set_pages_dirty(bio);
>  
>  	if (sdio->submit_io)
> @@ -463,13 +464,14 @@ static int dio_bio_complete(struct dio *dio, struct bio *bio)
>  	if (!uptodate)
>  		dio->io_error = -EIO;
>  
> -	if (dio->is_async && dio->rw == READ) {
> +	if (dio->is_async && dio->rw == READ && dio->should_dirty) {

I'd rather have a ->should_dirty flag that means we need to call
bio_check_pages_dirty, and set that either if we have a kernel
iovec/bvec or if dio->is_async && dio->rw == READ.

But why would we even set this if writing from an iovec?

>  		bio_check_pages_dirty(bio);	/* transfers ownership */
>  	} else {
>  		bio_for_each_segment_all(bvec, bio, i) {
>  			struct page *page = bvec->bv_page;
>  
> -			if (dio->rw == READ && !PageCompound(page))
> +			if (dio->rw == READ && !PageCompound(page) &&
> +					dio->should_dirty)
>  				set_page_dirty_lock(page);

And this hunk could also use some explanation.


^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH v2 3/4] block: loop: introduce 'use_aio' sysfs file
  2015-01-13 15:44 ` [PATCH v2 3/4] block: loop: introduce 'use_aio' sysfs file Ming Lei
@ 2015-01-25 13:35   ` Christoph Hellwig
  2015-01-27  5:26     ` Ming Lei
  2015-01-26 17:57   ` Jeff Moyer
  1 sibling, 1 reply; 25+ messages in thread
From: Christoph Hellwig @ 2015-01-25 13:35 UTC (permalink / raw)
  To: Ming Lei
  Cc: linux-kernel, Dave Kleikamp, Jens Axboe, Zach Brown,
	Christoph Hellwig, Maxim Patlasov, Andrew Morton, Alexander Viro,
	Benjamin LaHaise

On Tue, Jan 13, 2015 at 11:44:47PM +0800, Ming Lei wrote:
> So that users can control if kernel aio is used to submit I/O.

Why do you want a sysfs flag for this instead of a flags in ->lo_flags
at setup time?  That's how all other loop-related ABI flags work.
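
For instance, such a flag might look like this (hypothetical value,
following the existing LO_FLAGS_* pattern in <uapi/linux/loop.h>,
where LO_FLAGS_PARTSCAN is 8):

	#define LO_FLAGS_USE_AIO	16	/* hypothetical next free bit */

	/* then, e.g. in loop_set_status(): */
	if (info->lo_flags & LO_FLAGS_USE_AIO)
		lo->use_aio = true;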


^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH v2 4/4] block: loop: support to submit I/O via kernel aio based
  2015-01-13 15:44 ` [PATCH v2 4/4] block: loop: support to submit I/O via kernel aio based Ming Lei
@ 2015-01-25 13:40   ` Christoph Hellwig
  2015-03-18 18:28   ` Maxim Patlasov
  1 sibling, 0 replies; 25+ messages in thread
From: Christoph Hellwig @ 2015-01-25 13:40 UTC (permalink / raw)
  To: Ming Lei
  Cc: linux-kernel, Dave Kleikamp, Jens Axboe, Zach Brown,
	Christoph Hellwig, Maxim Patlasov, Andrew Morton, Alexander Viro,
	Benjamin LaHaise

> +{
> +	unsigned int i = 0;
> +	struct iov_iter iter;
> +	struct bio_vec *bvec, bv;
> +	size_t nr_segs = 0;
> +	struct req_iterator r_iter;
> +
> +	rq_for_each_segment(bv, cmd->rq, r_iter)
> +		nr_segs++;
> +
> +	if (nr_segs > LOOP_CMD_BVEC_CNT) {
> +		cmd->alloc_bv = kmalloc(nr_segs * sizeof(*cmd->alloc_bv),
> +					GFP_NOIO);
> +		if (!cmd->alloc_bv)
> +			return -ENOMEM;
> +		bvec = cmd->alloc_bv;
> +	} else {
> +		bvec = cmd->bv;
> +		cmd->alloc_bv = NULL;
> +	}
> +
> +	rq_for_each_segment(bv, cmd->rq, r_iter)
> +		bvec[i++] = bv;

What prevents us from simply using the bvecs in the request?  Sure,
we'd need to disable merging for this case, but it would avoid the
copy and the potential memory allocation.
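
For illustration, building the iov_iter directly on the request's bio
vector might look roughly like this (a sketch only, assuming merging
is disabled so each request carries a single bio):

	struct bio *bio = cmd->rq->bio;

	iter.type = ITER_BVEC | (write ? WRITE : 0);
	iter.bvec = bio->bi_io_vec + bio->bi_iter.bi_idx;
	iter.nr_segs = bio->bi_vcnt - bio->bi_iter.bi_idx;
	iter.iov_offset = bio->bi_iter.bi_bvec_done;
	iter.count = bio->bi_iter.bi_size;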

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH v2 1/4] aio: add aio_kernel_() interface
  2015-01-25 13:31   ` Christoph Hellwig
@ 2015-01-26 16:18       ` Ming Lei
  0 siblings, 0 replies; 25+ messages in thread
From: Ming Lei @ 2015-01-26 16:18 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Linux Kernel Mailing List, Dave Kleikamp, Jens Axboe, Zach Brown,
	Maxim Patlasov, Andrew Morton, Alexander Viro, Benjamin LaHaise,
	Linux FS Devel, open list:AIO

On Sun, Jan 25, 2015 at 9:31 PM, Christoph Hellwig <hch@infradead.org> wrote:
>> +struct kiocb *aio_kernel_alloc(gfp_t gfp)
>> +{
>> +     return kzalloc(sizeof(struct kiocb), gfp);
>> +}
>> +EXPORT_SYMBOL_GPL(aio_kernel_alloc);
>> +
>> +void aio_kernel_free(struct kiocb *iocb)
>> +{
>> +     kfree(iocb);
>> +}
>> +EXPORT_SYMBOL_GPL(aio_kernel_free);
>
> Both functions don't actually seem to be used in this patch set.

My fault; it is just v2 that stops using them.

>> +void aio_kernel_init_rw(struct kiocb *iocb, struct file *filp,
>> +                     size_t nr, loff_t off,
>> +                     void (*complete)(u64 user_data, long res),
>> +                     u64 user_data)
>
>> +int aio_kernel_submit(struct kiocb *iocb, bool is_write,
>> +                   struct iov_iter *iter)
>
> Why do we keep these two separate?  Especially having the iov passed

No special meaning, it just follows the previous patches, :-)

But one benefit is that we can separate the one-shot
initialization from submission; at least filp/complete/ki_ctx can
be set during initialization.

> in the second, and the count in the first, seems rather confusing, as
> we shouldn't even need both for a high-level API.  Also the private
> data should really be a void pointer for the kernel, or simply be
> left out, as we can assume the iocb is embedded into a caller
> data structure and container_of can be used to find that structure.

Either one is OK.

> Also it might make sense to just offer aio_kernel_read/write interfaces
> instead of the common submit wrapper, as that's much closer to other
> kernel APIs, e.g.
>
> int aio_kernel_read(struct kiocb *iocb, struct file *file,
>                 struct iov_iter *iter, loff_t off,
>                 void (*complete)(struct kiocb *iocb, long res));
> int aio_kernel_write(struct kiocb *iocb, struct file *file,
>                 struct iov_iter *iter, loff_t off,
>                 void (*complete)(struct kiocb *iocb, long res));

That is the style of the sync APIs; submit/complete looks like the
common pattern for async APIs, e.g. io_submit().

>> +     if (WARN_ON(!is_kernel_kiocb(iocb) || !iocb->ki_obj.complete
>> +                     || !iocb->ki_filp || !(iter->type & ITER_BVEC)))
>
> Why do you want to limit what the iov_iter can contain?  iovec-based
> ones seem very useful, and although I can come up with a use case
> for vectors pointing to userspace addresses I can't see anything that
> speaks against allowing them either.

Yes, we should allow KVEC at least.


Thanks,
Ming Lei

^ permalink raw reply	[flat|nested] 25+ messages in thread


* Re: [PATCH v2 1/4] aio: add aio_kernel_() interface
  2015-01-26 16:18       ` Ming Lei
  (?)
@ 2015-01-26 17:00       ` Christoph Hellwig
  2015-01-27 13:57         ` Ming Lei
  -1 siblings, 1 reply; 25+ messages in thread
From: Christoph Hellwig @ 2015-01-26 17:00 UTC (permalink / raw)
  To: Ming Lei
  Cc: Christoph Hellwig, Linux Kernel Mailing List, Dave Kleikamp,
	Jens Axboe, Zach Brown, Maxim Patlasov, Andrew Morton,
	Alexander Viro, Benjamin LaHaise, Linux FS Devel, open list:AIO

On Tue, Jan 27, 2015 at 12:18:23AM +0800, Ming Lei wrote:
> > Also it might make sense to just offer aio_kernel_read/write interfaces
> > instead of the common submit wrapper, as that's much closer to other
> > kernel APIs, e.g.
> >
> > int aio_kernel_read(struct kiocb *iocb, struct file *file,
> >                 struct iov_iter *iter, loff_t off,
> >                 void (*complete)(struct kiocb *iocb, long res));
> > int aio_kernel_write(struct kiocb *iocb, struct file *file,
> >                 struct iov_iter *iter, loff_t off,
> >                 void (*complete)(struct kiocb *iocb, long res));
> 
> It is like style of sync APIs, looks submit/complete is common
> for async APIs, like io_submit().

io_submit is a fairly horrible API.  While POSIX AIO isn't too much
better, it does have separate APIs for read/write.  Given that there
isn't much code shared, I'd keep an API that feels familiar.


^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH v2 3/4] block: loop: introduce 'use_aio' sysfs file
  2015-01-13 15:44 ` [PATCH v2 3/4] block: loop: introduce 'use_aio' sysfs file Ming Lei
  2015-01-25 13:35   ` Christoph Hellwig
@ 2015-01-26 17:57   ` Jeff Moyer
  1 sibling, 0 replies; 25+ messages in thread
From: Jeff Moyer @ 2015-01-26 17:57 UTC (permalink / raw)
  To: Ming Lei
  Cc: linux-kernel, Dave Kleikamp, Jens Axboe, Zach Brown,
	Christoph Hellwig, Maxim Patlasov, Andrew Morton, Alexander Viro,
	Benjamin LaHaise

Ming Lei <ming.lei@canonical.com> writes:

> So that users can control if kernel aio is used to submit I/O.

How would a user know to choose aio or, um, /not/ aio?  At the very
least, documentation is required for this.  I'd rather see the option
disappear completely, though.

Cheers,
Jeff

>
> Signed-off-by: Ming Lei <ming.lei@canonical.com>
> ---
>  drivers/block/loop.c |   33 +++++++++++++++++++++++++++++++++
>  drivers/block/loop.h |    1 +
>  2 files changed, 34 insertions(+)
>
> diff --git a/drivers/block/loop.c b/drivers/block/loop.c
> index d1f168b..47af456 100644
> --- a/drivers/block/loop.c
> +++ b/drivers/block/loop.c
> @@ -659,6 +659,38 @@ static ssize_t loop_attr_partscan_show(struct loop_device *lo, char *buf)
>  	return sprintf(buf, "%s\n", partscan ? "1" : "0");
>  }
>  
> +static ssize_t loop_attr_do_show_use_aio(struct device *dev,
> +				struct device_attribute *attr, char *buf)
> +{
> +	struct gendisk *disk = dev_to_disk(dev);
> +	struct loop_device *lo = disk->private_data;
> +
> +	return sprintf(buf, "%s\n", lo->use_aio ? "1" : "0");
> +}
> +
> +static ssize_t loop_attr_do_store_use_aio(struct device *dev,
> +		struct device_attribute *attr, const char *buf,
> +		size_t count)
> +{
> +	struct gendisk *disk = dev_to_disk(dev);
> +	struct loop_device *lo = disk->private_data;
> +	int err;
> +	unsigned long v;
> +
> +	err = kstrtoul(buf, 10, &v);
> +	if (err < 0)
> +		return err;
> +	if (v)
> +		lo->use_aio = true;
> +	else
> +		lo->use_aio = false;
> +	return count;
> +}
> +
> +static struct device_attribute loop_attr_use_aio =
> +	__ATTR(use_aio, S_IRUGO | S_IWUSR, loop_attr_do_show_use_aio,
> +			loop_attr_do_store_use_aio);
> +
>  LOOP_ATTR_RO(backing_file);
>  LOOP_ATTR_RO(offset);
>  LOOP_ATTR_RO(sizelimit);
> @@ -671,6 +703,7 @@ static struct attribute *loop_attrs[] = {
>  	&loop_attr_sizelimit.attr,
>  	&loop_attr_autoclear.attr,
>  	&loop_attr_partscan.attr,
> +	&loop_attr_use_aio.attr,
>  	NULL,
>  };
>  
> diff --git a/drivers/block/loop.h b/drivers/block/loop.h
> index 301c27f..15049e9 100644
> --- a/drivers/block/loop.h
> +++ b/drivers/block/loop.h
> @@ -57,6 +57,7 @@ struct loop_device {
>  	struct list_head	write_cmd_head;
>  	struct work_struct	write_work;
>  	bool			write_started;
> +	bool			use_aio;
>  	int			lo_state;
>  	struct mutex		lo_ctl_mutex;

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH v2 3/4] block: loop: introduce 'use_aio' sysfs file
  2015-01-25 13:35   ` Christoph Hellwig
@ 2015-01-27  5:26     ` Ming Lei
  0 siblings, 0 replies; 25+ messages in thread
From: Ming Lei @ 2015-01-27  5:26 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: linux-kernel, Dave Kleikamp, Jens Axboe, Zach Brown,
	Maxim Patlasov, Andrew Morton, Alexander Viro, Benjamin LaHaise,
	Jeff Moyer

On 1/25/15, Christoph Hellwig <hch@infradead.org> wrote:
> On Tue, Jan 13, 2015 at 11:44:47PM +0800, Ming Lei wrote:
>> So that users can control if kernel aio is used to submit I/O.
>
> Why do you want a sysfs flag for this instead of a flags in ->lo_flags
> at setup time?  That's how all other loop-related ABI flags work.

The flag only affects performance and CPU/memory resource utilization,
and I plan to change it to a dio flag.

On Tue, Jan 27, 2015 at 1:57 AM, Jeff Moyer <jmoyer@redhat.com> wrote:
> Ming Lei <ming.lei@canonical.com> writes:
>
>> So that users can control if kernel aio is used to submit I/O.
>
> How would a user know to choose aio or, um, /not/ aio?  At the very
> least, documentation is required for this.  I'd rather see the option
> disappear completely, though.

As the comment says in patch 4/4, using direct I/O can save a lot of
memory/CPU and won't hurt throughput if the I/O comes from a filesystem.
Otherwise, users may choose not to use direct I/O for the sake of
throughput.


Thanks,
Ming Lei

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH v2 1/4] aio: add aio_kernel_() interface
  2015-01-26 17:00       ` Christoph Hellwig
@ 2015-01-27 13:57         ` Ming Lei
  0 siblings, 0 replies; 25+ messages in thread
From: Ming Lei @ 2015-01-27 13:57 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Linux Kernel Mailing List, Dave Kleikamp, Jens Axboe, Zach Brown,
	Maxim Patlasov, Andrew Morton, Alexander Viro, Benjamin LaHaise,
	Linux FS Devel, open list:AIO

On Tue, Jan 27, 2015 at 1:00 AM, Christoph Hellwig <hch@infradead.org> wrote:
> On Tue, Jan 27, 2015 at 12:18:23AM +0800, Ming Lei wrote:
>> > Also it might make sense to just offer aio_kernel_read/write interfaces
>> > instead of the common submit wrapper, as that's much closer to other
>> > kernel APIs, e.g.
>> >
>> > int aio_kernel_read(struct kiocb *iocb, struct file *file,
>> >                 struct iov_iter *iter, loff_t off,
>> >                 void (*complete)(struct kiocb *iocb, long res));
>> > int aio_kernel_write(struct kiocb *iocb, struct file *file,
>> >                 struct iov_iter *iter, loff_t off,
>> >                 void (*complete)(struct kiocb *iocb, long res));
>>
>> It is like style of sync APIs, looks submit/complete is common
>> for async APIs, like io_submit().
>
> io_submit is a fairly horrible API.  While Posix AIO isn't too much
> better it does have separate APIs for read/write.  Given that there
> isn't much code shared I'd keep an API that feels familar.

But from a performance view, the two APIs aio_kernel_read()
and aio_kernel_write() aren't good because both need 5
parameters, which can't all be held in registers, and they are
often called from a single thread context, unlike synchronous
read() and write().

Also, aio_kernel_init_rw() can be moved to a header file as an inline
function, so each submission needs fewer parameters.
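
To make that concrete, here is only a sketch (the field names are
illustrative; the final layout depends on the kiocb rework):

static inline void aio_kernel_init_rw(struct kiocb *iocb, struct file *filp,
				      void (*complete)(struct kiocb *, long),
				      void *data)
{
	/* one-shot setup absorbs the arguments that never change */
	iocb->ki_filp = filp;
	iocb->ki_complete = complete;	/* illustrative field name */
	iocb->private = data;
}

static inline ssize_t aio_kernel_read(struct kiocb *iocb,
				      struct iov_iter *iter, loff_t off)
{
	/* the per-call state now fits in three argument registers */
	iocb->ki_pos = off;
	return iocb->ki_filp->f_op->read_iter(iocb, iter);
}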

Thanks,
Ming Lei

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH v2 2/4] fd/direct-io: introduce should_dirty for kernel aio
  2015-01-25 13:34   ` Christoph Hellwig
@ 2015-01-27 16:05     ` Ming Lei
  0 siblings, 0 replies; 25+ messages in thread
From: Ming Lei @ 2015-01-27 16:05 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: linux-kernel, Dave Kleikamp, Jens Axboe, Zach Brown,
	Maxim Patlasov, Andrew Morton, Alexander Viro, Benjamin LaHaise,
	Omar Sandoval, linux-fsdevel

On 1/25/15, Christoph Hellwig <hch@infradead.org> wrote:
> On Tue, Jan 13, 2015 at 11:44:46PM +0800, Ming Lei wrote:
>>
>> -	if (dio->is_async && dio->rw == READ)
>> +	if (dio->is_async && dio->rw == READ && dio->should_dirty)
>>  		bio_set_pages_dirty(bio);
>>
>>  	if (sdio->submit_io)
>> @@ -463,13 +464,14 @@ static int dio_bio_complete(struct dio *dio, struct
>> bio *bio)
>>  	if (!uptodate)
>>  		dio->io_error = -EIO;
>>
>> -	if (dio->is_async && dio->rw == READ) {
>> +	if (dio->is_async && dio->rw == READ && dio->should_dirty) {
>
> I'd rather have a ->should_dirty flag that means we need to call
> bio_check_pages_dirty, and set that either if we have a kernel
> iovec/bvec or if dio->is_async && dio->rw == READ.

It may not work for loop, because the page may not become dirty
until the read dio is completed; the flag just means the caller
handles the page dirtying entirely when the I/O comes from the kernel.

>
> But why would we even set this if writing from an iovec?

Or should we just rename it to ->from_user?

>>  		bio_check_pages_dirty(bio);	/* transfers ownership */
>>  	} else {
>>  		bio_for_each_segment_all(bvec, bio, i) {
>>  			struct page *page = bvec->bv_page;
>>
>> -			if (dio->rw == READ && !PageCompound(page))
>> +			if (dio->rw == READ && !PageCompound(page) &&
>> +					dio->should_dirty)
>>  				set_page_dirty_lock(page);
>
> And this hunk could also use some explanation.

This is the same thing: just don't handle page dirtying for kernel dio pages.
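
For illustration, the loop side could then dirty the pages itself in
its completion callback, roughly like below (struct and function names
are hypothetical, not the actual patch; the dirtying context also
needs care in real code):

static void lo_rw_aio_complete(struct kiocb *iocb, long ret)
{
	/* assumes loop_cmd embeds the kiocb used for submission */
	struct loop_cmd *cmd = container_of(iocb, struct loop_cmd, iocb);
	struct request *rq = cmd->rq;
	struct req_iterator iter;
	struct bio_vec bvec;

	/* the submitter, not fs/direct-io.c, dirties the read pages */
	if (ret >= 0 && rq_data_dir(rq) == READ)
		rq_for_each_segment(bvec, rq, iter)
			set_page_dirty(bvec.bv_page);

	blk_mq_complete_request(rq);
}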

Thanks,
Ming Lei

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH v2 1/4] aio: add aio_kernel_() interface
  2015-01-26 16:18       ` Ming Lei
@ 2015-01-27 17:59         ` Christoph Hellwig
  0 siblings, 0 replies; 25+ messages in thread
From: Christoph Hellwig @ 2015-01-27 17:59 UTC (permalink / raw)
  To: Ming Lei
  Cc: Christoph Hellwig, Linux Kernel Mailing List, Dave Kleikamp,
	Jens Axboe, Zach Brown, Maxim Patlasov, Andrew Morton,
	Alexander Viro, Benjamin LaHaise, Linux FS Devel, open list:AIO

On Tue, Jan 27, 2015 at 12:18:23AM +0800, Ming Lei wrote:
> > Why do we keep these two separate?  Especially having the iov passed
> 
> No special meaning, it just follows the previous patches. :-)
> 
> But one benefit is that we can separate the one-shot
> initialization from submission; at least filp/complete/ki_ctx can be
> set during initialization.

FYI, I've posted another approach to async kernel reads/writes in the
"[RFC] split struct kiocb" series.  I thought I had you on Cc, but I
messed it up.

This keeps the separate init function, but it still feels a bit
pointless to me.


^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH v2 4/4] block: loop: support to submit I/O via kernel aio based
  2015-01-13 15:44 ` [PATCH v2 4/4] block: loop: support to submit I/O via kernel aio based Ming Lei
  2015-01-25 13:40   ` Christoph Hellwig
@ 2015-03-18 18:28   ` Maxim Patlasov
  2015-03-19  2:57     ` Ming Lei
  1 sibling, 1 reply; 25+ messages in thread
From: Maxim Patlasov @ 2015-03-18 18:28 UTC (permalink / raw)
  To: Ming Lei
  Cc: linux-kernel, Dave Kleikamp, Jens Axboe, Zach Brown,
	Christoph Hellwig, Andrew Morton, Alexander Viro,
	Benjamin LaHaise

On 01/13/2015 07:44 AM, Ming Lei wrote:
> Part of the patch is based on Dave's previous post.
>
> This patch submits I/O to the fs via kernel aio, and we
> can obtain the following benefits:
>
> 	- double caching in both the loop file system and the backing
> 	file is avoided
> 	- context switches are decreased a lot, and as a result CPU
> 	utilization is decreased
> 	- cached memory is decreased a lot
>
> One main side effect is that throughput is decreased when
> accessing the raw loop block (not through a filesystem) with kernel aio.
>
> This patch has passed the xfstests suite (./check -g auto); both
> test and scratch devices are loop devices, and the file system is ext4.
>
> The results of two fio tests follow:
>
> 1. fio test inside ext4 file system over loop block
> 1) How to run
> 	- linux kernel base: 3.19.0-rc3-next-20150108(loop-mq merged)
> 	- loop over SSD image 1 in ext4
> 	- linux psync, 16 jobs, size 200M, ext4 over loop block
> 	- test result: IOPS from fio output
>
> 2) Throughput result:
> 	-------------------------------------------------------------
> 	test cases          |randread   |read   |randwrite  |write  |
> 	-------------------------------------------------------------
> 	base                |16799      |59508  |31059      |58829
> 	-------------------------------------------------------------
> 	base+kernel aio     |15480      |64453  |30187      |57222
> 	-------------------------------------------------------------

Ming, it's important to understand the overhead of the aio_kernel_()
implementation. So could you please add test results for a raw SSD device
to the table above next time (in v3 of your patches)?

Jens, if you have some fast storage at hand, could you please measure
IOPS for Ming's patches vs. a raw block device -- to ensure that the
patches do not impose too low a limit on performance?

Thanks,
Maxim

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH v2 4/4] block: loop: support to submit I/O via kernel aio based
  2015-03-18 18:28   ` Maxim Patlasov
@ 2015-03-19  2:57     ` Ming Lei
  2015-03-19 16:37       ` Maxim Patlasov
  0 siblings, 1 reply; 25+ messages in thread
From: Ming Lei @ 2015-03-19  2:57 UTC (permalink / raw)
  To: Maxim Patlasov
  Cc: Linux Kernel Mailing List, Dave Kleikamp, Jens Axboe, Zach Brown,
	Christoph Hellwig, Andrew Morton, Alexander Viro,
	Benjamin LaHaise

On Thu, Mar 19, 2015 at 2:28 AM, Maxim Patlasov <mpatlasov@parallels.com> wrote:
> On 01/13/2015 07:44 AM, Ming Lei wrote:
>>
>> Part of the patch is based on Dave's previous post.
>>
>> This patch submits I/O to the fs via kernel aio, and we
>> can obtain the following benefits:
>>
>>         - double caching in both the loop file system and the backing
>>         file is avoided
>>         - context switches are decreased a lot, and as a result CPU
>>         utilization is decreased
>>         - cached memory is decreased a lot
>>
>> One main side effect is that throughput is decreased when
>> accessing the raw loop block (not through a filesystem) with kernel aio.
>>
>> This patch has passed the xfstests suite (./check -g auto); both
>> test and scratch devices are loop devices, and the file system is ext4.
>>
>> The results of two fio tests follow:
>>
>> 1. fio test inside ext4 file system over loop block
>> 1) How to run
>>         - linux kernel base: 3.19.0-rc3-next-20150108(loop-mq merged)
>>         - loop over SSD image 1 in ext4
>>         - linux psync, 16 jobs, size 200M, ext4 over loop block
>>         - test result: IOPS from fio output
>>
>> 2) Throughput result:
>>         -------------------------------------------------------------
>>         test cases          |randread   |read   |randwrite  |write  |
>>         -------------------------------------------------------------
>>         base                |16799      |59508  |31059      |58829
>>         -------------------------------------------------------------
>>         base+kernel aio     |15480      |64453  |30187      |57222
>>         -------------------------------------------------------------
>
>
> Ming, it's important to understand the overhead of the aio_kernel_()
> implementation. So could you please add test results for a raw SSD device
> to the table above next time (in v3 of your patches)?

What aio_kernel_() does is just call ->read_iter()/->write_iter(),
so it should not introduce extra overhead.

From a performance view, the effect comes only from switching to
O_DIRECT. With O_DIRECT, double caching can be avoided;
meanwhile both page cache usage and CPU utilization decrease.
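
I.e., the submit side is basically just the following (a sketch only;
the real function also has to handle the sync-completion vs.
-EIOCBQUEUED cases and failures):

static ssize_t aio_kernel_submit(struct kiocb *iocb, int rw,
				 struct iov_iter *iter)
{
	struct file *file = iocb->ki_filp;
	ssize_t ret;

	if (rw == WRITE)
		ret = file->f_op->write_iter(iocb, iter);
	else
		ret = file->f_op->read_iter(iocb, iter);

	/* -EIOCBQUEUED means the completion callback runs later */
	return ret;
}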

V3 needs to be rewritten against the recent vfs changes (already in
-next) splitting struct kiocb, and thanks to Christoph's patches,
aio_kernel_() can be implemented just inside loop and may
become a bit simpler.

>
> Jens, if you have some fast storage at hand, could you please measure IOPS
> for Ming's patches vs. a raw block device -- to ensure that the patches do
> not impose too low a limit on performance?
>
> Thanks,
> Maxim

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH v2 4/4] block: loop: support to submit I/O via kernel aio based
  2015-03-19  2:57     ` Ming Lei
@ 2015-03-19 16:37       ` Maxim Patlasov
  2015-03-20  5:27         ` Ming Lei
  0 siblings, 1 reply; 25+ messages in thread
From: Maxim Patlasov @ 2015-03-19 16:37 UTC (permalink / raw)
  To: Ming Lei
  Cc: Linux Kernel Mailing List, Dave Kleikamp, Jens Axboe, Zach Brown,
	Christoph Hellwig, Andrew Morton, Alexander Viro,
	Benjamin LaHaise

On 03/18/2015 07:57 PM, Ming Lei wrote:
> On Thu, Mar 19, 2015 at 2:28 AM, Maxim Patlasov <mpatlasov@parallels.com> wrote:
>> On 01/13/2015 07:44 AM, Ming Lei wrote:
>>> Part of the patch is based on Dave's previous post.
>>>
>>> This patch submits I/O to the fs via kernel aio, and we
>>> can obtain the following benefits:
>>>
>>>          - double caching in both the loop file system and the backing
>>>          file is avoided
>>>          - context switches are decreased a lot, and as a result CPU
>>>          utilization is decreased
>>>          - cached memory is decreased a lot
>>>
>>> One main side effect is that throughput is decreased when
>>> accessing the raw loop block (not through a filesystem) with kernel aio.
>>>
>>> This patch has passed the xfstests suite (./check -g auto); both
>>> test and scratch devices are loop devices, and the file system is ext4.
>>>
>>> The results of two fio tests follow:
>>>
>>> 1. fio test inside ext4 file system over loop block
>>> 1) How to run
>>>          - linux kernel base: 3.19.0-rc3-next-20150108(loop-mq merged)
>>>          - loop over SSD image 1 in ext4
>>>          - linux psync, 16 jobs, size 200M, ext4 over loop block
>>>          - test result: IOPS from fio output
>>>
>>> 2) Throughput result:
>>>          -------------------------------------------------------------
>>>          test cases          |randread   |read   |randwrite  |write  |
>>>          -------------------------------------------------------------
>>>          base                |16799      |59508  |31059      |58829
>>>          -------------------------------------------------------------
>>>          base+kernel aio     |15480      |64453  |30187      |57222
>>>          -------------------------------------------------------------
>>
>> Ming, it's important to understand the overhead of the aio_kernel_()
>> implementation. So could you please add test results for a raw SSD device
>> to the table above next time (in v3 of your patches)?
> What aio_kernel_() does is just call ->read_iter()/->write_iter(),
> so it should not introduce extra overhead.
>
> From a performance view, the effect comes only from switching to
> O_DIRECT. With O_DIRECT, double caching can be avoided;
> meanwhile both page cache usage and CPU utilization decrease.

The way you reused the loop_queue_rq() --> queue_work() functionality
(added earlier, by commit b5dd2f604) may affect the performance of O_DIRECT
operations. This can easily be demonstrated on a ram-drive, but measurements
on real storage h/w would be more convincing.
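
The pattern I mean is roughly the following (simplified; the real code
distinguishes reads from writes and batches writes):

static int loop_queue_rq(struct blk_mq_hw_ctx *hctx,
			 const struct blk_mq_queue_data *bd)
{
	struct loop_cmd *cmd = blk_mq_rq_to_pdu(bd->rq);

	blk_mq_start_request(bd->rq);

	/* every request takes a trip through a workqueue, i.e. an
	 * extra context switch, before the actual submission */
	queue_work(loop_wq, &cmd->work);

	return BLK_MQ_RQ_QUEUE_OK;
}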

Btw, when you wrote "linux psync, 16 jobs, size 200M, ext4 over loop
block" -- does that mean there were 16 userspace threads submitting I/O
concurrently? If so, a throughput comparison for a single-job test would
also be useful to look at.

Thanks,
Maxim

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH v2 4/4] block: loop: support to submit I/O via kernel aio based
  2015-03-19 16:37       ` Maxim Patlasov
@ 2015-03-20  5:27         ` Ming Lei
  0 siblings, 0 replies; 25+ messages in thread
From: Ming Lei @ 2015-03-20  5:27 UTC (permalink / raw)
  To: Maxim Patlasov
  Cc: Linux Kernel Mailing List, Dave Kleikamp, Jens Axboe, Zach Brown,
	Christoph Hellwig, Andrew Morton, Alexander Viro,
	Benjamin LaHaise

On Fri, Mar 20, 2015 at 12:37 AM, Maxim Patlasov
<mpatlasov@parallels.com> wrote:
> On 03/18/2015 07:57 PM, Ming Lei wrote:
>>
>> On Thu, Mar 19, 2015 at 2:28 AM, Maxim Patlasov <mpatlasov@parallels.com>
>> wrote:
>>>
>>> On 01/13/2015 07:44 AM, Ming Lei wrote:
>>>>
>>>> Part of the patch is based on Dave's previous post.
>>>>
>>>> This patch submits I/O to the fs via kernel aio, and we
>>>> can obtain the following benefits:
>>>>
>>>>          - double caching in both the loop file system and the backing
>>>>          file is avoided
>>>>          - context switches are decreased a lot, and as a result CPU
>>>>          utilization is decreased
>>>>          - cached memory is decreased a lot
>>>>
>>>> One main side effect is that throughput is decreased when
>>>> accessing the raw loop block (not through a filesystem) with kernel aio.
>>>>
>>>> This patch has passed the xfstests suite (./check -g auto); both
>>>> test and scratch devices are loop devices, and the file system is ext4.
>>>>
>>>> The results of two fio tests follow:
>>>>
>>>> 1. fio test inside ext4 file system over loop block
>>>> 1) How to run
>>>>          - linux kernel base: 3.19.0-rc3-next-20150108(loop-mq merged)
>>>>          - loop over SSD image 1 in ext4
>>>>          - linux psync, 16 jobs, size 200M, ext4 over loop block
>>>>          - test result: IOPS from fio output
>>>>
>>>> 2) Throughput result:
>>>>          -------------------------------------------------------------
>>>>          test cases          |randread   |read   |randwrite  |write  |
>>>>          -------------------------------------------------------------
>>>>          base                |16799      |59508  |31059      |58829
>>>>          -------------------------------------------------------------
>>>>          base+kernel aio     |15480      |64453  |30187      |57222
>>>>          -------------------------------------------------------------
>>>
>>>
>>> Ming, it's important to understand the overhead of the aio_kernel_()
>>> implementation. So could you please add test results for a raw SSD
>>> device to the table above next time (in v3 of your patches)?
>>
>> What aio_kernel_() does is just call ->read_iter()/->write_iter(),
>> so it should not introduce extra overhead.
>>
>> From a performance view, the effect comes only from switching to
>> O_DIRECT. With O_DIRECT, double caching can be avoided;
>> meanwhile both page cache usage and CPU utilization decrease.
>
>
> The way you reused the loop_queue_rq() --> queue_work() functionality
> (added earlier, by commit b5dd2f604) may affect the performance of O_DIRECT
> operations. This can easily be demonstrated on a ram-drive, but measurements
> on real storage h/w would be more convincing.

The test data in the commit log is from real storage h/w, a single
SATA 3.0Gbps drive.

blk-mq may affect performance a bit on a ram-drive too, which can be
demonstrated with a null_blk test (blk_mq vs. bio), but that doesn't
look like a big deal since it isn't a real use case.

>
> Btw, when you wrote "linux psync, 16 jobs, size 200M, ext4 over loop block"
> -- does that mean there were 16 userspace threads submitting I/O
> concurrently? If so, a throughput comparison for a single-job test would
> also be useful to look at.

Yes, it is the 'numjobs' setting in the fio config file; with 'sync'
I/O, throughput can only be raised by increasing the number of I/O threads.

No problem, a throughput comparison for a single job will be provided in V3.
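
For reference, the job file is essentially the following (reconstructed
from the description above; the block size is a guess, since only psync,
16 jobs and size 200M were stated):

; run once each with rw=read, randread, write, randwrite
[loop-ext4-test]
ioengine=psync
rw=randread
bs=4k
size=200M
numjobs=16
directory=/mnt/loop-ext4
group_reporting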

Thanks,
Ming Lei

^ permalink raw reply	[flat|nested] 25+ messages in thread

end of thread, other threads:[~2015-03-20  5:27 UTC | newest]

Thread overview: 25+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2015-01-13 15:44 [PATCH v2 0/4] block & aio: improve loop with kernel aio Ming Lei
2015-01-13 15:44 ` [PATCH v2 1/4] aio: add aio_kernel_() interface Ming Lei
2015-01-13 15:44   ` Ming Lei
2015-01-25 13:31   ` Christoph Hellwig
2015-01-26 16:18     ` Ming Lei
2015-01-26 16:18       ` Ming Lei
2015-01-26 17:00       ` Christoph Hellwig
2015-01-27 13:57         ` Ming Lei
2015-01-27 17:59       ` Christoph Hellwig
2015-01-13 15:44 ` [PATCH v2 2/4] fd/direct-io: introduce should_dirty for kernel aio Ming Lei
2015-01-25 13:34   ` Christoph Hellwig
2015-01-27 16:05     ` Ming Lei
2015-01-13 15:44 ` [PATCH v2 3/4] block: loop: introduce 'use_aio' sysfs file Ming Lei
2015-01-25 13:35   ` Christoph Hellwig
2015-01-27  5:26     ` Ming Lei
2015-01-26 17:57   ` Jeff Moyer
2015-01-13 15:44 ` [PATCH v2 4/4] block: loop: support to submit I/O via kernel aio based Ming Lei
2015-01-25 13:40   ` Christoph Hellwig
2015-03-18 18:28   ` Maxim Patlasov
2015-03-19  2:57     ` Ming Lei
2015-03-19 16:37       ` Maxim Patlasov
2015-03-20  5:27         ` Ming Lei
2015-01-13 16:23 ` [PATCH v2 0/4] block & aio: improve loop with kernel aio Christoph Hellwig
2015-01-14 10:17   ` Ming Lei
