linux-kernel.vger.kernel.org archive mirror
* [PATCH v4 0/5] block: loop: improve loop with AIO
@ 2015-05-29  6:35 Ming Lei
  2015-05-29  6:35 ` [PATCH v4 1/5] fs: kiocb: introduce IOCB_DONT_DIRTY_PAGE flag for direct IO Ming Lei
                   ` (5 more replies)
  0 siblings, 6 replies; 11+ messages in thread
From: Ming Lei @ 2015-05-29  6:35 UTC (permalink / raw)
  To: linux-kernel, Dave Kleikamp
  Cc: Jens Axboe, Zach Brown, Christoph Hellwig, Maxim Patlasov,
	Andrew Morton, Alexander Viro, Tejun Heo, Dave Chinner

Hi Guys,

There are three main advantages to using direct I/O and AIO
when reading from / writing to loop's backing file:

1) double caching is avoided, so memory usage decreases
a lot

2) unlike user-space direct I/O, there is no cost for
pinning pages

3) context switches are avoided, which helps throughput
- with buffered reads of the backing file, good random I/O
throughput is often obtained only when requests are submitted
concurrently from many tasks; for sequential I/O, however, most
requests can be served from the page cache, so concurrent
submission mostly introduces unnecessary context switches without
improving throughput much. There was a discussion[1] about using
non-blocking I/O to address this problem for applications.
- with direct I/O and AIO, concurrent submission can be
avoided while random read throughput is not affected

So this patchset tries to improve loop via AIO: memory usage
decreases by about 45% (see the detailed data in the commit log
of patch 5), and I/O throughput is not hurt either.

V4:
	- add a detailed commit log for 'use kthread_work'
	- allow user space (sysfs, losetup) to decide whether dio/aio
	is used, as suggested by Christoph and Dave Chinner
	- only use dio if the backing block device's minimum I/O size
	is 512, as pointed out by Dave Chinner and Christoph

V3:
	- based on Al's iov_iter work and Christoph's kiocb changes
	- use kthread_work
	- introduce IOCB_DONT_DIRTY_PAGE flag
	- set QUEUE_FLAG_NOMERGES for loop's request queue
V2:
	- remove 'extra' parameter to aio_kernel_alloc()
	- try to avoid memory allocation inside the queue_rq callback
	- introduce a 'use_mq' sysfs file for enabling or disabling kernel aio
V1:
	- link:
                http://marc.info/?t=140803157700004&r=1&w=2
	- improve failure path in aio_kernel_submit()



^ permalink raw reply	[flat|nested] 11+ messages in thread

* [PATCH v4 1/5] fs: kiocb: introduce IOCB_DONT_DIRTY_PAGE flag for direct IO
  2015-05-29  6:35 [PATCH v4 0/5] block: loop: improve loop with AIO Ming Lei
@ 2015-05-29  6:35 ` Ming Lei
  2015-06-05 15:03   ` Christoph Hellwig
  2015-05-29  6:35 ` [PATCH v4 2/5] block: loop: set QUEUE_FLAG_NOMERGES for request queue of loop Ming Lei
                   ` (4 subsequent siblings)
  5 siblings, 1 reply; 11+ messages in thread
From: Ming Lei @ 2015-05-29  6:35 UTC (permalink / raw)
  To: linux-kernel, Dave Kleikamp
  Cc: Jens Axboe, Zach Brown, Christoph Hellwig, Maxim Patlasov,
	Andrew Morton, Alexander Viro, Tejun Heo, Dave Chinner, Ming Lei

When direct I/O is submitted from the kernel, it is often
unnecessary to dirty the pages. In the loop case, for example,
page dirtying has already been taken care of on the upper
filesystem (over loop) side, so the pages don't need to be
dirtied again.

So this patch introduces an IOCB_DONT_DIRTY_PAGE flag for direct
I/O. Other ITER_BVEC and ITER_KVEC cases might benefit from it
too, since it can be used by other dio users that have different
rules for dealing with page dirtying.

The patch is based on Dave's previous patch.

Cc: Dave Kleikamp <dave.kleikamp@oracle.com>
Signed-off-by: Ming Lei <ming.lei@canonical.com>
---
 fs/direct-io.c     | 9 ++++++---
 include/linux/fs.h | 1 +
 2 files changed, 7 insertions(+), 3 deletions(-)

diff --git a/fs/direct-io.c b/fs/direct-io.c
index 745d234..78964c0 100644
--- a/fs/direct-io.c
+++ b/fs/direct-io.c
@@ -120,6 +120,7 @@ struct dio {
 	int page_errors;		/* errno from get_user_pages() */
 	int is_async;			/* is IO async ? */
 	bool defer_completion;		/* defer AIO completion to workqueue? */
+	bool should_dirty;		/* if pages should be dirtied */
 	int io_error;			/* IO error in completion path */
 	unsigned long refcount;		/* direct_io_worker() and bios */
 	struct bio *bio_list;		/* singly linked via bi_private */
@@ -393,7 +394,7 @@ static inline void dio_bio_submit(struct dio *dio, struct dio_submit *sdio)
 	dio->refcount++;
 	spin_unlock_irqrestore(&dio->bio_lock, flags);
 
-	if (dio->is_async && dio->rw == READ)
+	if (dio->is_async && dio->rw == READ && dio->should_dirty)
 		bio_set_pages_dirty(bio);
 
 	if (sdio->submit_io)
@@ -464,13 +465,14 @@ static int dio_bio_complete(struct dio *dio, struct bio *bio)
 	if (!uptodate)
 		dio->io_error = -EIO;
 
-	if (dio->is_async && dio->rw == READ) {
+	if (dio->is_async && dio->rw == READ && dio->should_dirty) {
 		bio_check_pages_dirty(bio);	/* transfers ownership */
 	} else {
 		bio_for_each_segment_all(bvec, bio, i) {
 			struct page *page = bvec->bv_page;
 
-			if (dio->rw == READ && !PageCompound(page))
+			if (dio->rw == READ && !PageCompound(page) &&
+					dio->should_dirty)
 				set_page_dirty_lock(page);
 			page_cache_release(page);
 		}
@@ -1217,6 +1219,7 @@ do_blockdev_direct_IO(struct kiocb *iocb, struct inode *inode,
 	spin_lock_init(&dio->bio_lock);
 	dio->refcount = 1;
 
+	dio->should_dirty = !(iocb->ki_flags & IOCB_DONT_DIRTY_PAGE);
 	sdio.iter = iter;
 	sdio.final_block_in_request =
 		(offset + iov_iter_count(iter)) >> blkbits;
diff --git a/include/linux/fs.h b/include/linux/fs.h
index ff4135e..54027ad 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -316,6 +316,7 @@ struct writeback_control;
 #define IOCB_EVENTFD		(1 << 0)
 #define IOCB_APPEND		(1 << 1)
 #define IOCB_DIRECT		(1 << 2)
+#define IOCB_DONT_DIRTY_PAGE	(1 << 3)
 
 struct kiocb {
 	struct file		*ki_filp;
-- 
1.9.1



* [PATCH v4 2/5] block: loop: set QUEUE_FLAG_NOMERGES for request queue of loop
  2015-05-29  6:35 [PATCH v4 0/5] block: loop: improve loop with AIO Ming Lei
  2015-05-29  6:35 ` [PATCH v4 1/5] fs: kiocb: introduce IOCB_DONT_DIRTY_PAGE flag for direct IO Ming Lei
@ 2015-05-29  6:35 ` Ming Lei
  2015-05-29  6:35 ` [PATCH v4 3/5] block: loop: use kthread_work Ming Lei
                   ` (3 subsequent siblings)
  5 siblings, 0 replies; 11+ messages in thread
From: Ming Lei @ 2015-05-29  6:35 UTC (permalink / raw)
  To: linux-kernel, Dave Kleikamp
  Cc: Jens Axboe, Zach Brown, Christoph Hellwig, Maxim Patlasov,
	Andrew Morton, Alexander Viro, Tejun Heo, Dave Chinner, Ming Lei

It doesn't make sense to enable merging, because the I/O
submitted to the backing file is handled page by page.

Signed-off-by: Ming Lei <ming.lei@canonical.com>
---
 drivers/block/loop.c | 6 ++++++
 1 file changed, 6 insertions(+)

diff --git a/drivers/block/loop.c b/drivers/block/loop.c
index 40580dc..10cc583 100644
--- a/drivers/block/loop.c
+++ b/drivers/block/loop.c
@@ -1595,6 +1595,12 @@ static int loop_add(struct loop_device **l, int i)
 	}
 	lo->lo_queue->queuedata = lo;
 
+	/*
+	 * It doesn't make sense to enable merge because the I/O
+	 * submitted to backing file is handled page by page.
+	 */
+	queue_flag_set_unlocked(QUEUE_FLAG_NOMERGES, lo->lo_queue);
+
 	INIT_LIST_HEAD(&lo->write_cmd_head);
 	INIT_WORK(&lo->write_work, loop_queue_write_work);
 
-- 
1.9.1



* [PATCH v4 3/5] block: loop: use kthread_work
  2015-05-29  6:35 [PATCH v4 0/5] block: loop: improve loop with AIO Ming Lei
  2015-05-29  6:35 ` [PATCH v4 1/5] fs: kiocb: introduce IOCB_DONT_DIRTY_PAGE flag for direct IO Ming Lei
  2015-05-29  6:35 ` [PATCH v4 2/5] block: loop: set QUEUE_FLAG_NOMERGES for request queue of loop Ming Lei
@ 2015-05-29  6:35 ` Ming Lei
  2015-05-29  6:35 ` [PATCH v4 4/5] block: loop: prepare for supporting direct IO Ming Lei
                   ` (2 subsequent siblings)
  5 siblings, 0 replies; 11+ messages in thread
From: Ming Lei @ 2015-05-29  6:35 UTC (permalink / raw)
  To: linux-kernel, Dave Kleikamp
  Cc: Jens Axboe, Zach Brown, Christoph Hellwig, Maxim Patlasov,
	Andrew Morton, Alexander Viro, Tejun Heo, Dave Chinner, Ming Lei

The following patch will use dio/aio to submit I/O to the backing
file; with that there is no need to schedule I/O concurrently from
work items, so switch to kthread_work, which cuts the context-switch
cost a lot.

For the non-AIO case, a single thread had been used for a very long
time, and it was only converted to a workqueue in v4.0, which has
already caused a performance regression for Fedora live booting. In
the discussion[1], it was noted that even though submitting I/O
concurrently from work items can improve random read throughput, it
may hurt sequential read performance at the same time, so it is
better to restore the single-thread behaviour.

For the upcoming AIO support, if loop ever faces such high
performance requirements, it would be better to use multiple hw
queues with a per-hwq kthread than the current work approach.

[1] http://marc.info/?t=143082678400002&r=1&w=2

Signed-off-by: Ming Lei <ming.lei@canonical.com>
---
 drivers/block/loop.c | 79 ++++++++++++++++------------------------------------
 drivers/block/loop.h | 10 +++----
 2 files changed, 28 insertions(+), 61 deletions(-)

diff --git a/drivers/block/loop.c b/drivers/block/loop.c
index 10cc583..555b895 100644
--- a/drivers/block/loop.c
+++ b/drivers/block/loop.c
@@ -688,6 +688,23 @@ static void loop_config_discard(struct loop_device *lo)
 	queue_flag_set_unlocked(QUEUE_FLAG_DISCARD, q);
 }
 
+static void loop_unprepare_queue(struct loop_device *lo)
+{
+	flush_kthread_worker(&lo->worker);
+	kthread_stop(lo->worker_task);
+}
+
+static int loop_prepare_queue(struct loop_device *lo)
+{
+	init_kthread_worker(&lo->worker);
+	lo->worker_task = kthread_run(kthread_worker_fn,
+			&lo->worker, "loop%d", lo->lo_number);
+	if (IS_ERR(lo->worker_task))
+		return -ENOMEM;
+	set_user_nice(lo->worker_task, MIN_NICE);
+	return 0;
+}
+
 static int loop_set_fd(struct loop_device *lo, fmode_t mode,
 		       struct block_device *bdev, unsigned int arg)
 {
@@ -745,11 +762,8 @@ static int loop_set_fd(struct loop_device *lo, fmode_t mode,
 	size = get_loop_size(lo, file);
 	if ((loff_t)(sector_t)size != size)
 		goto out_putf;
-	error = -ENOMEM;
-	lo->wq = alloc_workqueue("kloopd%d",
-			WQ_MEM_RECLAIM | WQ_HIGHPRI | WQ_UNBOUND, 16,
-			lo->lo_number);
-	if (!lo->wq)
+	error = loop_prepare_queue(lo);
+	if (error)
 		goto out_putf;
 
 	error = 0;
@@ -903,8 +917,7 @@ static int loop_clr_fd(struct loop_device *lo)
 	lo->lo_flags = 0;
 	if (!part_shift)
 		lo->lo_disk->flags |= GENHD_FL_NO_PART_SCAN;
-	destroy_workqueue(lo->wq);
-	lo->wq = NULL;
+	loop_unprepare_queue(lo);
 	mutex_unlock(&lo->lo_ctl_mutex);
 	/*
 	 * Need not hold lo_ctl_mutex to fput backing file.
@@ -1461,23 +1474,7 @@ static int loop_queue_rq(struct blk_mq_hw_ctx *hctx,
 	if (lo->lo_state != Lo_bound)
 		return -EIO;
 
-	if (cmd->rq->cmd_flags & REQ_WRITE) {
-		struct loop_device *lo = cmd->rq->q->queuedata;
-		bool need_sched = true;
-
-		spin_lock_irq(&lo->lo_lock);
-		if (lo->write_started)
-			need_sched = false;
-		else
-			lo->write_started = true;
-		list_add_tail(&cmd->list, &lo->write_cmd_head);
-		spin_unlock_irq(&lo->lo_lock);
-
-		if (need_sched)
-			queue_work(lo->wq, &lo->write_work);
-	} else {
-		queue_work(lo->wq, &cmd->read_work);
-	}
+	queue_kthread_work(&lo->worker, &cmd->work);
 
 	return BLK_MQ_RQ_QUEUE_OK;
 }
@@ -1499,35 +1496,10 @@ static void loop_handle_cmd(struct loop_cmd *cmd)
 	blk_mq_complete_request(cmd->rq);
 }
 
-static void loop_queue_write_work(struct work_struct *work)
-{
-	struct loop_device *lo =
-		container_of(work, struct loop_device, write_work);
-	LIST_HEAD(cmd_list);
-
-	spin_lock_irq(&lo->lo_lock);
- repeat:
-	list_splice_init(&lo->write_cmd_head, &cmd_list);
-	spin_unlock_irq(&lo->lo_lock);
-
-	while (!list_empty(&cmd_list)) {
-		struct loop_cmd *cmd = list_first_entry(&cmd_list,
-				struct loop_cmd, list);
-		list_del_init(&cmd->list);
-		loop_handle_cmd(cmd);
-	}
-
-	spin_lock_irq(&lo->lo_lock);
-	if (!list_empty(&lo->write_cmd_head))
-		goto repeat;
-	lo->write_started = false;
-	spin_unlock_irq(&lo->lo_lock);
-}
-
-static void loop_queue_read_work(struct work_struct *work)
+static void loop_queue_work(struct kthread_work *work)
 {
 	struct loop_cmd *cmd =
-		container_of(work, struct loop_cmd, read_work);
+		container_of(work, struct loop_cmd, work);
 
 	loop_handle_cmd(cmd);
 }
@@ -1539,7 +1511,7 @@ static int loop_init_request(void *data, struct request *rq,
 	struct loop_cmd *cmd = blk_mq_rq_to_pdu(rq);
 
 	cmd->rq = rq;
-	INIT_WORK(&cmd->read_work, loop_queue_read_work);
+	init_kthread_work(&cmd->work, loop_queue_work);
 
 	return 0;
 }
@@ -1601,9 +1573,6 @@ static int loop_add(struct loop_device **l, int i)
 	 */
 	queue_flag_set_unlocked(QUEUE_FLAG_NOMERGES, lo->lo_queue);
 
-	INIT_LIST_HEAD(&lo->write_cmd_head);
-	INIT_WORK(&lo->write_work, loop_queue_write_work);
-
 	disk = lo->lo_disk = alloc_disk(1 << part_shift);
 	if (!disk)
 		goto out_free_queue;
diff --git a/drivers/block/loop.h b/drivers/block/loop.h
index 25e8997..b6c7d21 100644
--- a/drivers/block/loop.h
+++ b/drivers/block/loop.h
@@ -14,7 +14,7 @@
 #include <linux/blk-mq.h>
 #include <linux/spinlock.h>
 #include <linux/mutex.h>
-#include <linux/workqueue.h>
+#include <linux/kthread.h>
 #include <uapi/linux/loop.h>
 
 /* Possible states of device */
@@ -54,12 +54,10 @@ struct loop_device {
 	gfp_t		old_gfp_mask;
 
 	spinlock_t		lo_lock;
-	struct workqueue_struct *wq;
-	struct list_head	write_cmd_head;
-	struct work_struct	write_work;
-	bool			write_started;
 	int			lo_state;
 	struct mutex		lo_ctl_mutex;
+	struct kthread_worker	worker;
+	struct task_struct	*worker_task;
 
 	struct request_queue	*lo_queue;
 	struct blk_mq_tag_set	tag_set;
@@ -67,7 +65,7 @@ struct loop_device {
 };
 
 struct loop_cmd {
-	struct work_struct read_work;
+	struct kthread_work work;
 	struct request *rq;
 	struct list_head list;
 };
-- 
1.9.1



* [PATCH v4 4/5] block: loop: prepare for supporting direct IO
  2015-05-29  6:35 [PATCH v4 0/5] block: loop: improve loop with AIO Ming Lei
                   ` (2 preceding siblings ...)
  2015-05-29  6:35 ` [PATCH v4 3/5] block: loop: use kthread_work Ming Lei
@ 2015-05-29  6:35 ` Ming Lei
  2015-05-29  6:35 ` [PATCH v4 5/5] block: loop: support DIO & AIO Ming Lei
  2015-06-05  8:26 ` [PATCH v4 0/5] block: loop: improve loop with AIO Ming Lei
  5 siblings, 0 replies; 11+ messages in thread
From: Ming Lei @ 2015-05-29  6:35 UTC (permalink / raw)
  To: linux-kernel, Dave Kleikamp
  Cc: Jens Axboe, Zach Brown, Christoph Hellwig, Maxim Patlasov,
	Andrew Morton, Alexander Viro, Tejun Heo, Dave Chinner, Ming Lei

This patch provides two ways to enable direct I/O from user
space:

	- user space (such as losetup) can pass a 'file' that was
	opened with O_DIRECT (or switched to it via fcntl)
	- a sysfs file is provided so that dio tests can be run
	easily

Also, __loop_update_dio() is introduced to check whether direct I/O
can be used with the current loop settings.

Signed-off-by: Ming Lei <ming.lei@canonical.com>
---
 drivers/block/loop.c | 76 ++++++++++++++++++++++++++++++++++++++++++++++++++++
 drivers/block/loop.h |  1 +
 2 files changed, 77 insertions(+)

diff --git a/drivers/block/loop.c b/drivers/block/loop.c
index 555b895..0b1ee2b 100644
--- a/drivers/block/loop.c
+++ b/drivers/block/loop.c
@@ -421,6 +421,44 @@ struct switch_request {
 	struct completion wait;
 };
 
+static void __loop_update_dio(struct loop_device *lo, bool dio)
+{
+	struct file *file = lo->lo_backing_file;
+	struct inode *inode = file->f_mapping->host;
+	bool use_dio;
+
+	/*
+	 * The loop block device's logical block size is 512, so
+	 * for now we support direct I/O only if the backing
+	 * block device's minimum I/O size is 512.
+	 */
+	if (dio) {
+		if (inode->i_sb->s_bdev &&
+			bdev_io_min(inode->i_sb->s_bdev) == 512)
+			use_dio = true;
+		else
+			use_dio = false;
+	} else {
+		use_dio = false;
+	}
+
+	if (lo->use_dio == use_dio)
+		return;
+
+	/* flush dirty pages to avoid stale data from following dio */
+	if (use_dio)
+		vfs_fsync(file, 0);
+
+	blk_mq_freeze_queue(lo->lo_queue);
+	lo->use_dio = use_dio;
+	blk_mq_unfreeze_queue(lo->lo_queue);
+}
+
+static inline void loop_update_dio(struct loop_device *lo)
+{
+	__loop_update_dio(lo, io_is_direct(lo->lo_backing_file));
+}
+
 /*
  * Do the actual switch; called from the BIO completion routine
  */
@@ -441,6 +479,7 @@ static void do_loop_switch(struct loop_device *lo, struct switch_request *p)
 		mapping->host->i_bdev->bd_block_size : PAGE_SIZE;
 	lo->old_gfp_mask = mapping_gfp_mask(mapping);
 	mapping_set_gfp_mask(mapping, lo->old_gfp_mask & ~(__GFP_IO|__GFP_FS));
+	loop_update_dio(lo);
 }
 
 /*
@@ -627,6 +666,41 @@ static ssize_t loop_attr_partscan_show(struct loop_device *lo, char *buf)
 	return sprintf(buf, "%s\n", partscan ? "1" : "0");
 }
 
+static ssize_t loop_attr_do_show_use_dio(struct device *dev,
+				struct device_attribute *attr, char *buf)
+{
+	struct gendisk *disk = dev_to_disk(dev);
+	struct loop_device *lo = disk->private_data;
+
+	return sprintf(buf, "%s\n", lo->use_dio ? "1" : "0");
+}
+
+ssize_t loop_attr_do_store_use_dio(struct device *dev,
+		struct device_attribute *attr, const char *buf,
+		size_t count)
+{
+	struct gendisk *disk = dev_to_disk(dev);
+	struct loop_device *lo = disk->private_data;
+	int err;
+	unsigned long v;
+
+	if (!lo->lo_backing_file)
+		return -ENODEV;
+
+	err = kstrtoul(buf, 10, &v);
+	if (err < 0)
+		return err;
+
+	__loop_update_dio(lo, !!v);
+	if  (lo->use_dio != !!v)
+		return -EINVAL;
+	return count;
+}
+
+static struct device_attribute loop_attr_use_dio =
+	__ATTR(use_dio, S_IRUGO | S_IWUSR, loop_attr_do_show_use_dio,
+			loop_attr_do_store_use_dio);
+
 LOOP_ATTR_RO(backing_file);
 LOOP_ATTR_RO(offset);
 LOOP_ATTR_RO(sizelimit);
@@ -639,6 +713,7 @@ static struct attribute *loop_attrs[] = {
 	&loop_attr_sizelimit.attr,
 	&loop_attr_autoclear.attr,
 	&loop_attr_partscan.attr,
+	&loop_attr_use_dio.attr,
 	NULL,
 };
 
@@ -783,6 +858,7 @@ static int loop_set_fd(struct loop_device *lo, fmode_t mode,
 	if (!(lo_flags & LO_FLAGS_READ_ONLY) && file->f_op->fsync)
 		blk_queue_flush(lo->lo_queue, REQ_FLUSH);
 
+	loop_update_dio(lo);
 	set_capacity(lo->lo_disk, size);
 	bd_set_size(bdev, size << 9);
 	loop_sysfs_init(lo);
diff --git a/drivers/block/loop.h b/drivers/block/loop.h
index b6c7d21..d1de221 100644
--- a/drivers/block/loop.h
+++ b/drivers/block/loop.h
@@ -58,6 +58,7 @@ struct loop_device {
 	struct mutex		lo_ctl_mutex;
 	struct kthread_worker	worker;
 	struct task_struct	*worker_task;
+	bool			use_dio;
 
 	struct request_queue	*lo_queue;
 	struct blk_mq_tag_set	tag_set;
-- 
1.9.1



* [PATCH v4 5/5] block: loop: support DIO & AIO
  2015-05-29  6:35 [PATCH v4 0/5] block: loop: improve loop with AIO Ming Lei
                   ` (3 preceding siblings ...)
  2015-05-29  6:35 ` [PATCH v4 4/5] block: loop: prepare for supporting direct IO Ming Lei
@ 2015-05-29  6:35 ` Ming Lei
  2015-06-05  8:26 ` [PATCH v4 0/5] block: loop: improve loop with AIO Ming Lei
  5 siblings, 0 replies; 11+ messages in thread
From: Ming Lei @ 2015-05-29  6:35 UTC (permalink / raw)
  To: linux-kernel, Dave Kleikamp
  Cc: Jens Axboe, Zach Brown, Christoph Hellwig, Maxim Patlasov,
	Andrew Morton, Alexander Viro, Tejun Heo, Dave Chinner, Ming Lei

There are three main advantages to using direct I/O and AIO
when reading from / writing to loop's backing file:

1) double caching is avoided, so memory usage decreases
a lot

2) unlike user-space direct I/O, there is no cost for
pinning pages

3) context switches are avoided, which helps throughput
- with buffered reads of the backing file, top random I/O
throughput is often obtained only when requests are submitted
concurrently from many tasks; for sequential I/O, however, most
requests can be served from the page cache, so concurrent
submission mostly introduces unnecessary context switches without
improving throughput much. There was a discussion[1] about using
non-blocking I/O to address this problem for applications.
- with direct I/O and AIO, concurrent submission can be
avoided while random read throughput is not affected

My fio test results follow:

1. 16-job fio test inside an ext4 file system over the loop block device
1) How to run
	- linux kernel: 4.1.0-rc2-next-20150506 with the patchset
	- the loop block device is backed by one image file on a HDD
	- fio: psync engine, 16 jobs, size 400M, ext4 over the loop block device
	- test result: IOPS from fio output

2) Throughput result (IOPS):
        -------------------------------------------------------------
        test cases          |randread   |read   |randwrite  |write  |
        -------------------------------------------------------------
        base                |240        |8705   |3763       |20914  |
        -------------------------------------------------------------
        base+loop aio       |242        |9258   |4577       |21451  |
        -------------------------------------------------------------

3) context switches
        - context switches decreased by ~16% with loop aio for randread,
	and by ~33% for read

4) memory usage
	- after these four tests with loop aio: ~10% of memory is in use
	- after these four tests without loop aio: more than 55% of
	memory is in use

2. single-job fio test inside an ext4 file system over the loop block device (for Maxim Patlasov)
1) How to run
	- linux kernel: 4.1.0-rc2-next-20150506 with the patchset
	- the loop block device is backed by one image file on a HDD
	- fio: psync engine, 1 job, size 4000M, ext4 over the loop block device
	- test result: IOPS from fio output

2) Throughput result (IOPS):
        -------------------------------------------------------------
        test cases          |randread   |read   |randwrite  |write  |
        -------------------------------------------------------------
        base                |109        |21180  |4192       |22782  |
        -------------------------------------------------------------
        base+loop aio       |114        |21018  |5404       |22670  |
        -------------------------------------------------------------

3) context switches
        - context switches decreased by ~10% with loop aio for randread,
	and by ~50% for read

4) memory usage
	- after these four tests with loop aio: ~10% of memory is in use
	- after these four tests without loop aio: more than 55% of
	memory is in use

Both the 'context switch' and 'memory usage' data were collected with sar.

[1] https://lwn.net/Articles/612483/
[2] sar graph when running the 16-job fio test over loop without the patchset
http://kernel.ubuntu.com/~ming/block/loop-aio/v3/lo-nonaio.pdf

[3] sar graph when running the 16-job fio test over loop with the patchset
http://kernel.ubuntu.com/~ming/block/loop-aio/v3/lo-aio.pdf

[4] sar graph when running the single-job fio test over loop without the patchset
http://kernel.ubuntu.com/~ming/block/loop-aio/v3/lo-nonaio-1job.pdf

[5] sar graph when running the single-job fio test over loop with the patchset
http://kernel.ubuntu.com/~ming/block/loop-aio/v3/lo-aio-1job.pdf

Signed-off-by: Ming Lei <ming.lei@canonical.com>
---
 drivers/block/loop.c | 72 +++++++++++++++++++++++++++++++++++++++++++++++++---
 drivers/block/loop.h |  2 ++
 2 files changed, 71 insertions(+), 3 deletions(-)

diff --git a/drivers/block/loop.c b/drivers/block/loop.c
index 0b1ee2b..df0f712 100644
--- a/drivers/block/loop.c
+++ b/drivers/block/loop.c
@@ -388,6 +388,65 @@ static int lo_req_flush(struct loop_device *lo, struct request *rq)
 
 	return ret;
 }
+static void lo_rw_aio_complete(struct kiocb *iocb, long ret, long ret2)
+{
+	struct loop_cmd *cmd = container_of(iocb, struct loop_cmd, iocb);
+	struct request *rq = cmd->rq;
+
+	if (ret > 0)
+		ret = 0;
+	else if (ret < 0)
+		ret = -EIO;
+
+	rq->errors = ret;
+	blk_mq_complete_request(rq);
+}
+
+static int lo_rw_aio(struct loop_device *lo, struct loop_cmd *cmd,
+		     loff_t pos, bool rw)
+{
+	struct iov_iter iter;
+	struct bio_vec *bvec;
+	struct bio *bio = cmd->rq->bio;
+	struct file *file = lo->lo_backing_file;
+	int ret;
+
+	/* nomerge for loop request queue */
+	WARN_ON(cmd->rq->bio != cmd->rq->biotail);
+
+	bvec = __bvec_iter_bvec(bio->bi_io_vec, bio->bi_iter);
+	iov_iter_bvec(&iter, ITER_BVEC | rw, bvec,
+		      bio_segments(bio), blk_rq_bytes(cmd->rq));
+
+	cmd->iocb.ki_pos = pos;
+	cmd->iocb.ki_filp = file;
+	cmd->iocb.ki_complete = lo_rw_aio_complete;
+	cmd->iocb.ki_flags = IOCB_DONT_DIRTY_PAGE | IOCB_DIRECT;
+
+	if (rw == WRITE)
+		ret = file->f_op->write_iter(&cmd->iocb, &iter);
+	else
+		ret = file->f_op->read_iter(&cmd->iocb, &iter);
+
+	if (ret != -EIOCBQUEUED)
+		cmd->iocb.ki_complete(&cmd->iocb, ret, 0);
+	return 0;
+}
+
+
+static inline int lo_rw_simple(struct loop_device *lo,
+		struct request *rq, loff_t pos, bool rw)
+{
+	struct loop_cmd *cmd = blk_mq_rq_to_pdu(rq);
+
+	if (cmd->use_aio)
+		return lo_rw_aio(lo, cmd, pos, rw);
+
+	if (rw == WRITE)
+		return lo_write_simple(lo, rq, pos);
+	else
+		return lo_read_simple(lo, rq, pos);
+}
 
 static int do_req_filebacked(struct loop_device *lo, struct request *rq)
 {
@@ -404,13 +463,13 @@ static int do_req_filebacked(struct loop_device *lo, struct request *rq)
 		else if (lo->transfer)
 			ret = lo_write_transfer(lo, rq, pos);
 		else
-			ret = lo_write_simple(lo, rq, pos);
+			ret = lo_rw_simple(lo, rq, pos, WRITE);
 
 	} else {
 		if (lo->transfer)
 			ret = lo_read_transfer(lo, rq, pos);
 		else
-			ret = lo_read_simple(lo, rq, pos);
+			ret = lo_rw_simple(lo, rq, pos, READ);
 	}
 
 	return ret;
@@ -1550,6 +1609,12 @@ static int loop_queue_rq(struct blk_mq_hw_ctx *hctx,
 	if (lo->lo_state != Lo_bound)
 		return -EIO;
 
+	if (lo->use_dio && !lo->transfer &&
+			!(cmd->rq->cmd_flags & (REQ_FLUSH | REQ_DISCARD)))
+		cmd->use_aio = true;
+	else
+		cmd->use_aio = false;
+
 	queue_kthread_work(&lo->worker, &cmd->work);
 
 	return BLK_MQ_RQ_QUEUE_OK;
@@ -1569,7 +1634,8 @@ static void loop_handle_cmd(struct loop_cmd *cmd)
  failed:
 	if (ret)
 		cmd->rq->errors = -EIO;
-	blk_mq_complete_request(cmd->rq);
+	if (!cmd->use_aio || ret)
+		blk_mq_complete_request(cmd->rq);
 }
 
 static void loop_queue_work(struct kthread_work *work)
diff --git a/drivers/block/loop.h b/drivers/block/loop.h
index d1de221..fb2237c 100644
--- a/drivers/block/loop.h
+++ b/drivers/block/loop.h
@@ -69,6 +69,8 @@ struct loop_cmd {
 	struct kthread_work work;
 	struct request *rq;
 	struct list_head list;
+	bool use_aio;           /* use AIO interface to handle I/O */
+	struct kiocb iocb;
 };
 
 /* Support for loadable transfer modules */
-- 
1.9.1



* Re: [PATCH v4 0/5] block: loop: improve loop with AIO
  2015-05-29  6:35 [PATCH v4 0/5] block: loop: improve loop with AIO Ming Lei
                   ` (4 preceding siblings ...)
  2015-05-29  6:35 ` [PATCH v4 5/5] block: loop: support DIO & AIO Ming Lei
@ 2015-06-05  8:26 ` Ming Lei
  5 siblings, 0 replies; 11+ messages in thread
From: Ming Lei @ 2015-06-05  8:26 UTC (permalink / raw)
  To: Linux Kernel Mailing List, Dave Kleikamp
  Cc: Jens Axboe, Zach Brown, Christoph Hellwig, Maxim Patlasov,
	Andrew Morton, Alexander Viro, Tejun Heo, Dave Chinner

Hi Guys,

On Fri, May 29, 2015 at 2:35 PM, Ming Lei <ming.lei@canonical.com> wrote:
> Hi Guys,
>
> There are three main advantages to using direct I/O and AIO
> when reading from / writing to loop's backing file:
>
> 1) double caching is avoided, so memory usage decreases
> a lot
>
> 2) unlike user-space direct I/O, there is no cost for
> pinning pages
>
> 3) context switches are avoided, which helps throughput
> - with buffered reads of the backing file, good random I/O
> throughput is often obtained only when requests are submitted
> concurrently from many tasks; for sequential I/O, however, most
> requests can be served from the page cache, so concurrent
> submission mostly introduces unnecessary context switches without
> improving throughput much. There was a discussion[1] about using
> non-blocking I/O to address this problem for applications.
> - with direct I/O and AIO, concurrent submission can be
> avoided while random read throughput is not affected
>
> So this patchset tries to improve loop via AIO: memory usage
> decreases by about 45% (see the detailed data in the commit log
> of patch 5), and I/O throughput is not hurt either.
>
> V4:
>         - add a detailed commit log for 'use kthread_work'
>         - allow user space (sysfs, losetup) to decide whether dio/aio
>         is used, as suggested by Christoph and Dave Chinner
>         - only use dio if the backing block device's minimum I/O size
>         is 512, as pointed out by Dave Chinner and Christoph

Gentle ping, :-)

>
> V3:
>         - based on Al's iov_iter work and Christoph's kiocb changes
>         - use kthread_work
>         - introduce IOCB_DONT_DIRTY_PAGE flag
>         - set QUEUE_FLAG_NOMERGES for loop's request queue
> V2:
>         - remove 'extra' parameter to aio_kernel_alloc()
>         - try to avoid memory allocation inside the queue_rq callback
>         - introduce a 'use_mq' sysfs file for enabling or disabling kernel aio
> V1:
>         - link:
>                 http://marc.info/?t=140803157700004&r=1&w=2
>         - improve failure path in aio_kernel_submit()
>
>


* Re: [PATCH v4 1/5] fs: kiocb: introduce IOCB_DONT_DIRTY_PAGE flag for direct IO
  2015-05-29  6:35 ` [PATCH v4 1/5] fs: kiocb: introduce IOCB_DONT_DIRTY_PAGE flag for direct IO Ming Lei
@ 2015-06-05 15:03   ` Christoph Hellwig
  2015-06-06  0:42     ` Ming Lei
  0 siblings, 1 reply; 11+ messages in thread
From: Christoph Hellwig @ 2015-06-05 15:03 UTC (permalink / raw)
  To: Ming Lei
  Cc: linux-kernel, Dave Kleikamp, Jens Axboe, Zach Brown,
	Christoph Hellwig, Maxim Patlasov, Andrew Morton, Alexander Viro,
	Tejun Heo, Dave Chinner

As mentioned last time, a big fat NAK for this one.  We generally
do not dirty kernel pages anywhere, so a flag that must always be set
for the default behavior, just to prepare for a highly hypothetical user
that might want to abuse the interface in the future, is a no-go.

Just make the dirtying behavior dependent on the iov_iter type:
dirty for ITER_IOVEC and do nothing for ITER_KVEC and ITER_BVEC.



* Re: [PATCH v4 1/5] fs: kiocb: introduce IOCB_DONT_DIRTY_PAGE flag for direct IO
  2015-06-05 15:03   ` Christoph Hellwig
@ 2015-06-06  0:42     ` Ming Lei
  2015-06-09  6:29       ` Christoph Hellwig
  0 siblings, 1 reply; 11+ messages in thread
From: Ming Lei @ 2015-06-06  0:42 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Linux Kernel Mailing List, Dave Kleikamp, Jens Axboe, Zach Brown,
	Maxim Patlasov, Andrew Morton, Alexander Viro, Tejun Heo,
	Dave Chinner

On Fri, Jun 5, 2015 at 11:03 PM, Christoph Hellwig <hch@infradead.org> wrote:
> As mentioned last time, a big fat NAK for this one.  We generally
> do not dirty kernel pages anywhere, so a flag that must always be set
> for the default behavior, just to prepare for a highly hypothetical user
> that might want to abuse the interface in the future, is a no-go.
>
> Just make the dirtying behavior dependent on the iov_iter type:
> dirty for ITER_IOVEC and do nothing for ITER_KVEC and ITER_BVEC.

Neither ITER_KVEC nor ITER_BVEC implies that the pages are kernel
pages; take loop and swap, for example. That is why this patch is more
flexible, and it won't cause a regression, since the users may have
different dirtying rules, as you mentioned last time.

http://marc.info/?t=143093223200001&r=1&w=2

Thanks,
Ming


* Re: [PATCH v4 1/5] fs: kiocb: introduce IOCB_DONT_DIRTY_PAGE flag for direct IO
  2015-06-06  0:42     ` Ming Lei
@ 2015-06-09  6:29       ` Christoph Hellwig
  2015-06-09  9:58         ` Ming Lei
  0 siblings, 1 reply; 11+ messages in thread
From: Christoph Hellwig @ 2015-06-09  6:29 UTC (permalink / raw)
  To: Ming Lei
  Cc: Christoph Hellwig, Linux Kernel Mailing List, Dave Kleikamp,
	Jens Axboe, Zach Brown, Maxim Patlasov, Andrew Morton,
	Alexander Viro, Tejun Heo, Dave Chinner

On Sat, Jun 06, 2015 at 08:42:33AM +0800, Ming Lei wrote:
> Neither ITER_KVEC nor ITER_BVEC means the pages are kernel pages;
> loop and swap are examples. That is why this patch is more flexible,
> and it won't cause a regression, since users may have different dirtying
> rules, as you mentioned last time.
> 
> http://marc.info/?t=143093223200001&r=1&w=2

Again, we never dirty pages for the caller when doing general purpose
kernel I/O.  For loop we either don't need it (totally in-kernel block
I/O) or the caller takes care of it (direct I/O on the loop device).

The swap code currently only uses ->direct_IO and thus isn't affected
by this flag.  If we use it for writes, the low-level I/O code should not
dirty the page either.

So to repeat myself: for the current state of affairs adding a flag
that every sensible user has to set is a horrible interface.  If for
some unforeseen reason we'll need the flag later on, it should have
reverse polarity, and only be added when needed.


* Re: [PATCH v4 1/5] fs: kiocb: introduce IOCB_DONT_DIRTY_PAGE flag for direct IO
  2015-06-09  6:29       ` Christoph Hellwig
@ 2015-06-09  9:58         ` Ming Lei
  0 siblings, 0 replies; 11+ messages in thread
From: Ming Lei @ 2015-06-09  9:58 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Linux Kernel Mailing List, Dave Kleikamp, Jens Axboe, Zach Brown,
	Maxim Patlasov, Andrew Morton, Alexander Viro, Tejun Heo,
	Dave Chinner

On Tue, Jun 9, 2015 at 2:29 PM, Christoph Hellwig <hch@infradead.org> wrote:
> On Sat, Jun 06, 2015 at 08:42:33AM +0800, Ming Lei wrote:
>> Neither ITER_KVEC nor ITER_BVEC means the pages are kernel pages;
>> loop and swap are examples. That is why this patch is more flexible,
>> and it won't cause a regression, since users may have different dirtying
>> rules, as you mentioned last time.
>>
>> http://marc.info/?t=143093223200001&r=1&w=2
>
> Again, we never dirty pages for the caller when doing general purpose
> kernel I/O.  For loop we either don't need it (totally in-kernel block
> I/O) or the caller takes care of it (direct I/O on the loop device).
>
> The swap code currently only uses ->direct_IO and thus isn't affected
> by this flag.  If we use it for write the low-level I/O code should not
> dirty the page either.
>
> So to repeat myself: for the current state of affairs adding a flag
> that every sensible user has to set is a horrible interface.  If for

There shouldn't be many such users, and the flag still has a
documentation benefit: it makes the caller think about and check
the current dirtying usage.

> some unforeseen reason we'll need the flag later on, it should have
> reverse polarity, and only be added when needed.

OK, I will remove the flag in v5, since loop dio is the first in-kernel
user of ->direct_IO via ITER_KVEC/ITER_BVEC.

But I am wondering whether it is good to decide dirtying by
ITER_KVEC and ITER_BVEC; dirtying pages had better be checked
case by case. At least for loop, the usage is a bit special
(the filesystem over loop takes care of that).


Thanks,
Ming


end of thread, other threads:[~2015-06-09  9:58 UTC | newest]

Thread overview: 11+ messages
2015-05-29  6:35 [PATCH v4 0/4] block: loop: improve loop with AIO Ming Lei
2015-05-29  6:35 ` [PATCH v4 1/5] fs: kiocb: introduce IOCB_DONT_DIRTY_PAGE flag for direct IO Ming Lei
2015-06-05 15:03   ` Christoph Hellwig
2015-06-06  0:42     ` Ming Lei
2015-06-09  6:29       ` Christoph Hellwig
2015-06-09  9:58         ` Ming Lei
2015-05-29  6:35 ` [PATCH v4 2/5] block: loop: set QUEUE_FLAG_NOMERGES for request queue of loop Ming Lei
2015-05-29  6:35 ` [PATCH v4 3/5] block: loop: use kthread_work Ming Lei
2015-05-29  6:35 ` [PATCH v4 4/5] block: loop: prepare for supporting direct IO Ming Lei
2015-05-29  6:35 ` [PATCH v4 5/5] block: loop: support DIO & AIO Ming Lei
2015-06-05  8:26 ` [PATCH v4 0/4] block: loop: improve loop with AIO Ming Lei
