linux-nvdimm.lists.01.org archive mirror
* [PATCH v1 0/6] Remove rw_page
@ 2017-08-08  6:50 Minchan Kim
  2017-08-08  6:50 ` [PATCH v1 1/6] bdi: introduce BDI_CAP_SYNC Minchan Kim
                   ` (5 more replies)
  0 siblings, 6 replies; 33+ messages in thread
From: Minchan Kim @ 2017-08-08  6:50 UTC (permalink / raw)
  To: Andrew Morton
  Cc: linux-kernel, linux-mm, Ross Zwisler, karam . lee, seungho1.park,
	Matthew Wilcox, Christoph Hellwig, Dan Williams, Dave Chinner,
	jack, Jens Axboe, Vishal Verma, linux-nvdimm, kernel-team,
	Minchan Kim

Recently, there was a discussion about removing rw_page due to its
maintenance burden[1], but zram was the sticking point because it showed
a clear win in the benchmark at that time. The reason only zram wins is
the bio allocation wait time from the mempool under extreme memory
pressure.

Christoph Hellwig suggested we could use an on-stack bio for rw_page
devices. This series implements that, replaces the rw_page operations
with on-stack bios and then, finally, removes the rw_page interface
completely.

This series is based on linux-next-20170804.

[1] http://lkml.kernel.org/r/<20170728165604.10455-1-ross.zwisler@linux.intel.com>
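
For reference, the core change is to initialise a bio on the submitter's
stack instead of allocating one from the mempool. A minimal sketch of the
pattern follows (illustrative only; the exact hunks are in patches 2 and 4,
and bdev, sector, page and end_io stand for whatever the call site already
has):

	struct bio sbio;
	struct bio_vec sbvec;

	bio_init(&sbio, &sbvec, 1);
	sbio.bi_bdev = bdev;
	sbio.bi_iter.bi_sector = sector;
	/* the completion handler must not bio_put() a stack bio */
	sbio.bi_end_io = end_io;
	bio_add_page(&sbio, page, PAGE_SIZE, 0);
	bio_set_op_attrs(&sbio, REQ_OP_READ, 0);
	submit_bio(&sbio);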

Minchan Kim (6):
  bdi: introduce BDI_CAP_SYNC
  fs: use on-stack-bio if backing device has BDI_CAP_SYNC capability
  mm:swap: remove end_swap_bio_write argument
  mm:swap: use on-stack-bio for BDI_CAP_SYNC devices
  zram: remove zram_rw_page
  fs: remove rw_page

 drivers/block/brd.c           |   2 +
 drivers/block/zram/zram_drv.c |  54 +---------------
 drivers/nvdimm/btt.c          |   2 +
 drivers/nvdimm/pmem.c         |   2 +
 fs/block_dev.c                |  76 ----------------------
 fs/mpage.c                    |  45 +++++++++++--
 include/linux/backing-dev.h   |   7 ++
 include/linux/blkdev.h        |   4 --
 include/linux/swap.h          |   6 +-
 mm/page_io.c                  | 145 +++++++++++++++++++++++++-----------------
 mm/swapfile.c                 |   3 +
 mm/zswap.c                    |   2 +-
 12 files changed, 147 insertions(+), 201 deletions(-)

-- 
2.7.4


* [PATCH v1 1/6] bdi: introduce BDI_CAP_SYNC
  2017-08-08  6:50 [PATCH v1 0/6] Remove rw_page Minchan Kim
@ 2017-08-08  6:50 ` Minchan Kim
  2017-08-08  6:50 ` [PATCH v1 2/6] fs: use on-stack-bio if backing device has BDI_CAP_SYNC capability Minchan Kim
                   ` (4 subsequent siblings)
  5 siblings, 0 replies; 33+ messages in thread
From: Minchan Kim @ 2017-08-08  6:50 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Jens Axboe, jack, linux-nvdimm, Dave Chinner, Minchan Kim,
	linux-kernel, Matthew Wilcox, Christoph Hellwig, linux-mm,
	kernel-team, seungho1.park, karam . lee

Per the discussion in [1], we will replace rw_page with on-stack bios.
So that such super-fast devices can be detected, this patch introduces
BDI_CAP_SYNC, which means synchronous IO is more efficient than
asynchronous IO for the device, and applies the flag to brd, zram, btt
and pmem.

[1] lkml.kernel.org/r/<20170728165604.10455-1-ross.zwisler@linux.intel.com>
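
For illustration, the intended usage is simply that a driver marks its
backing_dev_info when it creates the disk, and upper layers test the flag
before choosing the on-stack-bio path. A sketch of what the hunks below
and the later patches of this series do:

	/* driver side: brd, zram, btt, pmem */
	disk->queue->backing_dev_info->capabilities |= BDI_CAP_SYNC;

	/* caller side, in later patches of this series */
	if (bdi_cap_synchronous_io(inode_to_bdi(inode))) {
		/* build and submit an on-stack bio synchronously */
	}
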
Signed-off-by: Minchan Kim <minchan@kernel.org>
---
 drivers/block/brd.c           | 2 ++
 drivers/block/zram/zram_drv.c | 2 +-
 drivers/nvdimm/btt.c          | 2 ++
 drivers/nvdimm/pmem.c         | 2 ++
 include/linux/backing-dev.h   | 7 +++++++
 5 files changed, 14 insertions(+), 1 deletion(-)

diff --git a/drivers/block/brd.c b/drivers/block/brd.c
index 293250582f00..97d4e1679de7 100644
--- a/drivers/block/brd.c
+++ b/drivers/block/brd.c
@@ -20,6 +20,7 @@
 #include <linux/radix-tree.h>
 #include <linux/fs.h>
 #include <linux/slab.h>
+#include <linux/backing-dev.h>
 #ifdef CONFIG_BLK_DEV_RAM_DAX
 #include <linux/pfn_t.h>
 #include <linux/dax.h>
@@ -436,6 +437,7 @@ static struct brd_device *brd_alloc(int i)
 	disk->flags		= GENHD_FL_EXT_DEVT;
 	sprintf(disk->disk_name, "ram%d", i);
 	set_capacity(disk, rd_size * 2);
+	disk->queue->backing_dev_info->capabilities |= BDI_CAP_SYNC;
 
 #ifdef CONFIG_BLK_DEV_RAM_DAX
 	queue_flag_set_unlocked(QUEUE_FLAG_DAX, brd->brd_queue);
diff --git a/drivers/block/zram/zram_drv.c b/drivers/block/zram/zram_drv.c
index bbbc2f230b8e..3eda88d0ca95 100644
--- a/drivers/block/zram/zram_drv.c
+++ b/drivers/block/zram/zram_drv.c
@@ -1577,7 +1577,7 @@ static int zram_add(void)
 		blk_queue_max_write_zeroes_sectors(zram->disk->queue, UINT_MAX);
 
 	zram->disk->queue->backing_dev_info->capabilities |=
-					BDI_CAP_STABLE_WRITES;
+				(BDI_CAP_STABLE_WRITES | BDI_CAP_SYNC);
 	add_disk(zram->disk);
 
 	ret = sysfs_create_group(&disk_to_dev(zram->disk)->kobj,
diff --git a/drivers/nvdimm/btt.c b/drivers/nvdimm/btt.c
index e10d3300b64c..16f60351e4fd 100644
--- a/drivers/nvdimm/btt.c
+++ b/drivers/nvdimm/btt.c
@@ -23,6 +23,7 @@
 #include <linux/ndctl.h>
 #include <linux/fs.h>
 #include <linux/nd.h>
+#include <linux/backing-dev.h>
 #include "btt.h"
 #include "nd.h"
 
@@ -1273,6 +1274,7 @@ static int btt_blk_init(struct btt *btt)
 	btt->btt_disk->private_data = btt;
 	btt->btt_disk->queue = btt->btt_queue;
 	btt->btt_disk->flags = GENHD_FL_EXT_DEVT;
+	btt->btt_disk->queue->backing_dev_info->capabilities |= BDI_CAP_SYNC;
 
 	blk_queue_make_request(btt->btt_queue, btt_make_request);
 	blk_queue_logical_block_size(btt->btt_queue, btt->sector_size);
diff --git a/drivers/nvdimm/pmem.c b/drivers/nvdimm/pmem.c
index b5f04559a497..e1704099b5cc 100644
--- a/drivers/nvdimm/pmem.c
+++ b/drivers/nvdimm/pmem.c
@@ -31,6 +31,7 @@
 #include <linux/uio.h>
 #include <linux/dax.h>
 #include <linux/nd.h>
+#include <linux/backing-dev.h>
 #include "pmem.h"
 #include "pfn.h"
 #include "nd.h"
@@ -379,6 +380,7 @@ static int pmem_attach_disk(struct device *dev,
 	disk->fops		= &pmem_fops;
 	disk->queue		= q;
 	disk->flags		= GENHD_FL_EXT_DEVT;
+	disk->queue->backing_dev_info->capabilities |= BDI_CAP_SYNC;
 	nvdimm_namespace_disk_name(ndns, disk->disk_name);
 	set_capacity(disk, (pmem->size - pmem->pfn_pad - pmem->data_offset)
 			/ 512);
diff --git a/include/linux/backing-dev.h b/include/linux/backing-dev.h
index 854e1bdd0b2a..397ee71763d7 100644
--- a/include/linux/backing-dev.h
+++ b/include/linux/backing-dev.h
@@ -123,6 +123,7 @@ int bdi_set_max_ratio(struct backing_dev_info *bdi, unsigned int max_ratio);
  * BDI_CAP_STRICTLIMIT:    Keep number of dirty pages below bdi threshold.
  *
  * BDI_CAP_CGROUP_WRITEBACK: Supports cgroup-aware writeback.
+ * BDI_CAP_SYNC: Device is so fast that asynchronous IO would be inefficient.
  */
 #define BDI_CAP_NO_ACCT_DIRTY	0x00000001
 #define BDI_CAP_NO_WRITEBACK	0x00000002
@@ -130,6 +131,7 @@ int bdi_set_max_ratio(struct backing_dev_info *bdi, unsigned int max_ratio);
 #define BDI_CAP_STABLE_WRITES	0x00000008
 #define BDI_CAP_STRICTLIMIT	0x00000010
 #define BDI_CAP_CGROUP_WRITEBACK 0x00000020
+#define BDI_CAP_SYNC		0x00000040
 
 #define BDI_CAP_NO_ACCT_AND_WRITEBACK \
 	(BDI_CAP_NO_WRITEBACK | BDI_CAP_NO_ACCT_DIRTY | BDI_CAP_NO_ACCT_WB)
@@ -177,6 +179,11 @@ long wait_iff_congested(struct pglist_data *pgdat, int sync, long timeout);
 int pdflush_proc_obsolete(struct ctl_table *table, int write,
 		void __user *buffer, size_t *lenp, loff_t *ppos);
 
+static inline bool bdi_cap_synchronous_io(struct backing_dev_info *bdi)
+{
+	return bdi->capabilities & BDI_CAP_SYNC;
+}
+
 static inline bool bdi_cap_stable_pages_required(struct backing_dev_info *bdi)
 {
 	return bdi->capabilities & BDI_CAP_STABLE_WRITES;
-- 
2.7.4


* [PATCH v1 2/6] fs: use on-stack-bio if backing device has BDI_CAP_SYNC capability
  2017-08-08  6:50 [PATCH v1 0/6] Remove rw_page Minchan Kim
  2017-08-08  6:50 ` [PATCH v1 1/6] bdi: introduce BDI_CAP_SYNC Minchan Kim
@ 2017-08-08  6:50 ` Minchan Kim
  2017-08-08 12:49   ` Matthew Wilcox
  2017-08-08  6:50 ` [PATCH v1 3/6] mm:swap: remove end_swap_bio_write argument Minchan Kim
                   ` (3 subsequent siblings)
  5 siblings, 1 reply; 33+ messages in thread
From: Minchan Kim @ 2017-08-08  6:50 UTC (permalink / raw)
  To: Andrew Morton
  Cc: linux-kernel, linux-mm, Ross Zwisler, karam . lee, seungho1.park,
	Matthew Wilcox, Christoph Hellwig, Dan Williams, Dave Chinner,
	jack, Jens Axboe, Vishal Verma, linux-nvdimm, kernel-team,
	Minchan Kim

There is no need to use dynamic bio allocation for BDI_CAP_SYNC
devices. They can live with on-stack bios without concern about waiting
for bio allocation from the mempool under heavy memory pressure.

Signed-off-by: Minchan Kim <minchan@kernel.org>
---
 fs/mpage.c | 43 +++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 43 insertions(+)

diff --git a/fs/mpage.c b/fs/mpage.c
index 2e4c41ccb5c9..eaeaef27d693 100644
--- a/fs/mpage.c
+++ b/fs/mpage.c
@@ -31,6 +31,14 @@
 #include <linux/cleancache.h>
 #include "internal.h"
 
+static void on_stack_page_end_io(struct bio *bio)
+{
+	struct page *page = bio->bi_io_vec->bv_page;
+
+	page_endio(page, op_is_write(bio_op(bio)),
+		blk_status_to_errno(bio->bi_status));
+}
+
 /*
  * I/O completion handler for multipage BIOs.
  *
@@ -278,6 +286,22 @@ do_mpage_readpage(struct bio *bio, struct page *page, unsigned nr_pages,
 alloc_new:
 	if (bio == NULL) {
 		if (first_hole == blocks_per_page) {
+			if (bdi_cap_synchronous_io(inode_to_bdi(inode))) {
+				/* on-stack-bio */
+				struct bio sbio;
+				struct bio_vec bvec;
+
+				bio_init(&sbio, &bvec, 1);
+				sbio.bi_bdev = bdev;
+				sbio.bi_iter.bi_sector =
+					blocks[0] << (blkbits - 9);
+				sbio.bi_end_io = on_stack_page_end_io;
+				bio_add_page(&sbio, page, PAGE_SIZE, 0);
+				bio_set_op_attrs(&sbio, REQ_OP_READ, 0);
+				submit_bio(&sbio);
+				goto out;
+			}
+
 			if (!bdev_read_page(bdev, blocks[0] << (blkbits - 9),
 								page))
 				goto out;
@@ -604,6 +628,25 @@ static int __mpage_writepage(struct page *page, struct writeback_control *wbc,
 alloc_new:
 	if (bio == NULL) {
 		if (first_unmapped == blocks_per_page) {
+			if (bdi_cap_synchronous_io(inode_to_bdi(inode))) {
+				/* on-stack-bio */
+				struct bio sbio;
+				struct bio_vec bvec;
+
+				bio_init(&sbio, &bvec, 1);
+				sbio.bi_bdev = bdev;
+				sbio.bi_iter.bi_sector =
+					blocks[0] << (blkbits - 9);
+				sbio.bi_end_io = on_stack_page_end_io;
+				bio_add_page(&sbio, page, PAGE_SIZE, 0);
+				bio_set_op_attrs(&sbio, REQ_OP_WRITE, op_flags);
+				WARN_ON_ONCE(PageWriteback(page));
+				set_page_writeback(page);
+				unlock_page(page);
+				submit_bio(&sbio);
+				clean_buffers(page, first_unmapped);
+			}
+
 			if (!bdev_write_page(bdev, blocks[0] << (blkbits - 9),
 								page, wbc)) {
 				clean_buffers(page, first_unmapped);
-- 
2.7.4


* [PATCH v1 3/6] mm:swap: remove end_swap_bio_write argument
  2017-08-08  6:50 [PATCH v1 0/6] Remove rw_page Minchan Kim
  2017-08-08  6:50 ` [PATCH v1 1/6] bdi: introduce BDI_CAP_SYNC Minchan Kim
  2017-08-08  6:50 ` [PATCH v1 2/6] fs: use on-stack-bio if backing device has BDI_CAP_SYNC capability Minchan Kim
@ 2017-08-08  6:50 ` Minchan Kim
  2017-08-08  6:50 ` [PATCH v1 4/6] mm:swap: use on-stack-bio for BDI_CAP_SYNC devices Minchan Kim
                   ` (2 subsequent siblings)
  5 siblings, 0 replies; 33+ messages in thread
From: Minchan Kim @ 2017-08-08  6:50 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Jens Axboe, jack, linux-nvdimm, Dave Chinner, Minchan Kim,
	linux-kernel, Matthew Wilcox, Christoph Hellwig, linux-mm,
	kernel-team, seungho1.park, karam . lee

Every caller of __swap_writepage passes end_swap_bio_write as the
end_write_func argument, so the argument is pointless.
Remove it.

Signed-off-by: Minchan Kim <minchan@kernel.org>
---
 include/linux/swap.h | 3 +--
 mm/page_io.c         | 7 +++----
 mm/zswap.c           | 2 +-
 3 files changed, 5 insertions(+), 7 deletions(-)

diff --git a/include/linux/swap.h b/include/linux/swap.h
index 76f1632eea5a..ae3da979a7b7 100644
--- a/include/linux/swap.h
+++ b/include/linux/swap.h
@@ -336,8 +336,7 @@ extern void kswapd_stop(int nid);
 extern int swap_readpage(struct page *page, bool do_poll);
 extern int swap_writepage(struct page *page, struct writeback_control *wbc);
 extern void end_swap_bio_write(struct bio *bio);
-extern int __swap_writepage(struct page *page, struct writeback_control *wbc,
-	bio_end_io_t end_write_func);
+extern int __swap_writepage(struct page *page, struct writeback_control *wbc);
 extern int swap_set_page_dirty(struct page *page);
 
 int add_swap_extent(struct swap_info_struct *sis, unsigned long start_page,
diff --git a/mm/page_io.c b/mm/page_io.c
index 20139b90125a..3502a97f7c48 100644
--- a/mm/page_io.c
+++ b/mm/page_io.c
@@ -254,7 +254,7 @@ int swap_writepage(struct page *page, struct writeback_control *wbc)
 		end_page_writeback(page);
 		goto out;
 	}
-	ret = __swap_writepage(page, wbc, end_swap_bio_write);
+	ret = __swap_writepage(page, wbc);
 out:
 	return ret;
 }
@@ -273,8 +273,7 @@ static inline void count_swpout_vm_event(struct page *page)
 	count_vm_events(PSWPOUT, hpage_nr_pages(page));
 }
 
-int __swap_writepage(struct page *page, struct writeback_control *wbc,
-		bio_end_io_t end_write_func)
+int __swap_writepage(struct page *page, struct writeback_control *wbc)
 {
 	struct bio *bio;
 	int ret;
@@ -329,7 +328,7 @@ int __swap_writepage(struct page *page, struct writeback_control *wbc,
 	}
 
 	ret = 0;
-	bio = get_swap_bio(GFP_NOIO, page, end_write_func);
+	bio = get_swap_bio(GFP_NOIO, page, end_swap_bio_write);
 	if (bio == NULL) {
 		set_page_dirty(page);
 		unlock_page(page);
diff --git a/mm/zswap.c b/mm/zswap.c
index d39581a076c3..38db258515b5 100644
--- a/mm/zswap.c
+++ b/mm/zswap.c
@@ -900,7 +900,7 @@ static int zswap_writeback_entry(struct zpool *pool, unsigned long handle)
 	SetPageReclaim(page);
 
 	/* start writeback */
-	__swap_writepage(page, &wbc, end_swap_bio_write);
+	__swap_writepage(page, &wbc);
 	put_page(page);
 	zswap_written_back_pages++;
 
-- 
2.7.4


* [PATCH v1 4/6] mm:swap: use on-stack-bio for BDI_CAP_SYNC devices
  2017-08-08  6:50 [PATCH v1 0/6] Remove rw_page Minchan Kim
                   ` (2 preceding siblings ...)
  2017-08-08  6:50 ` [PATCH v1 3/6] mm:swap: remove end_swap_bio_write argument Minchan Kim
@ 2017-08-08  6:50 ` Minchan Kim
  2017-08-08  6:50 ` [PATCH v1 5/6] zram: remove zram_rw_page Minchan Kim
  2017-08-08  6:50 ` [PATCH v1 6/6] fs: remove rw_page Minchan Kim
  5 siblings, 0 replies; 33+ messages in thread
From: Minchan Kim @ 2017-08-08  6:50 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Jens Axboe, jack, linux-nvdimm, Dave Chinner, Minchan Kim,
	linux-kernel, Matthew Wilcox, Christoph Hellwig, linux-mm,
	kernel-team, seungho1.park, karam . lee

There is no need to use dynamic bio allocation for BDI_CAP_SYNC
devices. They can live with on-stack bios without concern about
waiting for bio allocation from the mempool under heavy memory pressure.

It matters even more for swap devices because the bio mempool used for
swap IO is shared with filesystem IO. With on-stack bios, super-fast
swap IO such as zram no longer needs to depend on slow eMMC read/write
completion.

Signed-off-by: Minchan Kim <minchan@kernel.org>
---
 include/linux/swap.h |   3 +-
 mm/page_io.c         | 123 +++++++++++++++++++++++++++++++++++----------------
 mm/swapfile.c        |   3 ++
 3 files changed, 89 insertions(+), 40 deletions(-)

diff --git a/include/linux/swap.h b/include/linux/swap.h
index ae3da979a7b7..6ed9b6423f7d 100644
--- a/include/linux/swap.h
+++ b/include/linux/swap.h
@@ -152,8 +152,9 @@ enum {
 	SWP_AREA_DISCARD = (1 << 8),	/* single-time swap area discards */
 	SWP_PAGE_DISCARD = (1 << 9),	/* freed swap page-cluster discards */
 	SWP_STABLE_WRITES = (1 << 10),	/* no overwrite PG_writeback pages */
+	SWP_SYNC_IO	= (1<<11),	/* synchronous IO is efficient */
 					/* add others here before... */
-	SWP_SCANNING	= (1 << 11),	/* refcount in scan_swap_map */
+	SWP_SCANNING	= (1 << 12),	/* refcount in scan_swap_map */
 };
 
 #define SWAP_CLUSTER_MAX 32UL
diff --git a/mm/page_io.c b/mm/page_io.c
index 3502a97f7c48..d794fd810773 100644
--- a/mm/page_io.c
+++ b/mm/page_io.c
@@ -44,7 +44,7 @@ static struct bio *get_swap_bio(gfp_t gfp_flags,
 	return bio;
 }
 
-void end_swap_bio_write(struct bio *bio)
+void end_swap_bio_write_simple(struct bio *bio)
 {
 	struct page *page = bio->bi_io_vec[0].bv_page;
 
@@ -66,6 +66,11 @@ void end_swap_bio_write(struct bio *bio)
 		ClearPageReclaim(page);
 	}
 	end_page_writeback(page);
+}
+
+void end_swap_bio_write(struct bio *bio)
+{
+	end_swap_bio_write_simple(bio);
 	bio_put(bio);
 }
 
@@ -117,10 +122,9 @@ static void swap_slot_free_notify(struct page *page)
 	}
 }
 
-static void end_swap_bio_read(struct bio *bio)
+static void end_swap_bio_read_simple(struct bio *bio)
 {
 	struct page *page = bio->bi_io_vec[0].bv_page;
-	struct task_struct *waiter = bio->bi_private;
 
 	if (bio->bi_status) {
 		SetPageError(page);
@@ -136,6 +140,13 @@ static void end_swap_bio_read(struct bio *bio)
 	swap_slot_free_notify(page);
 out:
 	unlock_page(page);
+}
+
+static void end_swap_bio_read(struct bio *bio)
+{
+	struct task_struct *waiter = bio->bi_private;
+
+	end_swap_bio_read_simple(bio);
 	WRITE_ONCE(bio->bi_private, NULL);
 	bio_put(bio);
 	wake_up_process(waiter);
@@ -275,7 +286,6 @@ static inline void count_swpout_vm_event(struct page *page)
 
 int __swap_writepage(struct page *page, struct writeback_control *wbc)
 {
-	struct bio *bio;
 	int ret;
 	struct swap_info_struct *sis = page_swap_info(page);
 
@@ -328,25 +338,43 @@ int __swap_writepage(struct page *page, struct writeback_control *wbc)
 	}
 
 	ret = 0;
-	bio = get_swap_bio(GFP_NOIO, page, end_swap_bio_write);
-	if (bio == NULL) {
-		set_page_dirty(page);
+	if (!(sis->flags & SWP_SYNC_IO)) {
+		struct bio *bio;
+
+		bio = get_swap_bio(GFP_NOIO, page, end_swap_bio_write);
+		if (bio == NULL) {
+			set_page_dirty(page);
+			unlock_page(page);
+			ret = -ENOMEM;
+			goto out;
+		}
+		bio->bi_opf = REQ_OP_WRITE | wbc_to_write_flags(wbc);
+		set_page_writeback(page);
 		unlock_page(page);
-		ret = -ENOMEM;
-		goto out;
+		submit_bio(bio);
+	} else {
+
+		/* on-stack-bio */
+		struct bio sbio;
+		struct bio_vec bvec;
+
+		bio_init(&sbio, &bvec, 1);
+		sbio.bi_bdev = sis->bdev;
+		sbio.bi_iter.bi_sector = swap_page_sector(page);
+		sbio.bi_end_io = end_swap_bio_write_simple;
+		bio_add_page(&sbio, page, PAGE_SIZE, 0);
+		bio_set_op_attrs(&sbio, REQ_OP_WRITE, wbc_to_write_flags(wbc));
+		set_page_writeback(page);
+		unlock_page(page);
+		submit_bio(&sbio);
 	}
-	bio->bi_opf = REQ_OP_WRITE | wbc_to_write_flags(wbc);
 	count_swpout_vm_event(page);
-	set_page_writeback(page);
-	unlock_page(page);
-	submit_bio(bio);
 out:
 	return ret;
 }
 
 int swap_readpage(struct page *page, bool do_poll)
 {
-	struct bio *bio;
 	int ret = 0;
 	struct swap_info_struct *sis = page_swap_info(page);
 	blk_qc_t qc;
@@ -383,33 +411,50 @@ int swap_readpage(struct page *page, bool do_poll)
 	}
 
 	ret = 0;
-	bio = get_swap_bio(GFP_KERNEL, page, end_swap_bio_read);
-	if (bio == NULL) {
-		unlock_page(page);
-		ret = -ENOMEM;
-		goto out;
-	}
-	bdev = bio->bi_bdev;
-	/*
-	 * Keep this task valid during swap readpage because the oom killer may
-	 * attempt to access it in the page fault retry time check.
-	 */
-	get_task_struct(current);
-	bio->bi_private = current;
-	bio_set_op_attrs(bio, REQ_OP_READ, 0);
 	count_vm_event(PSWPIN);
-	bio_get(bio);
-	qc = submit_bio(bio);
-	while (do_poll) {
-		set_current_state(TASK_UNINTERRUPTIBLE);
-		if (!READ_ONCE(bio->bi_private))
-			break;
-
-		if (!blk_mq_poll(bdev_get_queue(bdev), qc))
-			break;
+	if (!(sis->flags & SWP_SYNC_IO)) {
+		struct bio *bio;
+
+		bio = get_swap_bio(GFP_KERNEL, page, end_swap_bio_read);
+		if (bio == NULL) {
+			unlock_page(page);
+			ret = -ENOMEM;
+			goto out;
+		}
+		bdev = bio->bi_bdev;
+		/*
+		 * Keep this task valid during swap readpage because
+		 * the oom killer may attempt to access it
+		 * in the page fault retry time check.
+		 */
+		get_task_struct(current);
+		bio->bi_private = current;
+		bio_set_op_attrs(bio, REQ_OP_READ, 0);
+		bio_get(bio);
+		qc = submit_bio(bio);
+		while (do_poll) {
+			set_current_state(TASK_UNINTERRUPTIBLE);
+			if (!READ_ONCE(bio->bi_private))
+				break;
+
+			if (!blk_mq_poll(bdev_get_queue(bdev), qc))
+				break;
+		}
+		__set_current_state(TASK_RUNNING);
+		bio_put(bio);
+	} else {
+		/* on-stack-bio */
+		struct bio sbio;
+		struct bio_vec bvec;
+
+		bio_init(&sbio, &bvec, 1);
+		sbio.bi_bdev = sis->bdev;
+		sbio.bi_iter.bi_sector = swap_page_sector(page);
+		sbio.bi_end_io = end_swap_bio_read_simple;
+		bio_add_page(&sbio, page, PAGE_SIZE, 0);
+		bio_set_op_attrs(&sbio, REQ_OP_READ, 0);
+		submit_bio(&sbio);
 	}
-	__set_current_state(TASK_RUNNING);
-	bio_put(bio);
 
 out:
 	return ret;
diff --git a/mm/swapfile.c b/mm/swapfile.c
index 42eff9e4e972..e916b325b0b7 100644
--- a/mm/swapfile.c
+++ b/mm/swapfile.c
@@ -3113,6 +3113,9 @@ SYSCALL_DEFINE2(swapon, const char __user *, specialfile, int, swap_flags)
 	if (bdi_cap_stable_pages_required(inode_to_bdi(inode)))
 		p->flags |= SWP_STABLE_WRITES;
 
+	if (bdi_cap_synchronous_io(inode_to_bdi(inode)))
+		p->flags |= SWP_SYNC_IO;
+
 	if (p->bdev && blk_queue_nonrot(bdev_get_queue(p->bdev))) {
 		int cpu;
 		unsigned long ci, nr_cluster;
-- 
2.7.4


* [PATCH v1 5/6] zram: remove zram_rw_page
  2017-08-08  6:50 [PATCH v1 0/6] Remove rw_page Minchan Kim
                   ` (3 preceding siblings ...)
  2017-08-08  6:50 ` [PATCH v1 4/6] mm:swap: use on-stack-bio for BDI_CAP_SYNC devices Minchan Kim
@ 2017-08-08  6:50 ` Minchan Kim
  2017-08-08  7:02   ` Sergey Senozhatsky
  2017-08-08  6:50 ` [PATCH v1 6/6] fs: remove rw_page Minchan Kim
  5 siblings, 1 reply; 33+ messages in thread
From: Minchan Kim @ 2017-08-08  6:50 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Jens Axboe, Sergey Senozhatsky, jack, linux-nvdimm, Dave Chinner,
	Minchan Kim, linux-kernel, Matthew Wilcox, Christoph Hellwig,
	linux-mm, kernel-team, seungho1.park, karam . lee

With on-stack-bio, rw_page interface doesn't provide a clear performance
benefit for zram and surely has a maintenance burden, so remove the
last user to remove rw_page completely.

Cc: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
Signed-off-by: Minchan Kim <minchan@kernel.org>
---
 drivers/block/zram/zram_drv.c | 52 -------------------------------------------
 1 file changed, 52 deletions(-)

diff --git a/drivers/block/zram/zram_drv.c b/drivers/block/zram/zram_drv.c
index 3eda88d0ca95..9620163308fa 100644
--- a/drivers/block/zram/zram_drv.c
+++ b/drivers/block/zram/zram_drv.c
@@ -1268,57 +1268,6 @@ static void zram_slot_free_notify(struct block_device *bdev,
 	atomic64_inc(&zram->stats.notify_free);
 }
 
-static int zram_rw_page(struct block_device *bdev, sector_t sector,
-		       struct page *page, bool is_write)
-{
-	int offset, ret;
-	u32 index;
-	struct zram *zram;
-	struct bio_vec bv;
-
-	if (PageTransHuge(page))
-		return -ENOTSUPP;
-	zram = bdev->bd_disk->private_data;
-
-	if (!valid_io_request(zram, sector, PAGE_SIZE)) {
-		atomic64_inc(&zram->stats.invalid_io);
-		ret = -EINVAL;
-		goto out;
-	}
-
-	index = sector >> SECTORS_PER_PAGE_SHIFT;
-	offset = (sector & (SECTORS_PER_PAGE - 1)) << SECTOR_SHIFT;
-
-	bv.bv_page = page;
-	bv.bv_len = PAGE_SIZE;
-	bv.bv_offset = 0;
-
-	ret = zram_bvec_rw(zram, &bv, index, offset, is_write, NULL);
-out:
-	/*
-	 * If I/O fails, just return error(ie, non-zero) without
-	 * calling page_endio.
-	 * It causes resubmit the I/O with bio request by upper functions
-	 * of rw_page(e.g., swap_readpage, __swap_writepage) and
-	 * bio->bi_end_io does things to handle the error
-	 * (e.g., SetPageError, set_page_dirty and extra works).
-	 */
-	if (unlikely(ret < 0))
-		return ret;
-
-	switch (ret) {
-	case 0:
-		page_endio(page, is_write, 0);
-		break;
-	case 1:
-		ret = 0;
-		break;
-	default:
-		WARN_ON(1);
-	}
-	return ret;
-}
-
 static void zram_reset_device(struct zram *zram)
 {
 	struct zcomp *comp;
@@ -1460,7 +1409,6 @@ static int zram_open(struct block_device *bdev, fmode_t mode)
 static const struct block_device_operations zram_devops = {
 	.open = zram_open,
 	.swap_slot_free_notify = zram_slot_free_notify,
-	.rw_page = zram_rw_page,
 	.owner = THIS_MODULE
 };
 
-- 
2.7.4


* [PATCH v1 6/6] fs: remove rw_page
  2017-08-08  6:50 [PATCH v1 0/6] Remove rw_page Minchan Kim
                   ` (4 preceding siblings ...)
  2017-08-08  6:50 ` [PATCH v1 5/6] zram: remove zram_rw_page Minchan Kim
@ 2017-08-08  6:50 ` Minchan Kim
  5 siblings, 0 replies; 33+ messages in thread
From: Minchan Kim @ 2017-08-08  6:50 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Jens Axboe, jack, linux-nvdimm, Dave Chinner, Minchan Kim,
	linux-kernel, Matthew Wilcox, Christoph Hellwig, linux-mm,
	kernel-team, seungho1.park, karam . lee

Currently, there is no user of rw_page so remove it.

Signed-off-by: Minchan Kim <minchan@kernel.org>
---
 fs/block_dev.c         | 76 --------------------------------------------------
 fs/mpage.c             | 12 ++------
 include/linux/blkdev.h |  4 ---
 mm/page_io.c           | 17 -----------
 4 files changed, 2 insertions(+), 107 deletions(-)

diff --git a/fs/block_dev.c b/fs/block_dev.c
index 9941dc8342df..6fb408041e7d 100644
--- a/fs/block_dev.c
+++ b/fs/block_dev.c
@@ -649,82 +649,6 @@ int blkdev_fsync(struct file *filp, loff_t start, loff_t end, int datasync)
 }
 EXPORT_SYMBOL(blkdev_fsync);
 
-/**
- * bdev_read_page() - Start reading a page from a block device
- * @bdev: The device to read the page from
- * @sector: The offset on the device to read the page to (need not be aligned)
- * @page: The page to read
- *
- * On entry, the page should be locked.  It will be unlocked when the page
- * has been read.  If the block driver implements rw_page synchronously,
- * that will be true on exit from this function, but it need not be.
- *
- * Errors returned by this function are usually "soft", eg out of memory, or
- * queue full; callers should try a different route to read this page rather
- * than propagate an error back up the stack.
- *
- * Return: negative errno if an error occurs, 0 if submission was successful.
- */
-int bdev_read_page(struct block_device *bdev, sector_t sector,
-			struct page *page)
-{
-	const struct block_device_operations *ops = bdev->bd_disk->fops;
-	int result = -EOPNOTSUPP;
-
-	if (!ops->rw_page || bdev_get_integrity(bdev))
-		return result;
-
-	result = blk_queue_enter(bdev->bd_queue, false);
-	if (result)
-		return result;
-	result = ops->rw_page(bdev, sector + get_start_sect(bdev), page, false);
-	blk_queue_exit(bdev->bd_queue);
-	return result;
-}
-EXPORT_SYMBOL_GPL(bdev_read_page);
-
-/**
- * bdev_write_page() - Start writing a page to a block device
- * @bdev: The device to write the page to
- * @sector: The offset on the device to write the page to (need not be aligned)
- * @page: The page to write
- * @wbc: The writeback_control for the write
- *
- * On entry, the page should be locked and not currently under writeback.
- * On exit, if the write started successfully, the page will be unlocked and
- * under writeback.  If the write failed already (eg the driver failed to
- * queue the page to the device), the page will still be locked.  If the
- * caller is a ->writepage implementation, it will need to unlock the page.
- *
- * Errors returned by this function are usually "soft", eg out of memory, or
- * queue full; callers should try a different route to write this page rather
- * than propagate an error back up the stack.
- *
- * Return: negative errno if an error occurs, 0 if submission was successful.
- */
-int bdev_write_page(struct block_device *bdev, sector_t sector,
-			struct page *page, struct writeback_control *wbc)
-{
-	int result;
-	const struct block_device_operations *ops = bdev->bd_disk->fops;
-
-	if (!ops->rw_page || bdev_get_integrity(bdev))
-		return -EOPNOTSUPP;
-	result = blk_queue_enter(bdev->bd_queue, false);
-	if (result)
-		return result;
-
-	set_page_writeback(page);
-	result = ops->rw_page(bdev, sector + get_start_sect(bdev), page, true);
-	if (result)
-		end_page_writeback(page);
-	else
-		unlock_page(page);
-	blk_queue_exit(bdev->bd_queue);
-	return result;
-}
-EXPORT_SYMBOL_GPL(bdev_write_page);
-
 /*
  * pseudo-fs
  */
diff --git a/fs/mpage.c b/fs/mpage.c
index eaeaef27d693..707d77fe7289 100644
--- a/fs/mpage.c
+++ b/fs/mpage.c
@@ -301,11 +301,8 @@ do_mpage_readpage(struct bio *bio, struct page *page, unsigned nr_pages,
 				submit_bio(&sbio);
 				goto out;
 			}
-
-			if (!bdev_read_page(bdev, blocks[0] << (blkbits - 9),
-								page))
-				goto out;
 		}
+
 		bio = mpage_alloc(bdev, blocks[0] << (blkbits - 9),
 				min_t(int, nr_pages, BIO_MAX_PAGES), gfp);
 		if (bio == NULL)
@@ -646,13 +643,8 @@ static int __mpage_writepage(struct page *page, struct writeback_control *wbc,
 				submit_bio(&sbio);
 				clean_buffers(page, first_unmapped);
 			}
-
-			if (!bdev_write_page(bdev, blocks[0] << (blkbits - 9),
-								page, wbc)) {
-				clean_buffers(page, first_unmapped);
-				goto out;
-			}
 		}
+
 		bio = mpage_alloc(bdev, blocks[0] << (blkbits - 9),
 				BIO_MAX_PAGES, GFP_NOFS|__GFP_HIGH);
 		if (bio == NULL)
diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
index 25f6a0cb27d3..21fffa849033 100644
--- a/include/linux/blkdev.h
+++ b/include/linux/blkdev.h
@@ -1936,7 +1936,6 @@ static inline bool integrity_req_gap_front_merge(struct request *req,
 struct block_device_operations {
 	int (*open) (struct block_device *, fmode_t);
 	void (*release) (struct gendisk *, fmode_t);
-	int (*rw_page)(struct block_device *, sector_t, struct page *, bool);
 	int (*ioctl) (struct block_device *, fmode_t, unsigned, unsigned long);
 	int (*compat_ioctl) (struct block_device *, fmode_t, unsigned, unsigned long);
 	unsigned int (*check_events) (struct gendisk *disk,
@@ -1954,9 +1953,6 @@ struct block_device_operations {
 
 extern int __blkdev_driver_ioctl(struct block_device *, fmode_t, unsigned int,
 				 unsigned long);
-extern int bdev_read_page(struct block_device *, sector_t, struct page *);
-extern int bdev_write_page(struct block_device *, sector_t, struct page *,
-						struct writeback_control *);
 #else /* CONFIG_BLOCK */
 
 struct block_device;
diff --git a/mm/page_io.c b/mm/page_io.c
index d794fd810773..1cbbac7b852a 100644
--- a/mm/page_io.c
+++ b/mm/page_io.c
@@ -331,12 +331,6 @@ int __swap_writepage(struct page *page, struct writeback_control *wbc)
 		return ret;
 	}
 
-	ret = bdev_write_page(sis->bdev, swap_page_sector(page), page, wbc);
-	if (!ret) {
-		count_swpout_vm_event(page);
-		return 0;
-	}
-
 	ret = 0;
 	if (!(sis->flags & SWP_SYNC_IO)) {
 		struct bio *bio;
@@ -399,17 +393,6 @@ int swap_readpage(struct page *page, bool do_poll)
 		return ret;
 	}
 
-	ret = bdev_read_page(sis->bdev, swap_page_sector(page), page);
-	if (!ret) {
-		if (trylock_page(page)) {
-			swap_slot_free_notify(page);
-			unlock_page(page);
-		}
-
-		count_vm_event(PSWPIN);
-		return 0;
-	}
-
 	ret = 0;
 	count_vm_event(PSWPIN);
 	if (!(sis->flags & SWP_SYNC_IO)) {
-- 
2.7.4


* Re: [PATCH v1 5/6] zram: remove zram_rw_page
  2017-08-08  6:50 ` [PATCH v1 5/6] zram: remove zram_rw_page Minchan Kim
@ 2017-08-08  7:02   ` Sergey Senozhatsky
  2017-08-08  8:13     ` Minchan Kim
  0 siblings, 1 reply; 33+ messages in thread
From: Sergey Senozhatsky @ 2017-08-08  7:02 UTC (permalink / raw)
  To: Minchan Kim
  Cc: Andrew Morton, linux-kernel, linux-mm, Ross Zwisler, karam . lee,
	seungho1.park, Matthew Wilcox, Christoph Hellwig, Dan Williams,
	Dave Chinner, jack, Jens Axboe, Vishal Verma, linux-nvdimm,
	kernel-team, Sergey Senozhatsky

On (08/08/17 15:50), Minchan Kim wrote:
> With on-stack-bio, rw_page interface doesn't provide a clear performance
> benefit for zram and surely has a maintenance burden, so remove the
> last user to remove rw_page completely.

OK, never really liked it, I think we had that conversation before.

as far as I remember, zram_rw_page() was the reason we had to do some
tricks with init_lock to make lockdep happy. may be now we can "simplify"
the things back.


> Cc: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
> Signed-off-by: Minchan Kim <minchan@kernel.org>

Reviewed-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>

	-ss


* Re: [PATCH v1 5/6] zram: remove zram_rw_page
  2017-08-08  7:02   ` Sergey Senozhatsky
@ 2017-08-08  8:13     ` Minchan Kim
  2017-08-08  8:23       ` Sergey Senozhatsky
  0 siblings, 1 reply; 33+ messages in thread
From: Minchan Kim @ 2017-08-08  8:13 UTC (permalink / raw)
  To: Sergey Senozhatsky
  Cc: Jens Axboe, Sergey Senozhatsky, jack, linux-nvdimm, Dave Chinner,
	linux-kernel, Matthew Wilcox, Christoph Hellwig, linux-mm,
	kernel-team, seungho1.park, karam . lee, Andrew Morton

Hi Sergey,

On Tue, Aug 08, 2017 at 04:02:26PM +0900, Sergey Senozhatsky wrote:
> On (08/08/17 15:50), Minchan Kim wrote:
> > With on-stack-bio, rw_page interface doesn't provide a clear performance
> > benefit for zram and surely has a maintenance burden, so remove the
> > last user to remove rw_page completely.
> 
> OK, never really liked it, I think we had that conversation before.
> 
> as far as I remember, zram_rw_page() was the reason we had to do some
> tricks with init_lock to make lockdep happy. may be now we can "simplify"
> the things back.

I cannot remember. Blame my brain. ;-)

Anyway, making things simpler is always welcome.
Could you send a patch once this patchset settles down?

> 
> 
> > Cc: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
> > Signed-off-by: Minchan Kim <minchan@kernel.org>
> 
> Reviewed-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>

Thanks for the review!

* Re: [PATCH v1 5/6] zram: remove zram_rw_page
  2017-08-08  8:13     ` Minchan Kim
@ 2017-08-08  8:23       ` Sergey Senozhatsky
  2017-08-08 15:48         ` Matthew Wilcox
  0 siblings, 1 reply; 33+ messages in thread
From: Sergey Senozhatsky @ 2017-08-08  8:23 UTC (permalink / raw)
  To: Minchan Kim
  Cc: Jens Axboe, Sergey Senozhatsky, jack, Sergey Senozhatsky,
	linux-nvdimm, Dave Chinner, linux-kernel, Matthew Wilcox,
	Christoph Hellwig, linux-mm, kernel-team, seungho1.park,
	Andrew Morton, karam . lee

Hello Minchan,

On (08/08/17 17:13), Minchan Kim wrote:
> Hi Sergey,
> 
> On Tue, Aug 08, 2017 at 04:02:26PM +0900, Sergey Senozhatsky wrote:
> > On (08/08/17 15:50), Minchan Kim wrote:
> > > With on-stack-bio, rw_page interface doesn't provide a clear performance
> > > benefit for zram and surely has a maintenance burden, so remove the
> > > last user to remove rw_page completely.
> > 
> > OK, never really liked it, I think we had that conversation before.
> > 
> > as far as I remember, zram_rw_page() was the reason we had to do some
> > tricks with init_lock to make lockdep happy. may be now we can "simplify"
> > the things back.
> 
> I cannot remember. Blame my brain. ;-)

no worries. I didn't remember it clearly as well, hence the "may be" part.

commit 08eee69fcf6baea543a2b4d2a2fcba0e61aa3160
Author: Minchan Kim

    zram: remove init_lock in zram_make_request
    
    Admin could reset zram during I/O operation going on so we have used
    zram->init_lock as read-side lock in I/O path to prevent sudden zram
    meta freeing.
    
    However, the init_lock is really troublesome.  We can't do call
    zram_meta_alloc under init_lock due to lockdep splat because
    zram_rw_page is one of the function under reclaim path and hold it as
    read_lock while other places in process context hold it as write_lock.
    So, we have used allocation out of the lock to avoid lockdep warn but
    it's not good for readability and fainally, I met another lockdep splat
    between init_lock and cpu_hotplug from kmem_cache_destroy during working
    zsmalloc compaction.  :(
    
    Yes, the ideal is to remove horrible init_lock of zram in rw path.  This
    patch removes it in rw path and instead, add atomic refcount for meta
    lifetime management and completion to free meta in process context.
    It's important to free meta in process context because some of resource
    destruction needs mutex lock, which could be held if we releases the
    resource in reclaim context so it's deadlock, again.
    
    As a bonus, we could remove init_done check in rw path because
    zram_meta_get will do a role for it, instead.

> Anyway, it's always welcome to make thing simple.
> Could you send a patch after settle down this patchset?

well, if it will improve anything after all :)

	-ss

* Re: [PATCH v1 2/6] fs: use on-stack-bio if backing device has BDI_CAP_SYNC capability
  2017-08-08  6:50 ` [PATCH v1 2/6] fs: use on-stack-bio if backing device has BDI_CAP_SYNC capability Minchan Kim
@ 2017-08-08 12:49   ` Matthew Wilcox
  2017-08-08 13:29     ` Matthew Wilcox
  2017-08-09  1:48     ` Minchan Kim
  0 siblings, 2 replies; 33+ messages in thread
From: Matthew Wilcox @ 2017-08-08 12:49 UTC (permalink / raw)
  To: Minchan Kim
  Cc: Andrew Morton, linux-kernel, linux-mm, Ross Zwisler, karam . lee,
	seungho1.park, Christoph Hellwig, Dan Williams, Dave Chinner,
	jack, Jens Axboe, Vishal Verma, linux-nvdimm, kernel-team

On Tue, Aug 08, 2017 at 03:50:20PM +0900, Minchan Kim wrote:
> There is no need to use dynamic bio allocation for BDI_CAP_SYNC
> devices. They can with on-stack-bio without concern about waiting
> bio allocation from mempool under heavy memory pressure.

This seems ... more complex than necessary?  Why not simply do this:

diff --git a/fs/mpage.c b/fs/mpage.c
index baff8f820c29..6db6bf5131ed 100644
--- a/fs/mpage.c
+++ b/fs/mpage.c
@@ -157,6 +157,8 @@ do_mpage_readpage(struct bio *bio, struct page *page, unsigned nr_pages,
 	unsigned page_block;
 	unsigned first_hole = blocks_per_page;
 	struct block_device *bdev = NULL;
+	struct bio sbio;
+	struct bio_vec sbvec;
 	int length;
 	int fully_mapped = 1;
 	unsigned nblocks;
@@ -281,10 +283,17 @@ do_mpage_readpage(struct bio *bio, struct page *page, unsigned nr_pages,
 								page))
 				goto out;
 		}
-		bio = mpage_alloc(bdev, blocks[0] << (blkbits - 9),
+		if (bdi_cap_synchronous_io(inode_to_bdi(inode))) {
+			bio = &sbio;
+			bio_init(bio, &sbvec, nr_pages);
+			sbio.bi_bdev = bdev;
+			sbio.bi_iter.bi_sector = blocks[0] << (blkbits - 9);
+		} else {
+			bio = mpage_alloc(bdev, blocks[0] << (blkbits - 9),
 				min_t(int, nr_pages, BIO_MAX_PAGES), gfp);
-		if (bio == NULL)
-			goto confused;
+			if (bio == NULL)
+				goto confused;
+		}
 	}
 
 	length = first_hole << blkbits;
@@ -301,6 +310,8 @@ do_mpage_readpage(struct bio *bio, struct page *page, unsigned nr_pages,
 	else
 		*last_block_in_bio = blocks[blocks_per_page - 1];
 out:
+	if (bio == &sbio)
+		bio = mpage_bio_submit(REQ_OP_READ, 0, bio);
 	return bio;
 
 confused:


* Re: [PATCH v1 2/6] fs: use on-stack-bio if backing device has BDI_CAP_SYNC capability
  2017-08-08 12:49   ` Matthew Wilcox
@ 2017-08-08 13:29     ` Matthew Wilcox
  2017-08-09  1:51       ` Minchan Kim
  2017-08-09  1:48     ` Minchan Kim
  1 sibling, 1 reply; 33+ messages in thread
From: Matthew Wilcox @ 2017-08-08 13:29 UTC (permalink / raw)
  To: Minchan Kim
  Cc: Andrew Morton, linux-kernel, linux-mm, Ross Zwisler, karam . lee,
	seungho1.park, Christoph Hellwig, Dan Williams, Dave Chinner,
	jack, Jens Axboe, Vishal Verma, linux-nvdimm, kernel-team

On Tue, Aug 08, 2017 at 05:49:59AM -0700, Matthew Wilcox wrote:
> +	struct bio sbio;
> +	struct bio_vec sbvec;

... this needs to be sbvec[nr_pages], of course.

> -		bio = mpage_alloc(bdev, blocks[0] << (blkbits - 9),
> +		if (bdi_cap_synchronous_io(inode_to_bdi(inode))) {
> +			bio = &sbio;
> +			bio_init(bio, &sbvec, nr_pages);

... and this needs to be 'sbvec', not '&sbvec'.
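
With both fixes applied, the hunk I had in mind looks roughly like this
(still untested). The declarations become:

	struct bio sbio;
	struct bio_vec sbvec[nr_pages];	/* note: a variable-length array */

and the allocation site becomes:

	if (bdi_cap_synchronous_io(inode_to_bdi(inode))) {
		bio = &sbio;
		bio_init(bio, sbvec, nr_pages);
		sbio.bi_bdev = bdev;
		sbio.bi_iter.bi_sector = blocks[0] << (blkbits - 9);
	} else {
		bio = mpage_alloc(bdev, blocks[0] << (blkbits - 9),
				min_t(int, nr_pages, BIO_MAX_PAGES), gfp);
		if (bio == NULL)
			goto confused;
	}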


* Re: [PATCH v1 5/6] zram: remove zram_rw_page
  2017-08-08  8:23       ` Sergey Senozhatsky
@ 2017-08-08 15:48         ` Matthew Wilcox
  0 siblings, 0 replies; 33+ messages in thread
From: Matthew Wilcox @ 2017-08-08 15:48 UTC (permalink / raw)
  To: Sergey Senozhatsky
  Cc: Minchan Kim, Andrew Morton, linux-kernel, linux-mm, Ross Zwisler,
	karam . lee, seungho1.park, Christoph Hellwig, Dan Williams,
	Dave Chinner, jack, Jens Axboe, Vishal Verma, linux-nvdimm,
	kernel-team, Sergey Senozhatsky

On Tue, Aug 08, 2017 at 05:23:50PM +0900, Sergey Senozhatsky wrote:
> Hello Minchan,
> 
> On (08/08/17 17:13), Minchan Kim wrote:
> > Hi Sergey,
> > 
> > On Tue, Aug 08, 2017 at 04:02:26PM +0900, Sergey Senozhatsky wrote:
> > > On (08/08/17 15:50), Minchan Kim wrote:
> > > > With on-stack-bio, rw_page interface doesn't provide a clear performance
> > > > benefit for zram and surely has a maintenance burden, so remove the
> > > > last user to remove rw_page completely.
> > > 
> > > OK, never really liked it, I think we had that conversation before.
> > > 
> > > as far as I remember, zram_rw_page() was the reason we had to do some
> > > tricks with init_lock to make lockdep happy. may be now we can "simplify"
> > > the things back.
> > 
> > I cannot remember. Blame my brain. ;-)
> 
> no worries. I didn't remember it clearly as well, hence the "may be" part.
> 
> commit 08eee69fcf6baea543a2b4d2a2fcba0e61aa3160
> Author: Minchan Kim
> 
>     zram: remove init_lock in zram_make_request
>     
>     Admin could reset zram during I/O operation going on so we have used
>     zram->init_lock as read-side lock in I/O path to prevent sudden zram
>     meta freeing.
>     
>     However, the init_lock is really troublesome.  We can't do call
>     zram_meta_alloc under init_lock due to lockdep splat because
>     zram_rw_page is one of the function under reclaim path and hold it as
>     read_lock while other places in process context hold it as write_lock.
>     So, we have used allocation out of the lock to avoid lockdep warn but
>     it's not good for readability and fainally, I met another lockdep splat
>     between init_lock and cpu_hotplug from kmem_cache_destroy during working
>     zsmalloc compaction.  :(

I don't think this patch is going to change anything with respect to the
use of init_lock.  You're still going to be called in the reclaim path,
no longer through rw_page, but through the bio path instead.


* Re: [PATCH v1 2/6] fs: use on-stack-bio if backing device has BDI_CAP_SYNC capability
  2017-08-08 12:49   ` Matthew Wilcox
  2017-08-08 13:29     ` Matthew Wilcox
@ 2017-08-09  1:48     ` Minchan Kim
  1 sibling, 0 replies; 33+ messages in thread
From: Minchan Kim @ 2017-08-09  1:48 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: Andrew Morton, linux-kernel, linux-mm, Ross Zwisler, karam . lee,
	seungho1.park, Christoph Hellwig, Dan Williams, Dave Chinner,
	jack, Jens Axboe, Vishal Verma, linux-nvdimm, kernel-team

Hi Matthew,

On Tue, Aug 08, 2017 at 05:49:59AM -0700, Matthew Wilcox wrote:
> On Tue, Aug 08, 2017 at 03:50:20PM +0900, Minchan Kim wrote:
> > There is no need to use dynamic bio allocation for BDI_CAP_SYNC
> > devices. They can with on-stack-bio without concern about waiting
> > bio allocation from mempool under heavy memory pressure.
> 
> This seems ... more complex than necessary?  Why not simply do this:
> 
> diff --git a/fs/mpage.c b/fs/mpage.c
> index baff8f820c29..6db6bf5131ed 100644
> --- a/fs/mpage.c
> +++ b/fs/mpage.c
> @@ -157,6 +157,8 @@ do_mpage_readpage(struct bio *bio, struct page *page, unsigned nr_pages,
>  	unsigned page_block;
>  	unsigned first_hole = blocks_per_page;
>  	struct block_device *bdev = NULL;
> +	struct bio sbio;
> +	struct bio_vec sbvec;
>  	int length;
>  	int fully_mapped = 1;
>  	unsigned nblocks;
> @@ -281,10 +283,17 @@ do_mpage_readpage(struct bio *bio, struct page *page, unsigned nr_pages,
>  								page))
>  				goto out;
>  		}
> -		bio = mpage_alloc(bdev, blocks[0] << (blkbits - 9),
> +		if (bdi_cap_synchronous_io(inode_to_bdi(inode))) {
> +			bio = &sbio;
> +			bio_init(bio, &sbvec, nr_pages);
> +			sbio.bi_bdev = bdev;
> +			sbio.bi_iter.bi_sector = blocks[0] << (blkbits - 9);
> +		} else {
> +			bio = mpage_alloc(bdev, blocks[0] << (blkbits - 9),
>  				min_t(int, nr_pages, BIO_MAX_PAGES), gfp);
> -		if (bio == NULL)
> -			goto confused;
> +			if (bio == NULL)
> +				goto confused;
> +		}
>  	}
>  
>  	length = first_hole << blkbits;
> @@ -301,6 +310,8 @@ do_mpage_readpage(struct bio *bio, struct page *page, unsigned nr_pages,
>  	else
>  		*last_block_in_bio = blocks[blocks_per_page - 1];
>  out:
> +	if (bio == &sbio)
> +		bio = mpage_bio_submit(REQ_OP_READ, 0, bio);

Looks nicer but one nitpick:

To reuse mpage_bio_submit, we would need to call bio_get on the
on-stack bio, which doesn't feel right to me, but if you think it's more
readable and are okay with the overhead of two unnecessary atomic
instructions (bio_get/put), I will do it in the next spin.
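
To spell out the nitpick (this is just my reading of the current code, so
please correct me if I am wrong): mpage_end_io() finishes with bio_put(),
and bio_put() on a bio that was only bio_init()-ed would try to free it,
so the on-stack path would have to do something like:

	bio_get(&sbio);		/* set BIO_REFFED, __bi_cnt 1 -> 2 */
	mpage_bio_submit(REQ_OP_READ, 0, &sbio);
				/*
				 * completion runs mpage_end_io() -> bio_put(),
				 * which only drops __bi_cnt 2 -> 1, so the
				 * stack bio is never handed to bio_free().
				 */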

Thanks.


* Re: [PATCH v1 2/6] fs: use on-stack-bio if backing device has BDI_CAP_SYNC capability
  2017-08-08 13:29     ` Matthew Wilcox
@ 2017-08-09  1:51       ` Minchan Kim
  2017-08-09  2:31         ` Matthew Wilcox
  0 siblings, 1 reply; 33+ messages in thread
From: Minchan Kim @ 2017-08-09  1:51 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: Jens Axboe, jack, linux-nvdimm, Dave Chinner, linux-kernel,
	Christoph Hellwig, linux-mm, kernel-team, seungho1.park,
	karam . lee, Andrew Morton

On Tue, Aug 08, 2017 at 06:29:04AM -0700, Matthew Wilcox wrote:
> On Tue, Aug 08, 2017 at 05:49:59AM -0700, Matthew Wilcox wrote:
> > +	struct bio sbio;
> > +	struct bio_vec sbvec;
> 
> ... this needs to be sbvec[nr_pages], of course.
> 
> > -		bio = mpage_alloc(bdev, blocks[0] << (blkbits - 9),
> > +		if (bdi_cap_synchronous_io(inode_to_bdi(inode))) {
> > +			bio = &sbio;
> > +			bio_init(bio, &sbvec, nr_pages);
> 
> ... and this needs to be 'sbvec', not '&sbvec'.

I don't get it why we need sbvec[nr_pages].
On-stack-bio works with per-page.
May I miss something?

* Re: [PATCH v1 2/6] fs: use on-stack-bio if backing device has BDI_CAP_SYNC capability
  2017-08-09  1:51       ` Minchan Kim
@ 2017-08-09  2:31         ` Matthew Wilcox
  2017-08-09  2:41           ` Minchan Kim
  0 siblings, 1 reply; 33+ messages in thread
From: Matthew Wilcox @ 2017-08-09  2:31 UTC (permalink / raw)
  To: Minchan Kim
  Cc: Andrew Morton, linux-kernel, linux-mm, Ross Zwisler, karam . lee,
	seungho1.park, Christoph Hellwig, Dan Williams, Dave Chinner,
	jack, Jens Axboe, Vishal Verma, linux-nvdimm, kernel-team

On Wed, Aug 09, 2017 at 10:51:13AM +0900, Minchan Kim wrote:
> On Tue, Aug 08, 2017 at 06:29:04AM -0700, Matthew Wilcox wrote:
> > On Tue, Aug 08, 2017 at 05:49:59AM -0700, Matthew Wilcox wrote:
> > > +	struct bio sbio;
> > > +	struct bio_vec sbvec;
> > 
> > ... this needs to be sbvec[nr_pages], of course.
> > 
> > > -		bio = mpage_alloc(bdev, blocks[0] << (blkbits - 9),
> > > +		if (bdi_cap_synchronous_io(inode_to_bdi(inode))) {
> > > +			bio = &sbio;
> > > +			bio_init(bio, &sbvec, nr_pages);
> > 
> > ... and this needs to be 'sbvec', not '&sbvec'.
> 
> I don't get it why we need sbvec[nr_pages].
> On-stack-bio works with per-page.
> May I miss something?

The way I redid it, it will work with an arbitrary number of pages.


* Re: [PATCH v1 2/6] fs: use on-stack-bio if backing device has BDI_CAP_SYNC capability
  2017-08-09  2:31         ` Matthew Wilcox
@ 2017-08-09  2:41           ` Minchan Kim
  2017-08-10  3:04             ` Matthew Wilcox
  0 siblings, 1 reply; 33+ messages in thread
From: Minchan Kim @ 2017-08-09  2:41 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: Andrew Morton, linux-kernel, linux-mm, Ross Zwisler, karam . lee,
	seungho1.park, Christoph Hellwig, Dan Williams, Dave Chinner,
	jack, Jens Axboe, Vishal Verma, linux-nvdimm, kernel-team

On Tue, Aug 08, 2017 at 07:31:22PM -0700, Matthew Wilcox wrote:
> On Wed, Aug 09, 2017 at 10:51:13AM +0900, Minchan Kim wrote:
> > On Tue, Aug 08, 2017 at 06:29:04AM -0700, Matthew Wilcox wrote:
> > > On Tue, Aug 08, 2017 at 05:49:59AM -0700, Matthew Wilcox wrote:
> > > > +	struct bio sbio;
> > > > +	struct bio_vec sbvec;
> > > 
> > > ... this needs to be sbvec[nr_pages], of course.
> > > 
> > > > -		bio = mpage_alloc(bdev, blocks[0] << (blkbits - 9),
> > > > +		if (bdi_cap_synchronous_io(inode_to_bdi(inode))) {
> > > > +			bio = &sbio;
> > > > +			bio_init(bio, &sbvec, nr_pages);
> > > 
> > > ... and this needs to be 'sbvec', not '&sbvec'.
> > 
> > I don't get it why we need sbvec[nr_pages].
> > On-stack-bio works with per-page.
> > May I miss something?
> 
> The way I redid it, it will work with an arbitrary number of pages.

IIUC, that would be a good thing with dynamic bio allocation, where the
allocated bio is passed back and forth, but an on-stack bio cannot work
like that. It has to be done per page, so is it worth it?


* Re: [PATCH v1 2/6] fs: use on-stack-bio if backing device has BDI_CAP_SYNC capability
  2017-08-09  2:41           ` Minchan Kim
@ 2017-08-10  3:04             ` Matthew Wilcox
  2017-08-10  3:06               ` Dan Williams
  2017-08-10  4:00               ` Minchan Kim
  0 siblings, 2 replies; 33+ messages in thread
From: Matthew Wilcox @ 2017-08-10  3:04 UTC (permalink / raw)
  To: Minchan Kim
  Cc: Andrew Morton, linux-kernel, linux-mm, Ross Zwisler, karam . lee,
	seungho1.park, Christoph Hellwig, Dan Williams, Dave Chinner,
	jack, Jens Axboe, Vishal Verma, linux-nvdimm, kernel-team

On Wed, Aug 09, 2017 at 11:41:50AM +0900, Minchan Kim wrote:
> On Tue, Aug 08, 2017 at 07:31:22PM -0700, Matthew Wilcox wrote:
> > On Wed, Aug 09, 2017 at 10:51:13AM +0900, Minchan Kim wrote:
> > > On Tue, Aug 08, 2017 at 06:29:04AM -0700, Matthew Wilcox wrote:
> > > > On Tue, Aug 08, 2017 at 05:49:59AM -0700, Matthew Wilcox wrote:
> > > > > +	struct bio sbio;
> > > > > +	struct bio_vec sbvec;
> > > > 
> > > > ... this needs to be sbvec[nr_pages], of course.
> > > > 
> > > > > -		bio = mpage_alloc(bdev, blocks[0] << (blkbits - 9),
> > > > > +		if (bdi_cap_synchronous_io(inode_to_bdi(inode))) {
> > > > > +			bio = &sbio;
> > > > > +			bio_init(bio, &sbvec, nr_pages);
> > > > 
> > > > ... and this needs to be 'sbvec', not '&sbvec'.
> > > 
> > > I don't get it why we need sbvec[nr_pages].
> > > On-stack-bio works with per-page.
> > > May I miss something?
> > 
> > The way I redid it, it will work with an arbitrary number of pages.
> 
> IIUC, it would be good things with dynamic bio alloction with passing
> allocated bio back and forth but on-stack bio cannot work like that.
> It should be done in per-page so it is worth?

I'm not passing the bio back and forth between do_mpage_readpage() and
its callers.  The version I sent allows for multiple pages in a single
on-stack bio (when called from mpage_readpages()).


* Re: [PATCH v1 2/6] fs: use on-stack-bio if backing device has BDI_CAP_SYNC capability
  2017-08-10  3:04             ` Matthew Wilcox
@ 2017-08-10  3:06               ` Dan Williams
  2017-08-11 10:46                 ` Christoph Hellwig
  2017-08-10  4:00               ` Minchan Kim
  1 sibling, 1 reply; 33+ messages in thread
From: Dan Williams @ 2017-08-10  3:06 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: Minchan Kim, Andrew Morton, Linux Kernel Mailing List, linux-mm,
	Ross Zwisler, karam . lee, seungho1.park, Christoph Hellwig,
	Dave Chinner, Jan Kara, Jens Axboe, Vishal Verma, linux-nvdimm,
	kernel-team

On Wed, Aug 9, 2017 at 8:04 PM, Matthew Wilcox <willy@infradead.org> wrote:
> On Wed, Aug 09, 2017 at 11:41:50AM +0900, Minchan Kim wrote:
>> On Tue, Aug 08, 2017 at 07:31:22PM -0700, Matthew Wilcox wrote:
>> > On Wed, Aug 09, 2017 at 10:51:13AM +0900, Minchan Kim wrote:
>> > > On Tue, Aug 08, 2017 at 06:29:04AM -0700, Matthew Wilcox wrote:
>> > > > On Tue, Aug 08, 2017 at 05:49:59AM -0700, Matthew Wilcox wrote:
>> > > > > +     struct bio sbio;
>> > > > > +     struct bio_vec sbvec;
>> > > >
>> > > > ... this needs to be sbvec[nr_pages], of course.
>> > > >
>> > > > > -             bio = mpage_alloc(bdev, blocks[0] << (blkbits - 9),
>> > > > > +             if (bdi_cap_synchronous_io(inode_to_bdi(inode))) {
>> > > > > +                     bio = &sbio;
>> > > > > +                     bio_init(bio, &sbvec, nr_pages);
>> > > >
>> > > > ... and this needs to be 'sbvec', not '&sbvec'.
>> > >
>> > > I don't get it why we need sbvec[nr_pages].
>> > > On-stack-bio works with per-page.
>> > > May I miss something?
>> >
>> > The way I redid it, it will work with an arbitrary number of pages.
>>
>> IIUC, it would be good things with dynamic bio alloction with passing
>> allocated bio back and forth but on-stack bio cannot work like that.
>> It should be done in per-page so it is worth?
>
> I'm not passing the bio back and forth between do_mpage_readpage() and
> its callers.  The version I sent allows for multiple pages in a single
> on-stack bio (when called from mpage_readpages()).

I like it, but do you think we should switch to sbvec[<constant>] to
preclude pathological cases where nr_pages is large?


* Re: [PATCH v1 2/6] fs: use on-stack-bio if backing device has BDI_CAP_SYNC capability
  2017-08-10  3:04             ` Matthew Wilcox
  2017-08-10  3:06               ` Dan Williams
@ 2017-08-10  4:00               ` Minchan Kim
  1 sibling, 0 replies; 33+ messages in thread
From: Minchan Kim @ 2017-08-10  4:00 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: Andrew Morton, linux-kernel, linux-mm, Ross Zwisler, karam . lee,
	seungho1.park, Christoph Hellwig, Dan Williams, Dave Chinner,
	jack, Jens Axboe, Vishal Verma, linux-nvdimm, kernel-team

On Wed, Aug 09, 2017 at 08:04:33PM -0700, Matthew Wilcox wrote:
> On Wed, Aug 09, 2017 at 11:41:50AM +0900, Minchan Kim wrote:
> > On Tue, Aug 08, 2017 at 07:31:22PM -0700, Matthew Wilcox wrote:
> > > On Wed, Aug 09, 2017 at 10:51:13AM +0900, Minchan Kim wrote:
> > > > On Tue, Aug 08, 2017 at 06:29:04AM -0700, Matthew Wilcox wrote:
> > > > > On Tue, Aug 08, 2017 at 05:49:59AM -0700, Matthew Wilcox wrote:
> > > > > > +	struct bio sbio;
> > > > > > +	struct bio_vec sbvec;
> > > > > 
> > > > > ... this needs to be sbvec[nr_pages], of course.
> > > > > 
> > > > > > -		bio = mpage_alloc(bdev, blocks[0] << (blkbits - 9),
> > > > > > +		if (bdi_cap_synchronous_io(inode_to_bdi(inode))) {
> > > > > > +			bio = &sbio;
> > > > > > +			bio_init(bio, &sbvec, nr_pages);
> > > > > 
> > > > > ... and this needs to be 'sbvec', not '&sbvec'.
> > > > 
> > > > I don't get it why we need sbvec[nr_pages].
> > > > On-stack-bio works with per-page.
> > > > May I miss something?
> > > 
> > > The way I redid it, it will work with an arbitrary number of pages.
> > 
> > IIUC, it would be good things with dynamic bio alloction with passing
> > allocated bio back and forth but on-stack bio cannot work like that.
> > It should be done in per-page so it is worth?
> 
> I'm not passing the bio back and forth between do_mpage_readpage() and
> its callers.  The version I sent allows for multiple pages in a single
> on-stack bio (when called from mpage_readpages()).

I'm confused. I want to confirm your intent before respinning.
Please correct me if I'm missing something.

The version you sent me used the on-stack bio within do_mpage_readpage,
which is why I said sbvec[nr_pages] would be pointless: that function
works on a per-page basis unless we use dynamic bio allocation.

But I guess you are now suggesting using the on-stack bio in
mpage_readpages, so a single on-stack bio on mpage_readpages's stack
can batch multiple pages in its bvecs.

Right?


* Re: [PATCH v1 2/6] fs: use on-stack-bio if backing device has BDI_CAP_SYNC capability
  2017-08-10  3:06               ` Dan Williams
@ 2017-08-11 10:46                 ` Christoph Hellwig
  2017-08-11 14:26                   ` Jens Axboe
  2017-08-14  8:48                   ` Minchan Kim
  0 siblings, 2 replies; 33+ messages in thread
From: Christoph Hellwig @ 2017-08-11 10:46 UTC (permalink / raw)
  To: Dan Williams
  Cc: Jens Axboe, Jan Kara, linux-mm, linux-nvdimm, Dave Chinner,
	Linux Kernel Mailing List, Matthew Wilcox, Christoph Hellwig,
	Minchan Kim, kernel-team, seungho1.park, Andrew Morton,
	karam . lee

On Wed, Aug 09, 2017 at 08:06:24PM -0700, Dan Williams wrote:
> I like it, but do you think we should switch to sbvec[<constant>] to
> preclude pathological cases where nr_pages is large?

Yes, please.

Then I'd like to see evidence that the on-stack bio even matters for
mpage_readpage / mpage_writepage.  Compared to all the buffer head
overhead, the bio allocation should not actually matter in practice.

* Re: [PATCH v1 2/6] fs: use on-stack-bio if backing device has BDI_CAP_SYNC capability
  2017-08-11 10:46                 ` Christoph Hellwig
@ 2017-08-11 14:26                   ` Jens Axboe
  2017-08-14  8:50                     ` Minchan Kim
  2017-08-14  8:48                   ` Minchan Kim
  1 sibling, 1 reply; 33+ messages in thread
From: Jens Axboe @ 2017-08-11 14:26 UTC (permalink / raw)
  To: Christoph Hellwig, Dan Williams
  Cc: Matthew Wilcox, Minchan Kim, Andrew Morton,
	Linux Kernel Mailing List, linux-mm, Ross Zwisler, karam . lee,
	seungho1.park, Dave Chinner, Jan Kara, Vishal Verma,
	linux-nvdimm, kernel-team

On 08/11/2017 04:46 AM, Christoph Hellwig wrote:
> On Wed, Aug 09, 2017 at 08:06:24PM -0700, Dan Williams wrote:
>> I like it, but do you think we should switch to sbvec[<constant>] to
>> preclude pathological cases where nr_pages is large?
> 
> Yes, please.
> 
> Then I'd like to see that the on-stack bio even matters for
> mpage_readpage / mpage_writepage.  Compared to all the buffer head
> overhead the bio allocation should not actually matter in practice.

I'm skeptical about that path, too. I also wonder how far we could go
with just doing a per-cpu bio recycling facility, to reduce the cost
of having to allocate a bio. The on-stack bio parts are fine for the
simple use case, where simple means that the patch just special-cases
the allocation and doesn't have to change much else.

I had a patch for bio recycling and batched freeing a year or two
ago, I'll see if I can find and resurrect it.

-- 
Jens Axboe


* Re: [PATCH v1 2/6] fs: use on-stack-bio if backing device has BDI_CAP_SYNC capability
  2017-08-11 10:46                 ` Christoph Hellwig
  2017-08-11 14:26                   ` Jens Axboe
@ 2017-08-14  8:48                   ` Minchan Kim
  1 sibling, 0 replies; 33+ messages in thread
From: Minchan Kim @ 2017-08-14  8:48 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Jens Axboe, Jan Kara, Andrew Morton, linux-nvdimm, Dave Chinner,
	Linux Kernel Mailing List, Matthew Wilcox, linux-mm, kernel-team,
	seungho1.park, karam . lee

Hi Christoph,

On Fri, Aug 11, 2017 at 12:46:15PM +0200, Christoph Hellwig wrote:
> On Wed, Aug 09, 2017 at 08:06:24PM -0700, Dan Williams wrote:
> > I like it, but do you think we should switch to sbvec[<constant>] to
> > preclude pathological cases where nr_pages is large?
> 
> Yes, please.

Still, I don't understand how sbvec[nr_pages] with an on-stack bio in
do_mpage_readpage can help performance.

IIUC, do_mpage_readpage works on a per-page basis. IOW, it handles just
one page, not multiple pages, so if we use an on-stack bio, we just add
*a single page* via bio_add_page and submit the bio before the function
returns.

So, rather than sbvec[1], why do we need sbvec[nr_pages]?
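
To show what I mean, the per-page shape would be roughly this sketch
(illustration only, not the code that was posted; the helper name is
made up and it uses the same headers as the earlier sketch):

static int readpage_onstack(struct block_device *bdev, sector_t sector,
                            struct page *page)
{
        struct bio sbio;
        struct bio_vec sbvec;   /* effectively sbvec[1] */
        int ret;

        bio_init(&sbio, &sbvec, 1);
        sbio.bi_bdev = bdev;
        sbio.bi_iter.bi_sector = sector;
        bio_set_op_attrs(&sbio, REQ_OP_READ, 0);
        bio_add_page(&sbio, page, PAGE_SIZE, 0);

        /* One page in, submit and wait before returning. */
        ret = submit_bio_wait(&sbio);
        if (!ret)
                SetPageUptodate(page);
        unlock_page(page);
        return ret;
}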

Please, let me open my eyes. :)

Thanks.

* Re: [PATCH v1 2/6] fs: use on-stack-bio if backing device has BDI_CAP_SYNC capability
  2017-08-11 14:26                   ` Jens Axboe
@ 2017-08-14  8:50                     ` Minchan Kim
  2017-08-14 14:36                       ` Jens Axboe
  0 siblings, 1 reply; 33+ messages in thread
From: Minchan Kim @ 2017-08-14  8:50 UTC (permalink / raw)
  To: Jens Axboe
  Cc: kernel-team, Jan Kara, linux-nvdimm, Dave Chinner,
	Linux Kernel Mailing List, Matthew Wilcox, Christoph Hellwig,
	linux-mm, seungho1.park, Andrew Morton, karam . lee

Hi Jens,

On Fri, Aug 11, 2017 at 08:26:59AM -0600, Jens Axboe wrote:
> On 08/11/2017 04:46 AM, Christoph Hellwig wrote:
> > On Wed, Aug 09, 2017 at 08:06:24PM -0700, Dan Williams wrote:
> >> I like it, but do you think we should switch to sbvec[<constant>] to
> >> preclude pathological cases where nr_pages is large?
> > 
> > Yes, please.
> > 
> > Then I'd like to see that the on-stack bio even matters for
> > mpage_readpage / mpage_writepage.  Compared to all the buffer head
> > overhead the bio allocation should not actually matter in practice.
> 
> I'm skeptical for that path, too. I also wonder how far we could go
> with just doing a per-cpu bio recycling facility, to reduce the cost
> of having to allocate a bio. The on-stack bio parts are fine for
> simple use case, where simple means that the patch just special
> cases the allocation, and doesn't have to change much else.
> 
> I had a patch for bio recycling and batched freeing a year or two
> ago, I'll see if I can find and resurrect it.

So you want to go with the per-cpu bio recycling approach to
remove rw_page?

If so, do you want me to hold this patchset?

* Re: [PATCH v1 2/6] fs: use on-stack-bio if backing device has BDI_CAP_SYNC capability
  2017-08-14  8:50                     ` Minchan Kim
@ 2017-08-14 14:36                       ` Jens Axboe
  2017-08-14 15:06                         ` Minchan Kim
  0 siblings, 1 reply; 33+ messages in thread
From: Jens Axboe @ 2017-08-14 14:36 UTC (permalink / raw)
  To: Minchan Kim
  Cc: Christoph Hellwig, Dan Williams, Matthew Wilcox, Andrew Morton,
	Linux Kernel Mailing List, linux-mm, Ross Zwisler, karam . lee,
	seungho1.park, Dave Chinner, Jan Kara, Vishal Verma,
	linux-nvdimm, kernel-team

On 08/14/2017 02:50 AM, Minchan Kim wrote:
> Hi Jens,
> 
> On Fri, Aug 11, 2017 at 08:26:59AM -0600, Jens Axboe wrote:
>> On 08/11/2017 04:46 AM, Christoph Hellwig wrote:
>>> On Wed, Aug 09, 2017 at 08:06:24PM -0700, Dan Williams wrote:
>>>> I like it, but do you think we should switch to sbvec[<constant>] to
>>>> preclude pathological cases where nr_pages is large?
>>>
>>> Yes, please.
>>>
>>> Then I'd like to see that the on-stack bio even matters for
>>> mpage_readpage / mpage_writepage.  Compared to all the buffer head
>>> overhead the bio allocation should not actually matter in practice.
>>
>> I'm skeptical for that path, too. I also wonder how far we could go
>> with just doing a per-cpu bio recycling facility, to reduce the cost
>> of having to allocate a bio. The on-stack bio parts are fine for
>> simple use case, where simple means that the patch just special
>> cases the allocation, and doesn't have to change much else.
>>
>> I had a patch for bio recycling and batched freeing a year or two
>> ago, I'll see if I can find and resurrect it.
> 
> So, you want to go with per-cpu bio recycling approach to
> remove rw_page?
> 
> So, do you want me to hold this patchset?

I don't want to hold this series up, but I do think the recycling is
a cleaner approach since we don't need to special case anything. I
hope I'll get some time to dust it off, retest, and post soon.

-- 
Jens Axboe


* Re: [PATCH v1 2/6] fs: use on-stack-bio if backing device has BDI_CAP_SYNC capability
  2017-08-14 14:36                       ` Jens Axboe
@ 2017-08-14 15:06                         ` Minchan Kim
  2017-08-14 15:14                           ` Jens Axboe
  0 siblings, 1 reply; 33+ messages in thread
From: Minchan Kim @ 2017-08-14 15:06 UTC (permalink / raw)
  To: Jens Axboe
  Cc: kernel-team, Jan Kara, linux-nvdimm, Dave Chinner,
	Linux Kernel Mailing List, Matthew Wilcox, Christoph Hellwig,
	linux-mm, seungho1.park, Andrew Morton, karam . lee

On Mon, Aug 14, 2017 at 08:36:00AM -0600, Jens Axboe wrote:
> On 08/14/2017 02:50 AM, Minchan Kim wrote:
> > Hi Jens,
> > 
> > On Fri, Aug 11, 2017 at 08:26:59AM -0600, Jens Axboe wrote:
> >> On 08/11/2017 04:46 AM, Christoph Hellwig wrote:
> >>> On Wed, Aug 09, 2017 at 08:06:24PM -0700, Dan Williams wrote:
> >>>> I like it, but do you think we should switch to sbvec[<constant>] to
> >>>> preclude pathological cases where nr_pages is large?
> >>>
> >>> Yes, please.
> >>>
> >>> Then I'd like to see that the on-stack bio even matters for
> >>> mpage_readpage / mpage_writepage.  Compared to all the buffer head
> >>> overhead the bio allocation should not actually matter in practice.
> >>
> >> I'm skeptical for that path, too. I also wonder how far we could go
> >> with just doing a per-cpu bio recycling facility, to reduce the cost
> >> of having to allocate a bio. The on-stack bio parts are fine for
> >> simple use case, where simple means that the patch just special
> >> cases the allocation, and doesn't have to change much else.
> >>
> >> I had a patch for bio recycling and batched freeing a year or two
> >> ago, I'll see if I can find and resurrect it.
> > 
> > So, you want to go with per-cpu bio recycling approach to
> > remove rw_page?
> > 
> > So, do you want me to hold this patchset?
> 
> I don't want to hold this series up, but I do think the recycling is
> a cleaner approach since we don't need to special case anything. I
> hope I'll get some time to dust it off, retest, and post soon.

I don't know how your bio recycling works, but my worry when I first
heard about per-cpu bio recycling is that if it is not a reserved pool
for BDI_CAP_SYNCHRONOUS devices (IOW, if it is shared by several
storage devices), bios can be consumed by a slow device (e.g., eMMC),
so a bio for the fastest device in the system (e.g., zram on an
embedded system) can get stuck waiting for a bio until the IO for the
slow device completes.

I guess that would not be a rare case for a swap device under severe
memory pressure, because most of the page cache has already been
reclaimed by the time anonymous pages start to be reclaimed, so many
bios can be consumed by eMMC to fetch code while swap IO to fetch heap
data gets stuck, even though zram-swap is much faster than eMMC.
Likewise, time spent waiting for a bio even among the fastest devices
is simply wasted, I guess.

To me, the on-stack bio suggested by Christoph Hellwig doesn't diverge
from the current path much and is simple enough to change.

Anyway, I'm okay with either way if we can remove rw_page without any
regression, because maintaining both rw_page and make_request is
rather a burden for zram, too.

* Re: [PATCH v1 2/6] fs: use on-stack-bio if backing device has BDI_CAP_SYNC capability
  2017-08-14 15:06                         ` Minchan Kim
@ 2017-08-14 15:14                           ` Jens Axboe
  2017-08-14 15:31                             ` Minchan Kim
  0 siblings, 1 reply; 33+ messages in thread
From: Jens Axboe @ 2017-08-14 15:14 UTC (permalink / raw)
  To: Minchan Kim
  Cc: kernel-team, Jan Kara, linux-nvdimm, Dave Chinner,
	Linux Kernel Mailing List, Matthew Wilcox, Christoph Hellwig,
	linux-mm, seungho1.park, Andrew Morton, karam . lee

On 08/14/2017 09:06 AM, Minchan Kim wrote:
> On Mon, Aug 14, 2017 at 08:36:00AM -0600, Jens Axboe wrote:
>> On 08/14/2017 02:50 AM, Minchan Kim wrote:
>>> Hi Jens,
>>>
>>> On Fri, Aug 11, 2017 at 08:26:59AM -0600, Jens Axboe wrote:
>>>> On 08/11/2017 04:46 AM, Christoph Hellwig wrote:
>>>>> On Wed, Aug 09, 2017 at 08:06:24PM -0700, Dan Williams wrote:
>>>>>> I like it, but do you think we should switch to sbvec[<constant>] to
>>>>>> preclude pathological cases where nr_pages is large?
>>>>>
>>>>> Yes, please.
>>>>>
>>>>> Then I'd like to see that the on-stack bio even matters for
>>>>> mpage_readpage / mpage_writepage.  Compared to all the buffer head
>>>>> overhead the bio allocation should not actually matter in practice.
>>>>
>>>> I'm skeptical for that path, too. I also wonder how far we could go
>>>> with just doing a per-cpu bio recycling facility, to reduce the cost
>>>> of having to allocate a bio. The on-stack bio parts are fine for
>>>> simple use case, where simple means that the patch just special
>>>> cases the allocation, and doesn't have to change much else.
>>>>
>>>> I had a patch for bio recycling and batched freeing a year or two
>>>> ago, I'll see if I can find and resurrect it.
>>>
>>> So, you want to go with per-cpu bio recycling approach to
>>> remove rw_page?
>>>
>>> So, do you want me to hold this patchset?
>>
>> I don't want to hold this series up, but I do think the recycling is
>> a cleaner approach since we don't need to special case anything. I
>> hope I'll get some time to dust it off, retest, and post soon.
> 
> I don't know how your bio recycling works. But my worry when I heard
> per-cpu bio recycling firstly is if it's not reserved pool for
> BDI_CAP_SYNCHRONOUS(IOW, if it is shared by several storages),
> BIOs can be consumed by slow device(e.g., eMMC) so that a bio for
> fastest device(e.g., zram in embedded system) in the system can be
> stucked to wait on bio until IO for slow deivce is completed.
> 
> I guess it would be a not rare case for swap device under severe
> memory pressure because lots of page cache are already reclaimed when
> anonymous page start to be reclaimed so that many BIOs can be consumed
> for eMMC to fetch code but swap IO to fetch heap data would be stucked
> although zram-swap is much faster than eMMC.
> As well, time to wait to get BIO among even fastest devices is
> simple waste, I guess.

I don't think that's a valid concern. First of all, for the recycling,
it's not like you have to wait on someone else using a recycled bio;
if one isn't there, you simply go to the regular bio allocator. There
is no waiting for a free bio. The idea is to make allocation faster,
since we can avoid going to the memory allocator in most cases, and to
speed up freeing as well, since we can do that in batches too.
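
A rough sketch of the allocation side, just to show the fallback (this
is not the code in my old patch; the names, the cache depth, and the
calling convention are invented for illustration):

#include <linux/bio.h>
#include <linux/percpu.h>

#define BIO_CACHE_SIZE 16       /* assumed per-cpu cache depth */

struct bio_cpu_cache {
        struct bio_list free;   /* bios parked on this CPU */
        unsigned int nr;
};

static DEFINE_PER_CPU(struct bio_cpu_cache, bio_cache);

/*
 * Convention assumed here: the matching free hook is called in place
 * of the final bio_put(), so every cached bio still holds one
 * reference.  Completion (irq) context and checking that a recycled
 * bio has enough bvecs for nr_vecs are ignored to keep this short.
 */
static struct bio *bio_cache_alloc(gfp_t gfp, unsigned int nr_vecs,
                                   struct bio_set *bs)
{
        struct bio_cpu_cache *cache = get_cpu_ptr(&bio_cache);
        struct bio *bio = bio_list_pop(&cache->free);

        if (bio) {
                cache->nr--;
                bio_reset(bio);         /* reuse: clear state, keep bvecs */
        }
        put_cpu_ptr(&bio_cache);

        /* Cache miss: nobody waits, fall straight back to the allocator. */
        if (!bio)
                bio = bio_alloc_bioset(gfp, nr_vecs, bs);
        return bio;
}

So a miss costs the same as today; a hit just skips the mempool.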

Secondly, generally you don't have slow devices and fast devices
intermingled when running workloads. That's the rare case.

> To me, bio suggested by Christoph Hellwig isn't diverge current
> path a lot and simple enough to change.

It doesn't diverge it a lot, but it does split it up.

> Anyway, I'm okay with either way if we can remove rw_page without
> any regression because the maintainance of both rw_page and
> make_request is rather burden for zram, too.

Agree, the ultimate goal of both is to eliminate the need for the
rw_page hack.

-- 
Jens Axboe


* Re: [PATCH v1 2/6] fs: use on-stack-bio if backing device has BDI_CAP_SYNC capability
  2017-08-14 15:14                           ` Jens Axboe
@ 2017-08-14 15:31                             ` Minchan Kim
  2017-08-14 15:38                               ` Jens Axboe
  0 siblings, 1 reply; 33+ messages in thread
From: Minchan Kim @ 2017-08-14 15:31 UTC (permalink / raw)
  To: Jens Axboe
  Cc: kernel-team, Jan Kara, linux-nvdimm, Dave Chinner,
	Linux Kernel Mailing List, Matthew Wilcox, Christoph Hellwig,
	linux-mm, seungho1.park, Andrew Morton, karam . lee

On Mon, Aug 14, 2017 at 09:14:03AM -0600, Jens Axboe wrote:
> On 08/14/2017 09:06 AM, Minchan Kim wrote:
> > On Mon, Aug 14, 2017 at 08:36:00AM -0600, Jens Axboe wrote:
> >> On 08/14/2017 02:50 AM, Minchan Kim wrote:
> >>> Hi Jens,
> >>>
> >>> On Fri, Aug 11, 2017 at 08:26:59AM -0600, Jens Axboe wrote:
> >>>> On 08/11/2017 04:46 AM, Christoph Hellwig wrote:
> >>>>> On Wed, Aug 09, 2017 at 08:06:24PM -0700, Dan Williams wrote:
> >>>>>> I like it, but do you think we should switch to sbvec[<constant>] to
> >>>>>> preclude pathological cases where nr_pages is large?
> >>>>>
> >>>>> Yes, please.
> >>>>>
> >>>>> Then I'd like to see that the on-stack bio even matters for
> >>>>> mpage_readpage / mpage_writepage.  Compared to all the buffer head
> >>>>> overhead the bio allocation should not actually matter in practice.
> >>>>
> >>>> I'm skeptical for that path, too. I also wonder how far we could go
> >>>> with just doing a per-cpu bio recycling facility, to reduce the cost
> >>>> of having to allocate a bio. The on-stack bio parts are fine for
> >>>> simple use case, where simple means that the patch just special
> >>>> cases the allocation, and doesn't have to change much else.
> >>>>
> >>>> I had a patch for bio recycling and batched freeing a year or two
> >>>> ago, I'll see if I can find and resurrect it.
> >>>
> >>> So, you want to go with per-cpu bio recycling approach to
> >>> remove rw_page?
> >>>
> >>> So, do you want me to hold this patchset?
> >>
> >> I don't want to hold this series up, but I do think the recycling is
> >> a cleaner approach since we don't need to special case anything. I
> >> hope I'll get some time to dust it off, retest, and post soon.
> > 
> > I don't know how your bio recycling works. But my worry when I heard
> > per-cpu bio recycling firstly is if it's not reserved pool for
> > BDI_CAP_SYNCHRONOUS(IOW, if it is shared by several storages),
> > BIOs can be consumed by slow device(e.g., eMMC) so that a bio for
> > fastest device(e.g., zram in embedded system) in the system can be
> > stucked to wait on bio until IO for slow deivce is completed.
> > 
> > I guess it would be a not rare case for swap device under severe
> > memory pressure because lots of page cache are already reclaimed when
> > anonymous page start to be reclaimed so that many BIOs can be consumed
> > for eMMC to fetch code but swap IO to fetch heap data would be stucked
> > although zram-swap is much faster than eMMC.
> > As well, time to wait to get BIO among even fastest devices is
> > simple waste, I guess.
> 
> I don't think that's a valid concern. First of all, for the recycling,
> it's not like you get to wait on someone else using a recycled bio,
> if it's not there you simply go to the regular bio allocator. There
> is no waiting for free. The idea is to have allocation be faster since
> we can avoid going to the memory allocator for most cases, and speed
> up freeing as well, since we can do that in batches too.

I doubt it performs that much better, because at the beginning of this
thread[1] Ross said that even with dynamic bio allocation and without
rw_page, there was no regression in some testing.
[1] http://lkml.kernel.org/r/<20170728165604.10455-1-ross.zwisler@linux.intel.com>

> 
> Secondly, generally you don't have slow devices and fast devices
> intermingled when running workloads. That's the rare case.

Not true. zram is a really popular swap device on embedded devices,
where some low-cost products have really poor, slow NAND compared to
lz4/lzo [de]compression.

> 
> > To me, bio suggested by Christoph Hellwig isn't diverge current
> > path a lot and simple enough to change.
> 
> It doesn't diverge it a lot, but it does split it up.
> 
> > Anyway, I'm okay with either way if we can remove rw_page without
> > any regression because the maintainance of both rw_page and
> > make_request is rather burden for zram, too.
> 
> Agree, the ultimate goal of both is to eliminate the need for the
> rw_page hack.

Yeb.

* Re: [PATCH v1 2/6] fs: use on-stack-bio if backing device has BDI_CAP_SYNC capability
  2017-08-14 15:31                             ` Minchan Kim
@ 2017-08-14 15:38                               ` Jens Axboe
  2017-08-14 16:17                                 ` Jens Axboe
  0 siblings, 1 reply; 33+ messages in thread
From: Jens Axboe @ 2017-08-14 15:38 UTC (permalink / raw)
  To: Minchan Kim
  Cc: kernel-team, Jan Kara, linux-nvdimm, Dave Chinner,
	Linux Kernel Mailing List, Matthew Wilcox, Christoph Hellwig,
	linux-mm, seungho1.park, Andrew Morton, karam . lee

On 08/14/2017 09:31 AM, Minchan Kim wrote:
>> Secondly, generally you don't have slow devices and fast devices
>> intermingled when running workloads. That's the rare case.
> 
> Not true. zRam is really popular swap for embedded devices where
> one of low cost product has a really poor slow nand compared to
> lz4/lzo [de]comression.

I guess that's true for some cases. But as I said earlier, the recycling
really doesn't care about this at all. They can happily coexist, and not
step on each other's toes.

-- 
Jens Axboe


* Re: [PATCH v1 2/6] fs: use on-stack-bio if backing device has BDI_CAP_SYNC capability
  2017-08-14 15:38                               ` Jens Axboe
@ 2017-08-14 16:17                                 ` Jens Axboe
  2017-08-16  4:48                                   ` Minchan Kim
  0 siblings, 1 reply; 33+ messages in thread
From: Jens Axboe @ 2017-08-14 16:17 UTC (permalink / raw)
  To: Minchan Kim
  Cc: kernel-team, Jan Kara, linux-nvdimm, Dave Chinner,
	Linux Kernel Mailing List, Matthew Wilcox, Christoph Hellwig,
	linux-mm, seungho1.park, Andrew Morton, karam . lee

On 08/14/2017 09:38 AM, Jens Axboe wrote:
> On 08/14/2017 09:31 AM, Minchan Kim wrote:
>>> Secondly, generally you don't have slow devices and fast devices
>>> intermingled when running workloads. That's the rare case.
>>
>> Not true. zRam is really popular swap for embedded devices where
>> one of low cost product has a really poor slow nand compared to
>> lz4/lzo [de]comression.
> 
> I guess that's true for some cases. But as I said earlier, the recycling
> really doesn't care about this at all. They can happily coexist, and not
> step on each others toes.

Dusted it off, result is here against -rc5:

http://git.kernel.dk/cgit/linux-block/log/?h=cpu-alloc-cache

I'd like to split the number of units we cache from the number of
units we free; right now they are both CPU_ALLOC_CACHE_SIZE. This
means that once we hit that count, we free all of them and then store
the one we were asked to free. That always keeps 1 local, but maybe
it'd make more sense to just free CPU_ALLOC_CACHE_SIZE/2 (or something
like that) so that we retain more than 1 per CPU in case an app is
preempted while sleeping for IO and the new task on that CPU then
issues IO as well. Probably minor.
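
Roughly, the free side with that split would look something like this
(again just a sketch with made-up names, reusing the bio_cpu_cache
from the earlier sketch, not what is actually in the branch):

/* Called where the final bio_put() would otherwise free the bio. */
static void bio_cache_free(struct bio *bio)
{
        struct bio_cpu_cache *cache = get_cpu_ptr(&bio_cache);

        if (cache->nr >= BIO_CACHE_SIZE) {
                /* Batch-free the oldest half instead of draining it all. */
                while (cache->nr > BIO_CACHE_SIZE / 2) {
                        bio_put(bio_list_pop(&cache->free));
                        cache->nr--;
                }
        }
        bio_list_add(&cache->free, bio);
        cache->nr++;
        put_cpu_ptr(&bio_cache);
}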

Ran a quick test on nullb0 with 32 sync readers. The test was O_DIRECT
on the block device, so I disabled the __blkdev_direct_IO_simple()
bypass. With the above branch, we get ~18.0M IOPS, and without we get
~14M IOPS. Both ran with iostats disabled, to avoid any interference
from that.

-- 
Jens Axboe


* Re: [PATCH v1 2/6] fs: use on-stack-bio if backing device has BDI_CAP_SYNC capability
  2017-08-14 16:17                                 ` Jens Axboe
@ 2017-08-16  4:48                                   ` Minchan Kim
  2017-08-16 15:56                                     ` Jens Axboe
  0 siblings, 1 reply; 33+ messages in thread
From: Minchan Kim @ 2017-08-16  4:48 UTC (permalink / raw)
  To: Jens Axboe
  Cc: kernel-team, Jan Kara, linux-nvdimm, Dave Chinner,
	Linux Kernel Mailing List, Matthew Wilcox, Christoph Hellwig,
	linux-mm, seungho1.park, Andrew Morton, karam . lee

Hi Jens,

On Mon, Aug 14, 2017 at 10:17:09AM -0600, Jens Axboe wrote:
> On 08/14/2017 09:38 AM, Jens Axboe wrote:
> > On 08/14/2017 09:31 AM, Minchan Kim wrote:
> >>> Secondly, generally you don't have slow devices and fast devices
> >>> intermingled when running workloads. That's the rare case.
> >>
> >> Not true. zRam is really popular swap for embedded devices where
> >> one of low cost product has a really poor slow nand compared to
> >> lz4/lzo [de]comression.
> > 
> > I guess that's true for some cases. But as I said earlier, the recycling
> > really doesn't care about this at all. They can happily coexist, and not
> > step on each others toes.
> 
> Dusted it off, result is here against -rc5:
> 
> http://git.kernel.dk/cgit/linux-block/log/?h=cpu-alloc-cache
> 
> I'd like to split the amount of units we cache and the amount of units
> we free, right now they are both CPU_ALLOC_CACHE_SIZE. This means that
> once we hit that count, we free all of the, and then store the one we
> were asked to free. That always keeps 1 local, but maybe it'd make more
> sense to cache just free CPU_ALLOC_CACHE_SIZE/2 (or something like that)
> so that we retain more than 1 per cpu in case and app preempts when
> sleeping for IO and the new task on that CPU then issues IO as well.
> Probably minor.
> 
> Ran a quick test on nullb0 with 32 sync readers. The test was O_DIRECT
> on the block device, so I disabled the __blkdev_direct_IO_simple()
> bypass. With the above branch, we get ~18.0M IOPS, and without we get
> ~14M IOPS. Both ran with iostats disabled, to avoid any interference
> from that.

Looks promising.
If bio recycling works well enough, I think we don't need to introduce
a new split in the path for the on-stack bio.
I will test your version on zram-swap!

Thanks.

* Re: [PATCH v1 2/6] fs: use on-stack-bio if backing device has BDI_CAP_SYNC capability
  2017-08-16  4:48                                   ` Minchan Kim
@ 2017-08-16 15:56                                     ` Jens Axboe
  2017-08-21  6:13                                       ` Minchan Kim
  0 siblings, 1 reply; 33+ messages in thread
From: Jens Axboe @ 2017-08-16 15:56 UTC (permalink / raw)
  To: Minchan Kim
  Cc: kernel-team, Jan Kara, linux-nvdimm, Dave Chinner,
	Linux Kernel Mailing List, Matthew Wilcox, Christoph Hellwig,
	linux-mm, seungho1.park, Andrew Morton, karam . lee

On 08/15/2017 10:48 PM, Minchan Kim wrote:
> Hi Jens,
> 
> On Mon, Aug 14, 2017 at 10:17:09AM -0600, Jens Axboe wrote:
>> On 08/14/2017 09:38 AM, Jens Axboe wrote:
>>> On 08/14/2017 09:31 AM, Minchan Kim wrote:
>>>>> Secondly, generally you don't have slow devices and fast devices
>>>>> intermingled when running workloads. That's the rare case.
>>>>
>>>> Not true. zRam is really popular swap for embedded devices where
>>>> one of low cost product has a really poor slow nand compared to
>>>> lz4/lzo [de]comression.
>>>
>>> I guess that's true for some cases. But as I said earlier, the recycling
>>> really doesn't care about this at all. They can happily coexist, and not
>>> step on each others toes.
>>
>> Dusted it off, result is here against -rc5:
>>
>> http://git.kernel.dk/cgit/linux-block/log/?h=cpu-alloc-cache
>>
>> I'd like to split the amount of units we cache and the amount of units
>> we free, right now they are both CPU_ALLOC_CACHE_SIZE. This means that
>> once we hit that count, we free all of the, and then store the one we
>> were asked to free. That always keeps 1 local, but maybe it'd make more
>> sense to cache just free CPU_ALLOC_CACHE_SIZE/2 (or something like that)
>> so that we retain more than 1 per cpu in case and app preempts when
>> sleeping for IO and the new task on that CPU then issues IO as well.
>> Probably minor.
>>
>> Ran a quick test on nullb0 with 32 sync readers. The test was O_DIRECT
>> on the block device, so I disabled the __blkdev_direct_IO_simple()
>> bypass. With the above branch, we get ~18.0M IOPS, and without we get
>> ~14M IOPS. Both ran with iostats disabled, to avoid any interference
>> from that.
> 
> Looks promising.
> If recycling bio works well enough, I think we don't need to introduce
> new split in the path for on-stack bio.
> I will test your version on zram-swap!

Thanks, let me know how it goes. It's quite possible that we'll need
a few further tweaks, but at least the basis should be there.

-- 
Jens Axboe


* Re: [PATCH v1 2/6] fs: use on-stack-bio if backing device has BDI_CAP_SYNC capability
  2017-08-16 15:56                                     ` Jens Axboe
@ 2017-08-21  6:13                                       ` Minchan Kim
  0 siblings, 0 replies; 33+ messages in thread
From: Minchan Kim @ 2017-08-21  6:13 UTC (permalink / raw)
  To: Jens Axboe
  Cc: kernel-team, Jan Kara, linux-nvdimm, Dave Chinner,
	Linux Kernel Mailing List, Matthew Wilcox, Christoph Hellwig,
	linux-mm, seungho1.park, Andrew Morton, karam . lee

Hi Jens,

On Wed, Aug 16, 2017 at 09:56:12AM -0600, Jens Axboe wrote:
> On 08/15/2017 10:48 PM, Minchan Kim wrote:
> > Hi Jens,
> > 
> > On Mon, Aug 14, 2017 at 10:17:09AM -0600, Jens Axboe wrote:
> >> On 08/14/2017 09:38 AM, Jens Axboe wrote:
> >>> On 08/14/2017 09:31 AM, Minchan Kim wrote:
> >>>>> Secondly, generally you don't have slow devices and fast devices
> >>>>> intermingled when running workloads. That's the rare case.
> >>>>
> >>>> Not true. zRam is really popular swap for embedded devices where
> >>>> one of low cost product has a really poor slow nand compared to
> >>>> lz4/lzo [de]comression.
> >>>
> >>> I guess that's true for some cases. But as I said earlier, the recycling
> >>> really doesn't care about this at all. They can happily coexist, and not
> >>> step on each others toes.
> >>
> >> Dusted it off, result is here against -rc5:
> >>
> >> http://git.kernel.dk/cgit/linux-block/log/?h=cpu-alloc-cache
> >>
> >> I'd like to split the amount of units we cache and the amount of units
> >> we free, right now they are both CPU_ALLOC_CACHE_SIZE. This means that
> >> once we hit that count, we free all of the, and then store the one we
> >> were asked to free. That always keeps 1 local, but maybe it'd make more
> >> sense to cache just free CPU_ALLOC_CACHE_SIZE/2 (or something like that)
> >> so that we retain more than 1 per cpu in case and app preempts when
> >> sleeping for IO and the new task on that CPU then issues IO as well.
> >> Probably minor.
> >>
> >> Ran a quick test on nullb0 with 32 sync readers. The test was O_DIRECT
> >> on the block device, so I disabled the __blkdev_direct_IO_simple()
> >> bypass. With the above branch, we get ~18.0M IOPS, and without we get
> >> ~14M IOPS. Both ran with iostats disabled, to avoid any interference
> >> from that.
> > 
> > Looks promising.
> > If recycling bio works well enough, I think we don't need to introduce
> > new split in the path for on-stack bio.
> > I will test your version on zram-swap!
> 
> Thanks, let me know how it goes. It's quite possible that we'll need
> a few further tweaks, but at least the basis should be there.

Sorry for my late reply.

I just finished the swap-in testing with zram-swap, which is critical
for latency.

For the testing, I made a memcg and put $NR_CPU (12 on my machine)
processes in it; each process consumes 1G, so the total is 12G, while
my system has 16GB of memory, so there was no global reclaim.
Then I run echo 1 > /mnt/memcg/group/force.empty to swap all pages
out; the programs then wait for my signal to swap in, and I send the
signal to every process to swap every page back in and measure the
elapsed time for the swap-in.

The value is the average time in usec to swap in 1G of pages for each
process; I repeated it 10 times and the stddev is very stable.

swapin:
base(with rw_page)      1100806.73(100.00%)
no-rw_page              1146856.95(104.18%)
Jens's pcp              1146910.00(104.19%)
onstack-bio             1114872.18(101.28%)

In my test, there is no difference between dynamic bio allocation
(i.e., no-rw_page) and the pcp approach, but onstack-bio is much
faster, so it's almost the same as rw_page.

The swapout test measures the elapsed time for "echo 1 > /mnt/memcg/test_group/force.empty",
so the unit is seconds.

swapout:
base(with rw_page)      7.72(100.00%)
no-rw_page              8.36(108.29%)
Jens's pcp              8.31(107.64%)
onstack-bio             8.19(106.09%)

rw_page's swapout is 6% or more faster than the others.

I tried pmbench with no memcg to see the performance under global
reclaim. I also ran a background IO job which reads data from an HDD.
The value is the average time in usec for a page access, so smaller is
better.

base(with rw_page)              14.42(100.00%)
no-rw_page                      15.66(108.60%)
Jens's pcp                      15.81(109.64%)
onstack-bio                     15.42(106.93%)

It's similar to the swapout test in memcg.
6% or more is not trivial, so I doubt we can remove rw_page
at this moment. :(

I will look into the details with perf.
If you have further optimizations or suggestions, feel free to
share them; I am happy to test them.

Thanks.

Thread overview: 33+ messages
2017-08-08  6:50 [PATCH v1 0/6] Remove rw_page Minchan Kim
2017-08-08  6:50 ` [PATCH v1 1/6] bdi: introduce BDI_CAP_SYNC Minchan Kim
2017-08-08  6:50 ` [PATCH v1 2/6] fs: use on-stack-bio if backing device has BDI_CAP_SYNC capability Minchan Kim
2017-08-08 12:49   ` Matthew Wilcox
2017-08-08 13:29     ` Matthew Wilcox
2017-08-09  1:51       ` Minchan Kim
2017-08-09  2:31         ` Matthew Wilcox
2017-08-09  2:41           ` Minchan Kim
2017-08-10  3:04             ` Matthew Wilcox
2017-08-10  3:06               ` Dan Williams
2017-08-11 10:46                 ` Christoph Hellwig
2017-08-11 14:26                   ` Jens Axboe
2017-08-14  8:50                     ` Minchan Kim
2017-08-14 14:36                       ` Jens Axboe
2017-08-14 15:06                         ` Minchan Kim
2017-08-14 15:14                           ` Jens Axboe
2017-08-14 15:31                             ` Minchan Kim
2017-08-14 15:38                               ` Jens Axboe
2017-08-14 16:17                                 ` Jens Axboe
2017-08-16  4:48                                   ` Minchan Kim
2017-08-16 15:56                                     ` Jens Axboe
2017-08-21  6:13                                       ` Minchan Kim
2017-08-14  8:48                   ` Minchan Kim
2017-08-10  4:00               ` Minchan Kim
2017-08-09  1:48     ` Minchan Kim
2017-08-08  6:50 ` [PATCH v1 3/6] mm:swap: remove end_swap_bio_write argument Minchan Kim
2017-08-08  6:50 ` [PATCH v1 4/6] mm:swap: use on-stack-bio for BDI_CAP_SYNC devices Minchan Kim
2017-08-08  6:50 ` [PATCH v1 5/6] zram: remove zram_rw_page Minchan Kim
2017-08-08  7:02   ` Sergey Senozhatsky
2017-08-08  8:13     ` Minchan Kim
2017-08-08  8:23       ` Sergey Senozhatsky
2017-08-08 15:48         ` Matthew Wilcox
2017-08-08  6:50 ` [PATCH v1 6/6] fs: remove rw_page Minchan Kim
