* remove ->rw_page
@ 2023-01-25 13:34 Christoph Hellwig
  2023-01-25 13:34 ` [PATCH 1/7] mpage: stop using bdev_{read,write}_page Christoph Hellwig
                   ` (7 more replies)
  0 siblings, 8 replies; 17+ messages in thread
From: Christoph Hellwig @ 2023-01-25 13:34 UTC (permalink / raw)
  To: Jens Axboe, Minchan Kim, Sergey Senozhatsky, Dan Williams,
	Vishal Verma, Dave Jiang, Ira Weiny
  Cc: Andrew Morton, linux-block, nvdimm, linux-fsdevel, linux-mm

Hi all,

this series removes the ->rw_page block_device_operation, which is an old
and clumsy attempt at a simple read/write fast path for the block layer.
It isn't actually used by the fastest block layer operations that we
support (polled I/O through io_uring), but only by the mpage buffered
I/O helpers, which are some of the slowest I/O paths we have and where
it makes no difference at all, and by zram, which is a block device
abused to duplicate the zswap functionality.  Given that zram is heavily
used, we need to make sure there is a good replacement for synchronous
I/O, so this series adds a new flag for drivers that complete I/O
synchronously and uses that flag to switch to on-stack bios and
synchronous submission for them in the swap code.
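
With that, the synchronous swap-in path boils down to roughly the
following sketch (condensed from patch 4/7; the PSWPIN accounting and
the task reference kept for the OOM killer check are left out):

static void swap_readpage_bdev_sync(struct page *page,
		struct swap_info_struct *sis)
{
	struct bio_vec bv;
	struct bio bio;

	/* single-segment bio on the stack, no allocation needed */
	bio_init(&bio, sis->bdev, &bv, 1, REQ_OP_READ);
	bio.bi_iter.bi_sector = swap_page_sector(page);
	bio_add_page(&bio, page, thp_size(page), 0);

	/* submit and wait for completion in the caller's context */
	submit_bio_wait(&bio);
	__end_swap_bio_read(&bio);
}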

Diffstat:
 block/bdev.c                  |   78 ------------------
 drivers/block/brd.c           |   15 ---
 drivers/block/zram/zram_drv.c |   61 --------------
 drivers/nvdimm/btt.c          |   16 ---
 drivers/nvdimm/pmem.c         |   24 -----
 fs/mpage.c                    |   10 --
 include/linux/blkdev.h        |   12 +-
 mm/page_io.c                  |  182 ++++++++++++++++++++++--------------------
 mm/swap.h                     |    9 --
 mm/swapfile.c                 |    2 
 10 files changed, 114 insertions(+), 295 deletions(-)


* [PATCH 1/7] mpage: stop using bdev_{read,write}_page
  2023-01-25 13:34 remove ->rw_page Christoph Hellwig
@ 2023-01-25 13:34 ` Christoph Hellwig
  2023-01-25 17:58   ` Dan Williams
  2023-01-25 13:34 ` [PATCH 2/7] mm: remove the swap_readpage return value Christoph Hellwig
                   ` (6 subsequent siblings)
  7 siblings, 1 reply; 17+ messages in thread
From: Christoph Hellwig @ 2023-01-25 13:34 UTC (permalink / raw)
  To: Jens Axboe, Minchan Kim, Sergey Senozhatsky, Dan Williams,
	Vishal Verma, Dave Jiang, Ira Weiny
  Cc: Andrew Morton, linux-block, nvdimm, linux-fsdevel, linux-mm

These are micro-optimizations for synchronous I/O, which do not matter
compared to all the other inefficiencies in the legacy buffer_head
based mpage code.

Signed-off-by: Christoph Hellwig <hch@lst.de>
---
 fs/mpage.c | 10 ----------
 1 file changed, 10 deletions(-)

diff --git a/fs/mpage.c b/fs/mpage.c
index 0f8ae954a57903..124550cfac4a70 100644
--- a/fs/mpage.c
+++ b/fs/mpage.c
@@ -269,11 +269,6 @@ static struct bio *do_mpage_readpage(struct mpage_readpage_args *args)
 
 alloc_new:
 	if (args->bio == NULL) {
-		if (first_hole == blocks_per_page) {
-			if (!bdev_read_page(bdev, blocks[0] << (blkbits - 9),
-								&folio->page))
-				goto out;
-		}
 		args->bio = bio_alloc(bdev, bio_max_segs(args->nr_pages), opf,
 				      gfp);
 		if (args->bio == NULL)
@@ -579,11 +574,6 @@ static int __mpage_writepage(struct page *page, struct writeback_control *wbc,
 
 alloc_new:
 	if (bio == NULL) {
-		if (first_unmapped == blocks_per_page) {
-			if (!bdev_write_page(bdev, blocks[0] << (blkbits - 9),
-								page, wbc))
-				goto out;
-		}
 		bio = bio_alloc(bdev, BIO_MAX_VECS,
 				REQ_OP_WRITE | wbc_to_write_flags(wbc),
 				GFP_NOFS);
-- 
2.39.0



* [PATCH 2/7] mm: remove the swap_readpage return value
  2023-01-25 13:34 remove ->rw_page Christoph Hellwig
  2023-01-25 13:34 ` [PATCH 1/7] mpage: stop using bdev_{read,write}_page Christoph Hellwig
@ 2023-01-25 13:34 ` Christoph Hellwig
  2023-01-25 15:58   ` Keith Busch
  2023-01-25 18:00   ` Dan Williams
  2023-01-25 13:34 ` [PATCH 3/7] mm: factor out a swap_readpage_bdev helper Christoph Hellwig
                   ` (5 subsequent siblings)
  7 siblings, 2 replies; 17+ messages in thread
From: Christoph Hellwig @ 2023-01-25 13:34 UTC (permalink / raw)
  To: Jens Axboe, Minchan Kim, Sergey Senozhatsky, Dan Williams,
	Vishal Verma, Dave Jiang, Ira Weiny
  Cc: Andrew Morton, linux-block, nvdimm, linux-fsdevel, linux-mm

swap_readpage always returns 0, and no caller checks the return value.

Signed-off-by: Christoph Hellwig <hch@lst.de>
---
 mm/page_io.c | 16 +++++-----------
 mm/swap.h    |  7 +++----
 2 files changed, 8 insertions(+), 15 deletions(-)

diff --git a/mm/page_io.c b/mm/page_io.c
index 3a5f921b932e82..6f7166fdc4b2bb 100644
--- a/mm/page_io.c
+++ b/mm/page_io.c
@@ -445,11 +445,9 @@ static void swap_readpage_fs(struct page *page,
 		*plug = sio;
 }
 
-int swap_readpage(struct page *page, bool synchronous,
-		  struct swap_iocb **plug)
+void swap_readpage(struct page *page, bool synchronous, struct swap_iocb **plug)
 {
 	struct bio *bio;
-	int ret = 0;
 	struct swap_info_struct *sis = page_swap_info(page);
 	bool workingset = PageWorkingset(page);
 	unsigned long pflags;
@@ -481,15 +479,12 @@ int swap_readpage(struct page *page, bool synchronous,
 		goto out;
 	}
 
-	if (sis->flags & SWP_SYNCHRONOUS_IO) {
-		ret = bdev_read_page(sis->bdev, swap_page_sector(page), page);
-		if (!ret) {
-			count_vm_event(PSWPIN);
-			goto out;
-		}
+	if ((sis->flags & SWP_SYNCHRONOUS_IO) &&
+	    !bdev_read_page(sis->bdev, swap_page_sector(page), page)) {
+		count_vm_event(PSWPIN);
+		goto out;
 	}
 
-	ret = 0;
 	bio = bio_alloc(sis->bdev, 1, REQ_OP_READ, GFP_KERNEL);
 	bio->bi_iter.bi_sector = swap_page_sector(page);
 	bio->bi_end_io = end_swap_bio_read;
@@ -521,7 +516,6 @@ int swap_readpage(struct page *page, bool synchronous,
 		psi_memstall_leave(&pflags);
 	}
 	delayacct_swapin_end();
-	return ret;
 }
 
 void __swap_read_unplug(struct swap_iocb *sio)
diff --git a/mm/swap.h b/mm/swap.h
index f78065c8ef524b..f5eb5069d28c2e 100644
--- a/mm/swap.h
+++ b/mm/swap.h
@@ -8,8 +8,7 @@
 /* linux/mm/page_io.c */
 int sio_pool_init(void);
 struct swap_iocb;
-int swap_readpage(struct page *page, bool do_poll,
-		  struct swap_iocb **plug);
+void swap_readpage(struct page *page, bool do_poll, struct swap_iocb **plug);
 void __swap_read_unplug(struct swap_iocb *plug);
 static inline void swap_read_unplug(struct swap_iocb *plug)
 {
@@ -64,8 +63,8 @@ static inline unsigned int folio_swap_flags(struct folio *folio)
 }
 #else /* CONFIG_SWAP */
 struct swap_iocb;
-static inline int swap_readpage(struct page *page, bool do_poll,
-				struct swap_iocb **plug)
+static inline void swap_readpage(struct page *page, bool do_poll,
+		struct swap_iocb **plug)
 {
 	return 0;
 }
-- 
2.39.0



* [PATCH 3/7] mm: factor out a swap_readpage_bdev helper
  2023-01-25 13:34 remove ->rw_page Christoph Hellwig
  2023-01-25 13:34 ` [PATCH 1/7] mpage: stop using bdev_{read,write}_page Christoph Hellwig
  2023-01-25 13:34 ` [PATCH 2/7] mm: remove the swap_readpage return value Christoph Hellwig
@ 2023-01-25 13:34 ` Christoph Hellwig
  2023-01-25 18:30   ` Dan Williams
  2023-01-25 13:34 ` [PATCH 4/7] mm: use an on-stack bio for synchronous swapin Christoph Hellwig
                   ` (4 subsequent siblings)
  7 siblings, 1 reply; 17+ messages in thread
From: Christoph Hellwig @ 2023-01-25 13:34 UTC (permalink / raw)
  To: Jens Axboe, Minchan Kim, Sergey Senozhatsky, Dan Williams,
	Vishal Verma, Dave Jiang, Ira Weiny
  Cc: Andrew Morton, linux-block, nvdimm, linux-fsdevel, linux-mm

Split the block device case from swap_readpage into a separate helper,
following the abstraction for file based swap and frontswap.

Signed-off-by: Christoph Hellwig <hch@lst.de>
---
 mm/page_io.c | 68 +++++++++++++++++++++++++++-------------------------
 1 file changed, 35 insertions(+), 33 deletions(-)

diff --git a/mm/page_io.c b/mm/page_io.c
index 6f7166fdc4b2bb..ce0b3638094f85 100644
--- a/mm/page_io.c
+++ b/mm/page_io.c
@@ -445,44 +445,15 @@ static void swap_readpage_fs(struct page *page,
 		*plug = sio;
 }
 
-void swap_readpage(struct page *page, bool synchronous, struct swap_iocb **plug)
+static void swap_readpage_bdev(struct page *page, bool synchronous,
+		struct swap_info_struct *sis)
 {
 	struct bio *bio;
-	struct swap_info_struct *sis = page_swap_info(page);
-	bool workingset = PageWorkingset(page);
-	unsigned long pflags;
-	bool in_thrashing;
-
-	VM_BUG_ON_PAGE(!PageSwapCache(page) && !synchronous, page);
-	VM_BUG_ON_PAGE(!PageLocked(page), page);
-	VM_BUG_ON_PAGE(PageUptodate(page), page);
-
-	/*
-	 * Count submission time as memory stall and delay. When the device
-	 * is congested, or the submitting cgroup IO-throttled, submission
-	 * can be a significant part of overall IO time.
-	 */
-	if (workingset) {
-		delayacct_thrashing_start(&in_thrashing);
-		psi_memstall_enter(&pflags);
-	}
-	delayacct_swapin_start();
-
-	if (frontswap_load(page) == 0) {
-		SetPageUptodate(page);
-		unlock_page(page);
-		goto out;
-	}
-
-	if (data_race(sis->flags & SWP_FS_OPS)) {
-		swap_readpage_fs(page, plug);
-		goto out;
-	}
 
 	if ((sis->flags & SWP_SYNCHRONOUS_IO) &&
 	    !bdev_read_page(sis->bdev, swap_page_sector(page), page)) {
 		count_vm_event(PSWPIN);
-		goto out;
+		return;
 	}
 
 	bio = bio_alloc(sis->bdev, 1, REQ_OP_READ, GFP_KERNEL);
@@ -509,8 +480,39 @@ void swap_readpage(struct page *page, bool synchronous, struct swap_iocb **plug)
 	}
 	__set_current_state(TASK_RUNNING);
 	bio_put(bio);
+}
+
+void swap_readpage(struct page *page, bool synchronous, struct swap_iocb **plug)
+{
+	struct swap_info_struct *sis = page_swap_info(page);
+	bool workingset = PageWorkingset(page);
+	unsigned long pflags;
+	bool in_thrashing;
+
+	VM_BUG_ON_PAGE(!PageSwapCache(page) && !synchronous, page);
+	VM_BUG_ON_PAGE(!PageLocked(page), page);
+	VM_BUG_ON_PAGE(PageUptodate(page), page);
+
+	/*
+	 * Count submission time as memory stall and delay. When the device
+	 * is congested, or the submitting cgroup IO-throttled, submission
+	 * can be a significant part of overall IO time.
+	 */
+	if (workingset) {
+		delayacct_thrashing_start(&in_thrashing);
+		psi_memstall_enter(&pflags);
+	}
+	delayacct_swapin_start();
+
+	if (frontswap_load(page) == 0) {
+		SetPageUptodate(page);
+		unlock_page(page);
+	} else if (data_race(sis->flags & SWP_FS_OPS)) {
+		swap_readpage_fs(page, plug);
+	} else {
+		swap_readpage_bdev(page, synchronous, sis);
+	}
 
-out:
 	if (workingset) {
 		delayacct_thrashing_end(&in_thrashing);
 		psi_memstall_leave(&pflags);
-- 
2.39.0



* [PATCH 4/7] mm: use an on-stack bio for synchronous swapin
  2023-01-25 13:34 remove ->rw_page Christoph Hellwig
                   ` (2 preceding siblings ...)
  2023-01-25 13:34 ` [PATCH 3/7] mm: factor out a swap_readpage_bdev helper Christoph Hellwig
@ 2023-01-25 13:34 ` Christoph Hellwig
  2023-01-25 13:34 ` [PATCH 5/7] mm: remove the __swap_writepage return value Christoph Hellwig
                   ` (3 subsequent siblings)
  7 siblings, 0 replies; 17+ messages in thread
From: Christoph Hellwig @ 2023-01-25 13:34 UTC (permalink / raw)
  To: Jens Axboe, Minchan Kim, Sergey Senozhatsky, Dan Williams,
	Vishal Verma, Dave Jiang, Ira Weiny
  Cc: Andrew Morton, linux-block, nvdimm, linux-fsdevel, linux-mm

Optimize the synchronous swap-in case by using an on-stack bio instead
of allocating one using bio_alloc.

Signed-off-by: Christoph Hellwig <hch@lst.de>
---
 mm/page_io.c | 69 +++++++++++++++++++++++++++++-----------------------
 1 file changed, 38 insertions(+), 31 deletions(-)

diff --git a/mm/page_io.c b/mm/page_io.c
index ce0b3638094f85..21ce4505f00607 100644
--- a/mm/page_io.c
+++ b/mm/page_io.c
@@ -52,10 +52,9 @@ static void end_swap_bio_write(struct bio *bio)
 	bio_put(bio);
 }
 
-static void end_swap_bio_read(struct bio *bio)
+static void __end_swap_bio_read(struct bio *bio)
 {
 	struct page *page = bio_first_page_all(bio);
-	struct task_struct *waiter = bio->bi_private;
 
 	if (bio->bi_status) {
 		SetPageError(page);
@@ -63,18 +62,16 @@ static void end_swap_bio_read(struct bio *bio)
 		pr_alert_ratelimited("Read-error on swap-device (%u:%u:%llu)\n",
 				     MAJOR(bio_dev(bio)), MINOR(bio_dev(bio)),
 				     (unsigned long long)bio->bi_iter.bi_sector);
-		goto out;
+	} else {
+		SetPageUptodate(page);
 	}
-
-	SetPageUptodate(page);
-out:
 	unlock_page(page);
-	WRITE_ONCE(bio->bi_private, NULL);
+}
+
+static void end_swap_bio_read(struct bio *bio)
+{
+	__end_swap_bio_read(bio);
 	bio_put(bio);
-	if (waiter) {
-		blk_wake_io_task(waiter);
-		put_task_struct(waiter);
-	}
 }
 
 int generic_swapfile_activate(struct swap_info_struct *sis,
@@ -445,10 +442,11 @@ static void swap_readpage_fs(struct page *page,
 		*plug = sio;
 }
 
-static void swap_readpage_bdev(struct page *page, bool synchronous,
+static void swap_readpage_bdev_sync(struct page *page,
 		struct swap_info_struct *sis)
 {
-	struct bio *bio;
+	struct bio_vec bv;
+	struct bio bio;
 
 	if ((sis->flags & SWP_SYNCHRONOUS_IO) &&
 	    !bdev_read_page(sis->bdev, swap_page_sector(page), page)) {
@@ -456,30 +454,37 @@ static void swap_readpage_bdev(struct page *page, bool synchronous,
 		return;
 	}
 
-	bio = bio_alloc(sis->bdev, 1, REQ_OP_READ, GFP_KERNEL);
-	bio->bi_iter.bi_sector = swap_page_sector(page);
-	bio->bi_end_io = end_swap_bio_read;
-	bio_add_page(bio, page, thp_size(page), 0);
+	bio_init(&bio, sis->bdev, &bv, 1, REQ_OP_READ);
+	bio.bi_iter.bi_sector = swap_page_sector(page);
+	bio_add_page(&bio, page, thp_size(page), 0);
 	/*
 	 * Keep this task valid during swap readpage because the oom killer may
 	 * attempt to access it in the page fault retry time check.
 	 */
-	if (synchronous) {
-		get_task_struct(current);
-		bio->bi_private = current;
-	}
+	get_task_struct(current);
 	count_vm_event(PSWPIN);
-	bio_get(bio);
-	submit_bio(bio);
-	while (synchronous) {
-		set_current_state(TASK_UNINTERRUPTIBLE);
-		if (!READ_ONCE(bio->bi_private))
-			break;
+	submit_bio_wait(&bio);
+	__end_swap_bio_read(&bio);
+	put_task_struct(current);
+}
+
+static void swap_readpage_bdev_async(struct page *page,
+		struct swap_info_struct *sis)
+{
+	struct bio *bio;
 
-		blk_io_schedule();
+	if ((sis->flags & SWP_SYNCHRONOUS_IO) &&
+	    !bdev_read_page(sis->bdev, swap_page_sector(page), page)) {
+		count_vm_event(PSWPIN);
+		return;
 	}
-	__set_current_state(TASK_RUNNING);
-	bio_put(bio);
+
+	bio = bio_alloc(sis->bdev, 1, REQ_OP_READ, GFP_KERNEL);
+	bio->bi_iter.bi_sector = swap_page_sector(page);
+	bio->bi_end_io = end_swap_bio_read;
+	bio_add_page(bio, page, thp_size(page), 0);
+	count_vm_event(PSWPIN);
+	submit_bio(bio);
 }
 
 void swap_readpage(struct page *page, bool synchronous, struct swap_iocb **plug)
@@ -509,8 +514,10 @@ void swap_readpage(struct page *page, bool synchronous, struct swap_iocb **plug)
 		unlock_page(page);
 	} else if (data_race(sis->flags & SWP_FS_OPS)) {
 		swap_readpage_fs(page, plug);
+	} else if (synchronous) {
+		swap_readpage_bdev_sync(page, sis);
 	} else {
-		swap_readpage_bdev(page, synchronous, sis);
+		swap_readpage_bdev_async(page, sis);
 	}
 
 	if (workingset) {
-- 
2.39.0



* [PATCH 5/7] mm: remove the __swap_writepage return value
  2023-01-25 13:34 remove ->rw_page Christoph Hellwig
                   ` (3 preceding siblings ...)
  2023-01-25 13:34 ` [PATCH 4/7] mm: use an on-stack bio for synchronous swapin Christoph Hellwig
@ 2023-01-25 13:34 ` Christoph Hellwig
  2023-01-25 13:34 ` [PATCH 6/7] mm: factor out a swap_writepage_bdev helper Christoph Hellwig
                   ` (2 subsequent siblings)
  7 siblings, 0 replies; 17+ messages in thread
From: Christoph Hellwig @ 2023-01-25 13:34 UTC (permalink / raw)
  To: Jens Axboe, Minchan Kim, Sergey Senozhatsky, Dan Williams,
	Vishal Verma, Dave Jiang, Ira Weiny
  Cc: Andrew Morton, linux-block, nvdimm, linux-fsdevel, linux-mm

__swap_writepage always returns 0.

Signed-off-by: Christoph Hellwig <hch@lst.de>
---
 mm/page_io.c | 23 +++++++++--------------
 mm/swap.h    |  2 +-
 2 files changed, 10 insertions(+), 15 deletions(-)

diff --git a/mm/page_io.c b/mm/page_io.c
index 21ce4505f00607..c373d5694cdffd 100644
--- a/mm/page_io.c
+++ b/mm/page_io.c
@@ -178,11 +178,11 @@ int generic_swapfile_activate(struct swap_info_struct *sis,
 int swap_writepage(struct page *page, struct writeback_control *wbc)
 {
 	struct folio *folio = page_folio(page);
-	int ret = 0;
+	int ret;
 
 	if (folio_free_swap(folio)) {
 		folio_unlock(folio);
-		goto out;
+		return 0;
 	}
 	/*
 	 * Arch code may have to preserve more data than just the page
@@ -192,17 +192,16 @@ int swap_writepage(struct page *page, struct writeback_control *wbc)
 	if (ret) {
 		folio_mark_dirty(folio);
 		folio_unlock(folio);
-		goto out;
+		return ret;
 	}
 	if (frontswap_store(&folio->page) == 0) {
 		folio_start_writeback(folio);
 		folio_unlock(folio);
 		folio_end_writeback(folio);
-		goto out;
+		return 0;
 	}
-	ret = __swap_writepage(&folio->page, wbc);
-out:
-	return ret;
+	__swap_writepage(&folio->page, wbc);
+	return 0;
 }
 
 static inline void count_swpout_vm_event(struct page *page)
@@ -289,7 +288,7 @@ static void sio_write_complete(struct kiocb *iocb, long ret)
 	mempool_free(sio, sio_pool);
 }
 
-static int swap_writepage_fs(struct page *page, struct writeback_control *wbc)
+static void swap_writepage_fs(struct page *page, struct writeback_control *wbc)
 {
 	struct swap_iocb *sio = NULL;
 	struct swap_info_struct *sis = page_swap_info(page);
@@ -326,11 +325,9 @@ static int swap_writepage_fs(struct page *page, struct writeback_control *wbc)
 	}
 	if (wbc->swap_plug)
 		*wbc->swap_plug = sio;
-
-	return 0;
 }
 
-int __swap_writepage(struct page *page, struct writeback_control *wbc)
+void __swap_writepage(struct page *page, struct writeback_control *wbc)
 {
 	struct bio *bio;
 	int ret;
@@ -348,7 +345,7 @@ int __swap_writepage(struct page *page, struct writeback_control *wbc)
 	ret = bdev_write_page(sis->bdev, swap_page_sector(page), page, wbc);
 	if (!ret) {
 		count_swpout_vm_event(page);
-		return 0;
+		return;
 	}
 
 	bio = bio_alloc(sis->bdev, 1,
@@ -363,8 +360,6 @@ int __swap_writepage(struct page *page, struct writeback_control *wbc)
 	set_page_writeback(page);
 	unlock_page(page);
 	submit_bio(bio);
-
-	return 0;
 }
 
 void swap_write_unplug(struct swap_iocb *sio)
diff --git a/mm/swap.h b/mm/swap.h
index f5eb5069d28c2e..28be6cb3277fa4 100644
--- a/mm/swap.h
+++ b/mm/swap.h
@@ -17,7 +17,7 @@ static inline void swap_read_unplug(struct swap_iocb *plug)
 }
 void swap_write_unplug(struct swap_iocb *sio);
 int swap_writepage(struct page *page, struct writeback_control *wbc);
-int __swap_writepage(struct page *page, struct writeback_control *wbc);
+void __swap_writepage(struct page *page, struct writeback_control *wbc);
 
 /* linux/mm/swap_state.c */
 /* One swap address space for each 64M swap space */
-- 
2.39.0



* [PATCH 6/7] mm: factor out a swap_writepage_bdev helper
  2023-01-25 13:34 remove ->rw_page Christoph Hellwig
                   ` (4 preceding siblings ...)
  2023-01-25 13:34 ` [PATCH 5/7] mm: remove the __swap_writepage return value Christoph Hellwig
@ 2023-01-25 13:34 ` Christoph Hellwig
  2023-01-25 13:34 ` [PATCH 7/7] block: remove ->rw_page Christoph Hellwig
  2023-01-25 14:32 ` Jens Axboe
  7 siblings, 0 replies; 17+ messages in thread
From: Christoph Hellwig @ 2023-01-25 13:34 UTC (permalink / raw)
  To: Jens Axboe, Minchan Kim, Sergey Senozhatsky, Dan Williams,
	Vishal Verma, Dave Jiang, Ira Weiny
  Cc: Andrew Morton, linux-block, nvdimm, linux-fsdevel, linux-mm

Split the block device case from __swap_writepage into a separate helper,
following the abstraction for file based swap.

Signed-off-by: Christoph Hellwig <hch@lst.de>
---
 mm/page_io.c | 33 +++++++++++++++++++--------------
 1 file changed, 19 insertions(+), 14 deletions(-)

diff --git a/mm/page_io.c b/mm/page_io.c
index c373d5694cdffd..2ee2bfe5de0386 100644
--- a/mm/page_io.c
+++ b/mm/page_io.c
@@ -327,23 +327,12 @@ static void swap_writepage_fs(struct page *page, struct writeback_control *wbc)
 		*wbc->swap_plug = sio;
 }
 
-void __swap_writepage(struct page *page, struct writeback_control *wbc)
+static void swap_writepage_bdev(struct page *page,
+		struct writeback_control *wbc, struct swap_info_struct *sis)
 {
 	struct bio *bio;
-	int ret;
-	struct swap_info_struct *sis = page_swap_info(page);
-
-	VM_BUG_ON_PAGE(!PageSwapCache(page), page);
-	/*
-	 * ->flags can be updated non-atomicially (scan_swap_map_slots),
-	 * but that will never affect SWP_FS_OPS, so the data_race
-	 * is safe.
-	 */
-	if (data_race(sis->flags & SWP_FS_OPS))
-		return swap_writepage_fs(page, wbc);
 
-	ret = bdev_write_page(sis->bdev, swap_page_sector(page), page, wbc);
-	if (!ret) {
+	if (!bdev_write_page(sis->bdev, swap_page_sector(page), page, wbc)) {
 		count_swpout_vm_event(page);
 		return;
 	}
@@ -362,6 +351,22 @@ void __swap_writepage(struct page *page, struct writeback_control *wbc)
 	submit_bio(bio);
 }
 
+void __swap_writepage(struct page *page, struct writeback_control *wbc)
+{
+	struct swap_info_struct *sis = page_swap_info(page);
+
+	VM_BUG_ON_PAGE(!PageSwapCache(page), page);
+	/*
+	 * ->flags can be updated non-atomicially (scan_swap_map_slots),
+	 * but that will never affect SWP_FS_OPS, so the data_race
+	 * is safe.
+	 */
+	if (data_race(sis->flags & SWP_FS_OPS))
+		swap_writepage_fs(page, wbc);
+	else
+		swap_writepage_bdev(page, wbc, sis);
+}
+
 void swap_write_unplug(struct swap_iocb *sio)
 {
 	struct iov_iter from;
-- 
2.39.0



* [PATCH 7/7] block: remove ->rw_page
  2023-01-25 13:34 remove ->rw_page Christoph Hellwig
                   ` (5 preceding siblings ...)
  2023-01-25 13:34 ` [PATCH 6/7] mm: factor out a swap_writepage_bdev helper Christoph Hellwig
@ 2023-01-25 13:34 ` Christoph Hellwig
  2023-01-25 16:28   ` Keith Busch
  2023-01-25 18:38   ` Dan Williams
  2023-01-25 14:32 ` Jens Axboe
  7 siblings, 2 replies; 17+ messages in thread
From: Christoph Hellwig @ 2023-01-25 13:34 UTC (permalink / raw)
  To: Jens Axboe, Minchan Kim, Sergey Senozhatsky, Dan Williams,
	Vishal Verma, Dave Jiang, Ira Weiny
  Cc: Andrew Morton, linux-block, nvdimm, linux-fsdevel, linux-mm

The ->rw_page method is a special-purpose bypass of the usual bio
handling path.  It is limited to single-page, synchronous reads and
writes, and it causes a lot of extra code in the drivers, the callers
and the block layer.

The only remaining user is the MM swap code.  Switch that swap code to
simply submit a single-vec on-stack bio and synchronously wait on it,
based on a newly added QUEUE_FLAG_SYNCHRONOUS flag set by the drivers
that currently implement ->rw_page.  While this touches one extra cache
line and executes extra code, it simplifies the block layer and the
drivers and ensures that all features are properly supported by all
drivers; e.g. right now ->rw_page bypasses cgroup writeback entirely.
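
Driver opt-in is a single queue flag set at initialization time, which
swapon() then translates into SWP_SYNCHRONOUS_IO.  Condensed from the
diff below, the two sides look like:

	blk_queue_flag_set(QUEUE_FLAG_SYNCHRONOUS, disk->queue);

	/* in the swapon() syscall */
	if (p->bdev && bdev_synchronous(p->bdev))
		p->flags |= SWP_SYNCHRONOUS_IO;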

Signed-off-by: Christoph Hellwig <hch@lst.de>
---
 block/bdev.c                  | 78 -----------------------------------
 drivers/block/brd.c           | 15 +------
 drivers/block/zram/zram_drv.c | 61 +--------------------------
 drivers/nvdimm/btt.c          | 16 +------
 drivers/nvdimm/pmem.c         | 24 +----------
 include/linux/blkdev.h        | 12 +++---
 mm/page_io.c                  | 53 ++++++++++++++----------
 mm/swapfile.c                 |  2 +-
 8 files changed, 44 insertions(+), 217 deletions(-)

diff --git a/block/bdev.c b/block/bdev.c
index edc110d90df404..1795c7d4b99efa 100644
--- a/block/bdev.c
+++ b/block/bdev.c
@@ -304,84 +304,6 @@ int thaw_bdev(struct block_device *bdev)
 }
 EXPORT_SYMBOL(thaw_bdev);
 
-/**
- * bdev_read_page() - Start reading a page from a block device
- * @bdev: The device to read the page from
- * @sector: The offset on the device to read the page to (need not be aligned)
- * @page: The page to read
- *
- * On entry, the page should be locked.  It will be unlocked when the page
- * has been read.  If the block driver implements rw_page synchronously,
- * that will be true on exit from this function, but it need not be.
- *
- * Errors returned by this function are usually "soft", eg out of memory, or
- * queue full; callers should try a different route to read this page rather
- * than propagate an error back up the stack.
- *
- * Return: negative errno if an error occurs, 0 if submission was successful.
- */
-int bdev_read_page(struct block_device *bdev, sector_t sector,
-			struct page *page)
-{
-	const struct block_device_operations *ops = bdev->bd_disk->fops;
-	int result = -EOPNOTSUPP;
-
-	if (!ops->rw_page || bdev_get_integrity(bdev))
-		return result;
-
-	result = blk_queue_enter(bdev_get_queue(bdev), 0);
-	if (result)
-		return result;
-	result = ops->rw_page(bdev, sector + get_start_sect(bdev), page,
-			      REQ_OP_READ);
-	blk_queue_exit(bdev_get_queue(bdev));
-	return result;
-}
-
-/**
- * bdev_write_page() - Start writing a page to a block device
- * @bdev: The device to write the page to
- * @sector: The offset on the device to write the page to (need not be aligned)
- * @page: The page to write
- * @wbc: The writeback_control for the write
- *
- * On entry, the page should be locked and not currently under writeback.
- * On exit, if the write started successfully, the page will be unlocked and
- * under writeback.  If the write failed already (eg the driver failed to
- * queue the page to the device), the page will still be locked.  If the
- * caller is a ->writepage implementation, it will need to unlock the page.
- *
- * Errors returned by this function are usually "soft", eg out of memory, or
- * queue full; callers should try a different route to write this page rather
- * than propagate an error back up the stack.
- *
- * Return: negative errno if an error occurs, 0 if submission was successful.
- */
-int bdev_write_page(struct block_device *bdev, sector_t sector,
-			struct page *page, struct writeback_control *wbc)
-{
-	int result;
-	const struct block_device_operations *ops = bdev->bd_disk->fops;
-
-	if (!ops->rw_page || bdev_get_integrity(bdev))
-		return -EOPNOTSUPP;
-	result = blk_queue_enter(bdev_get_queue(bdev), 0);
-	if (result)
-		return result;
-
-	set_page_writeback(page);
-	result = ops->rw_page(bdev, sector + get_start_sect(bdev), page,
-			      REQ_OP_WRITE);
-	if (result) {
-		end_page_writeback(page);
-	} else {
-		clean_page_buffers(page);
-		unlock_page(page);
-	}
-	blk_queue_exit(bdev_get_queue(bdev));
-	return result;
-}
-
 /*
  * pseudo-fs
  */
diff --git a/drivers/block/brd.c b/drivers/block/brd.c
index 20acc4a1fd6def..37dce184eb56c6 100644
--- a/drivers/block/brd.c
+++ b/drivers/block/brd.c
@@ -309,23 +309,9 @@ static void brd_submit_bio(struct bio *bio)
 	bio_endio(bio);
 }
 
-static int brd_rw_page(struct block_device *bdev, sector_t sector,
-		       struct page *page, enum req_op op)
-{
-	struct brd_device *brd = bdev->bd_disk->private_data;
-	int err;
-
-	if (PageTransHuge(page))
-		return -ENOTSUPP;
-	err = brd_do_bvec(brd, page, PAGE_SIZE, 0, op, sector);
-	page_endio(page, op_is_write(op), err);
-	return err;
-}
-
 static const struct block_device_operations brd_fops = {
 	.owner =		THIS_MODULE,
 	.submit_bio =		brd_submit_bio,
-	.rw_page =		brd_rw_page,
 };
 
 /*
@@ -411,6 +397,7 @@ static int brd_alloc(int i)
 
 	/* Tell the block layer that this is not a rotational device */
 	blk_queue_flag_set(QUEUE_FLAG_NONROT, disk->queue);
+	blk_queue_flag_set(QUEUE_FLAG_SYNCHRONOUS, disk->queue);
 	blk_queue_flag_clear(QUEUE_FLAG_ADD_RANDOM, disk->queue);
 	err = add_disk(disk);
 	if (err)
diff --git a/drivers/block/zram/zram_drv.c b/drivers/block/zram/zram_drv.c
index e290d6d970474e..4d5af9bbbea649 100644
--- a/drivers/block/zram/zram_drv.c
+++ b/drivers/block/zram/zram_drv.c
@@ -1453,10 +1453,6 @@ static int __zram_bvec_read(struct zram *zram, struct page *page, u32 index,
 		/* Slot should be unlocked before the function call */
 		zram_slot_unlock(zram, index);
 
-		/* A null bio means rw_page was used, we must fallback to bio */
-		if (!bio)
-			return -EOPNOTSUPP;
-
 		ret = zram_bvec_read_from_bdev(zram, page, index, bio,
 					       partial_io);
 	}
@@ -2081,61 +2077,6 @@ static void zram_slot_free_notify(struct block_device *bdev,
 	zram_slot_unlock(zram, index);
 }
 
-static int zram_rw_page(struct block_device *bdev, sector_t sector,
-		       struct page *page, enum req_op op)
-{
-	int offset, ret;
-	u32 index;
-	struct zram *zram;
-	struct bio_vec bv;
-	unsigned long start_time;
-
-	if (PageTransHuge(page))
-		return -ENOTSUPP;
-	zram = bdev->bd_disk->private_data;
-
-	if (!valid_io_request(zram, sector, PAGE_SIZE)) {
-		atomic64_inc(&zram->stats.invalid_io);
-		ret = -EINVAL;
-		goto out;
-	}
-
-	index = sector >> SECTORS_PER_PAGE_SHIFT;
-	offset = (sector & (SECTORS_PER_PAGE - 1)) << SECTOR_SHIFT;
-
-	bv.bv_page = page;
-	bv.bv_len = PAGE_SIZE;
-	bv.bv_offset = 0;
-
-	start_time = bdev_start_io_acct(bdev->bd_disk->part0,
-			SECTORS_PER_PAGE, op, jiffies);
-	ret = zram_bvec_rw(zram, &bv, index, offset, op, NULL);
-	bdev_end_io_acct(bdev->bd_disk->part0, op, start_time);
-out:
-	/*
-	 * If I/O fails, just return error(ie, non-zero) without
-	 * calling page_endio.
-	 * It causes resubmit the I/O with bio request by upper functions
-	 * of rw_page(e.g., swap_readpage, __swap_writepage) and
-	 * bio->bi_end_io does things to handle the error
-	 * (e.g., SetPageError, set_page_dirty and extra works).
-	 */
-	if (unlikely(ret < 0))
-		return ret;
-
-	switch (ret) {
-	case 0:
-		page_endio(page, op_is_write(op), 0);
-		break;
-	case 1:
-		ret = 0;
-		break;
-	default:
-		WARN_ON(1);
-	}
-	return ret;
-}
-
 static void zram_destroy_comps(struct zram *zram)
 {
 	u32 prio;
@@ -2290,7 +2231,6 @@ static const struct block_device_operations zram_devops = {
 	.open = zram_open,
 	.submit_bio = zram_submit_bio,
 	.swap_slot_free_notify = zram_slot_free_notify,
-	.rw_page = zram_rw_page,
 	.owner = THIS_MODULE
 };
 
@@ -2389,6 +2329,7 @@ static int zram_add(void)
 	set_capacity(zram->disk, 0);
 	/* zram devices sort of resembles non-rotational disks */
 	blk_queue_flag_set(QUEUE_FLAG_NONROT, zram->disk->queue);
+	blk_queue_flag_set(QUEUE_FLAG_SYNCHRONOUS, zram->disk->queue);
 	blk_queue_flag_clear(QUEUE_FLAG_ADD_RANDOM, zram->disk->queue);
 
 	/*
diff --git a/drivers/nvdimm/btt.c b/drivers/nvdimm/btt.c
index 0297b7882e33bd..d5593b0dc7009c 100644
--- a/drivers/nvdimm/btt.c
+++ b/drivers/nvdimm/btt.c
@@ -1482,20 +1482,6 @@ static void btt_submit_bio(struct bio *bio)
 	bio_endio(bio);
 }
 
-static int btt_rw_page(struct block_device *bdev, sector_t sector,
-		struct page *page, enum req_op op)
-{
-	struct btt *btt = bdev->bd_disk->private_data;
-	int rc;
-
-	rc = btt_do_bvec(btt, NULL, page, thp_size(page), 0, op, sector);
-	if (rc == 0)
-		page_endio(page, op_is_write(op), 0);
-
-	return rc;
-}
-
-
 static int btt_getgeo(struct block_device *bd, struct hd_geometry *geo)
 {
 	/* some standard values */
@@ -1508,7 +1494,6 @@ static int btt_getgeo(struct block_device *bd, struct hd_geometry *geo)
 static const struct block_device_operations btt_fops = {
 	.owner =		THIS_MODULE,
 	.submit_bio =		btt_submit_bio,
-	.rw_page =		btt_rw_page,
 	.getgeo =		btt_getgeo,
 };
 
@@ -1530,6 +1515,7 @@ static int btt_blk_init(struct btt *btt)
 	blk_queue_logical_block_size(btt->btt_disk->queue, btt->sector_size);
 	blk_queue_max_hw_sectors(btt->btt_disk->queue, UINT_MAX);
 	blk_queue_flag_set(QUEUE_FLAG_NONROT, btt->btt_disk->queue);
+	blk_queue_flag_set(QUEUE_FLAG_SYNCHRONOUS, btt->btt_disk->queue);
 
 	if (btt_meta_size(btt)) {
 		rc = nd_integrity_init(btt->btt_disk, btt_meta_size(btt));
diff --git a/drivers/nvdimm/pmem.c b/drivers/nvdimm/pmem.c
index 96e6e9a5f235d8..ceea55f621cc7f 100644
--- a/drivers/nvdimm/pmem.c
+++ b/drivers/nvdimm/pmem.c
@@ -238,28 +238,6 @@ static void pmem_submit_bio(struct bio *bio)
 	bio_endio(bio);
 }
 
-static int pmem_rw_page(struct block_device *bdev, sector_t sector,
-		       struct page *page, enum req_op op)
-{
-	struct pmem_device *pmem = bdev->bd_disk->private_data;
-	blk_status_t rc;
-
-	if (op_is_write(op))
-		rc = pmem_do_write(pmem, page, 0, sector, thp_size(page));
-	else
-		rc = pmem_do_read(pmem, page, 0, sector, thp_size(page));
-	/*
-	 * The ->rw_page interface is subtle and tricky.  The core
-	 * retries on any error, so we can only invoke page_endio() in
-	 * the successful completion case.  Otherwise, we'll see crashes
-	 * caused by double completion.
-	 */
-	if (rc == 0)
-		page_endio(page, op_is_write(op), 0);
-
-	return blk_status_to_errno(rc);
-}
-
 /* see "strong" declaration in tools/testing/nvdimm/pmem-dax.c */
 __weak long __pmem_direct_access(struct pmem_device *pmem, pgoff_t pgoff,
 		long nr_pages, enum dax_access_mode mode, void **kaddr,
@@ -310,7 +288,6 @@ __weak long __pmem_direct_access(struct pmem_device *pmem, pgoff_t pgoff,
 static const struct block_device_operations pmem_fops = {
 	.owner =		THIS_MODULE,
 	.submit_bio =		pmem_submit_bio,
-	.rw_page =		pmem_rw_page,
 };
 
 static int pmem_dax_zero_page_range(struct dax_device *dax_dev, pgoff_t pgoff,
@@ -565,6 +542,7 @@ static int pmem_attach_disk(struct device *dev,
 	blk_queue_logical_block_size(q, pmem_sector_size(ndns));
 	blk_queue_max_hw_sectors(q, UINT_MAX);
 	blk_queue_flag_set(QUEUE_FLAG_NONROT, q);
+	blk_queue_flag_set(QUEUE_FLAG_SYNCHRONOUS, q);
 	if (pmem->pfn_flags & PFN_MAP)
 		blk_queue_flag_set(QUEUE_FLAG_DAX, q);
 
diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
index 89f51d68c68ad6..1bffe8f44ae9a8 100644
--- a/include/linux/blkdev.h
+++ b/include/linux/blkdev.h
@@ -555,6 +555,7 @@ struct request_queue {
 #define QUEUE_FLAG_IO_STAT	7	/* do disk/partitions IO accounting */
 #define QUEUE_FLAG_NOXMERGES	9	/* No extended merges */
 #define QUEUE_FLAG_ADD_RANDOM	10	/* Contributes to random pool */
+#define QUEUE_FLAG_SYNCHRONOUS	11	/* always complets in submit context */
 #define QUEUE_FLAG_SAME_FORCE	12	/* force complete on same CPU */
 #define QUEUE_FLAG_INIT_DONE	14	/* queue is initialized */
 #define QUEUE_FLAG_STABLE_WRITES 15	/* don't modify blks until WB is done */
@@ -1252,6 +1253,12 @@ static inline bool bdev_nonrot(struct block_device *bdev)
 	return blk_queue_nonrot(bdev_get_queue(bdev));
 }
 
+static inline bool bdev_synchronous(struct block_device *bdev)
+{
+	return test_bit(QUEUE_FLAG_SYNCHRONOUS,
+			&bdev_get_queue(bdev)->queue_flags);
+}
+
 static inline bool bdev_stable_writes(struct block_device *bdev)
 {
 	return test_bit(QUEUE_FLAG_STABLE_WRITES,
@@ -1396,7 +1403,6 @@ struct block_device_operations {
 			unsigned int flags);
 	int (*open) (struct block_device *, fmode_t);
 	void (*release) (struct gendisk *, fmode_t);
-	int (*rw_page)(struct block_device *, sector_t, struct page *, enum req_op);
 	int (*ioctl) (struct block_device *, fmode_t, unsigned, unsigned long);
 	int (*compat_ioctl) (struct block_device *, fmode_t, unsigned, unsigned long);
 	unsigned int (*check_events) (struct gendisk *disk,
@@ -1431,10 +1437,6 @@ extern int blkdev_compat_ptr_ioctl(struct block_device *, fmode_t,
 #define blkdev_compat_ptr_ioctl NULL
 #endif
 
-extern int bdev_read_page(struct block_device *, sector_t, struct page *);
-extern int bdev_write_page(struct block_device *, sector_t, struct page *,
-						struct writeback_control *);
-
 static inline void blk_wake_io_task(struct task_struct *waiter)
 {
 	/*
diff --git a/mm/page_io.c b/mm/page_io.c
index 2ee2bfe5de0386..69931fcbe7ce6e 100644
--- a/mm/page_io.c
+++ b/mm/page_io.c
@@ -28,7 +28,7 @@
 #include <linux/delayacct.h>
 #include "swap.h"
 
-static void end_swap_bio_write(struct bio *bio)
+static void __end_swap_bio_write(struct bio *bio)
 {
 	struct page *page = bio_first_page_all(bio);
 
@@ -49,6 +49,11 @@ static void end_swap_bio_write(struct bio *bio)
 		ClearPageReclaim(page);
 	}
 	end_page_writeback(page);
+}
+
+static void end_swap_bio_write(struct bio *bio)
+{
+	__end_swap_bio_write(bio);
 	bio_put(bio);
 }
 
@@ -327,15 +332,31 @@ static void swap_writepage_fs(struct page *page, struct writeback_control *wbc)
 		*wbc->swap_plug = sio;
 }
 
-static void swap_writepage_bdev(struct page *page,
+static void swap_writepage_bdev_sync(struct page *page,
 		struct writeback_control *wbc, struct swap_info_struct *sis)
 {
-	struct bio *bio;
+	struct bio_vec bv;
+	struct bio bio;
 
-	if (!bdev_write_page(sis->bdev, swap_page_sector(page), page, wbc)) {
-		count_swpout_vm_event(page);
-		return;
-	}
+	bio_init(&bio, sis->bdev, &bv, 1,
+		 REQ_OP_WRITE | REQ_SWAP | wbc_to_write_flags(wbc));
+	bio.bi_iter.bi_sector = swap_page_sector(page);
+	bio_add_page(&bio, page, thp_size(page), 0);
+
+	bio_associate_blkg_from_page(&bio, page);
+	count_swpout_vm_event(page);
+
+	set_page_writeback(page);
+	unlock_page(page);
+
+	submit_bio_wait(&bio);
+	__end_swap_bio_write(&bio);
+}
+
+static void swap_writepage_bdev_async(struct page *page,
+		struct writeback_control *wbc, struct swap_info_struct *sis)
+{
+	struct bio *bio;
 
 	bio = bio_alloc(sis->bdev, 1,
 			REQ_OP_WRITE | REQ_SWAP | wbc_to_write_flags(wbc),
@@ -363,8 +384,10 @@ void __swap_writepage(struct page *page, struct writeback_control *wbc)
 	 */
 	if (data_race(sis->flags & SWP_FS_OPS))
 		swap_writepage_fs(page, wbc);
+	else if (sis->flags & SWP_SYNCHRONOUS_IO)
+		swap_writepage_bdev_sync(page, wbc, sis);
 	else
-		swap_writepage_bdev(page, wbc, sis);
+		swap_writepage_bdev_async(page, wbc, sis);
 }
 
 void swap_write_unplug(struct swap_iocb *sio)
@@ -448,12 +471,6 @@ static void swap_readpage_bdev_sync(struct page *page,
 	struct bio_vec bv;
 	struct bio bio;
 
-	if ((sis->flags & SWP_SYNCHRONOUS_IO) &&
-	    !bdev_read_page(sis->bdev, swap_page_sector(page), page)) {
-		count_vm_event(PSWPIN);
-		return;
-	}
-
 	bio_init(&bio, sis->bdev, &bv, 1, REQ_OP_READ);
 	bio.bi_iter.bi_sector = swap_page_sector(page);
 	bio_add_page(&bio, page, thp_size(page), 0);
@@ -473,12 +490,6 @@ static void swap_readpage_bdev_async(struct page *page,
 {
 	struct bio *bio;
 
-	if ((sis->flags & SWP_SYNCHRONOUS_IO) &&
-	    !bdev_read_page(sis->bdev, swap_page_sector(page), page)) {
-		count_vm_event(PSWPIN);
-		return;
-	}
-
 	bio = bio_alloc(sis->bdev, 1, REQ_OP_READ, GFP_KERNEL);
 	bio->bi_iter.bi_sector = swap_page_sector(page);
 	bio->bi_end_io = end_swap_bio_read;
@@ -514,7 +525,7 @@ void swap_readpage(struct page *page, bool synchronous, struct swap_iocb **plug)
 		unlock_page(page);
 	} else if (data_race(sis->flags & SWP_FS_OPS)) {
 		swap_readpage_fs(page, plug);
-	} else if (synchronous) {
+	} else if (synchronous || (sis->flags & SWP_SYNCHRONOUS_IO)) {
 		swap_readpage_bdev_sync(page, sis);
 	} else {
 		swap_readpage_bdev_async(page, sis);
diff --git a/mm/swapfile.c b/mm/swapfile.c
index 908a529bca12c9..dc61050a46d8ef 100644
--- a/mm/swapfile.c
+++ b/mm/swapfile.c
@@ -3069,7 +3069,7 @@ SYSCALL_DEFINE2(swapon, const char __user *, specialfile, int, swap_flags)
 	if (p->bdev && bdev_stable_writes(p->bdev))
 		p->flags |= SWP_STABLE_WRITES;
 
-	if (p->bdev && p->bdev->bd_disk->fops->rw_page)
+	if (p->bdev && bdev_synchronous(p->bdev))
 		p->flags |= SWP_SYNCHRONOUS_IO;
 
 	if (p->bdev && bdev_nonrot(p->bdev)) {
-- 
2.39.0



* Re: remove ->rw_page
  2023-01-25 13:34 remove ->rw_page Christoph Hellwig
                   ` (6 preceding siblings ...)
  2023-01-25 13:34 ` [PATCH 7/7] block: remove ->rw_page Christoph Hellwig
@ 2023-01-25 14:32 ` Jens Axboe
  7 siblings, 0 replies; 17+ messages in thread
From: Jens Axboe @ 2023-01-25 14:32 UTC (permalink / raw)
  To: Christoph Hellwig, Minchan Kim, Sergey Senozhatsky, Dan Williams,
	Vishal Verma, Dave Jiang, Ira Weiny
  Cc: Andrew Morton, linux-block, nvdimm, linux-fsdevel, linux-mm

On 1/25/23 6:34 AM, Christoph Hellwig wrote:
> Hi all,
> 
> this series removes the ->rw_page block_device_operation, which is an old
> and clumsy attempt at a simple read/write fast path for the block layer.
> It isn't actually used by the fastest block layer operations that we
> support (polled I/O through io_uring), but only by the mpage buffered
> I/O helpers, which are some of the slowest I/O paths we have and where
> it makes no difference at all, and by zram, which is a block device
> abused to duplicate the zswap functionality.  Given that zram is heavily
> used, we need to make sure there is a good replacement for synchronous
> I/O, so this series adds a new flag for drivers that complete I/O
> synchronously and uses that flag to switch to on-stack bios and
> synchronous submission for them in the swap code.

This is great, thanks for doing it. There's no reason for this weird
rw_page interface to exist.

-- 
Jens Axboe




* Re: [PATCH 2/7] mm: remove the swap_readpage return value
  2023-01-25 13:34 ` [PATCH 2/7] mm: remove the swap_readpage return value Christoph Hellwig
@ 2023-01-25 15:58   ` Keith Busch
  2023-01-26  5:30     ` Christoph Hellwig
  2023-01-25 18:00   ` Dan Williams
  1 sibling, 1 reply; 17+ messages in thread
From: Keith Busch @ 2023-01-25 15:58 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Jens Axboe, Minchan Kim, Sergey Senozhatsky, Dan Williams,
	Vishal Verma, Dave Jiang, Ira Weiny, Andrew Morton, linux-block,
	nvdimm, linux-fsdevel, linux-mm

On Wed, Jan 25, 2023 at 02:34:31PM +0100, Christoph Hellwig wrote:
> -static inline int swap_readpage(struct page *page, bool do_poll,
> -				struct swap_iocb **plug)
> +static inline void swap_readpage(struct page *page, bool do_poll,
> +		struct swap_iocb **plug)
>  {
>  	return 0;
>  }

Need to remove the 'return 0'.


* Re: [PATCH 7/7] block: remove ->rw_page
  2023-01-25 13:34 ` [PATCH 7/7] block: remove ->rw_page Christoph Hellwig
@ 2023-01-25 16:28   ` Keith Busch
  2023-01-26  5:30     ` Christoph Hellwig
  2023-01-25 18:38   ` Dan Williams
  1 sibling, 1 reply; 17+ messages in thread
From: Keith Busch @ 2023-01-25 16:28 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Jens Axboe, Minchan Kim, Sergey Senozhatsky, Dan Williams,
	Vishal Verma, Dave Jiang, Ira Weiny, Andrew Morton, linux-block,
	nvdimm, linux-fsdevel, linux-mm

On Wed, Jan 25, 2023 at 02:34:36PM +0100, Christoph Hellwig wrote:
> @@ -363,8 +384,10 @@ void __swap_writepage(struct page *page, struct writeback_control *wbc)
>  	 */
>  	if (data_race(sis->flags & SWP_FS_OPS))
>  		swap_writepage_fs(page, wbc);
> +	else if (sis->flags & SWP_SYNCHRONOUS_IO)
> +		swap_writepage_bdev_sync(page, wbc, sis);

For an additional cleanup, it looks okay to remove the SWP_SYNCHRONOUS_IO flag
entirely and just check bdev_synchronous(sis->bdev) directly instead.
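
Something like this, as an untested sketch of that idea:

	else if (bdev_synchronous(sis->bdev))
		swap_writepage_bdev_sync(page, wbc, sis);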


* RE: [PATCH 1/7] mpage: stop using bdev_{read,write}_page
  2023-01-25 13:34 ` [PATCH 1/7] mpage: stop using bdev_{read,write}_page Christoph Hellwig
@ 2023-01-25 17:58   ` Dan Williams
  0 siblings, 0 replies; 17+ messages in thread
From: Dan Williams @ 2023-01-25 17:58 UTC (permalink / raw)
  To: Christoph Hellwig, Jens Axboe, Minchan Kim, Sergey Senozhatsky,
	Dan Williams, Vishal Verma, Dave Jiang, Ira Weiny
  Cc: Andrew Morton, linux-block, nvdimm, linux-fsdevel, linux-mm

Christoph Hellwig wrote:
> These are micro-optimizations for synchronous I/O, which do not matter
> compared to all the other inefficiencies in the legacy buffer_head
> based mpage code.
> 
> Signed-off-by: Christoph Hellwig <hch@lst.de>
> ---
>  fs/mpage.c | 10 ----------
>  1 file changed, 10 deletions(-)
> 
> diff --git a/fs/mpage.c b/fs/mpage.c
> index 0f8ae954a57903..124550cfac4a70 100644
> --- a/fs/mpage.c
> +++ b/fs/mpage.c
> @@ -269,11 +269,6 @@ static struct bio *do_mpage_readpage(struct mpage_readpage_args *args)
>  
>  alloc_new:
>  	if (args->bio == NULL) {
> -		if (first_hole == blocks_per_page) {
> -			if (!bdev_read_page(bdev, blocks[0] << (blkbits - 9),
> -								&folio->page))
> -				goto out;
> -		}
>  		args->bio = bio_alloc(bdev, bio_max_segs(args->nr_pages), opf,
>  				      gfp);
>  		if (args->bio == NULL)
> @@ -579,11 +574,6 @@ static int __mpage_writepage(struct page *page, struct writeback_control *wbc,
>  
>  alloc_new:
>  	if (bio == NULL) {
> -		if (first_unmapped == blocks_per_page) {
> -			if (!bdev_write_page(bdev, blocks[0] << (blkbits - 9),
> -								page, wbc))
> -				goto out;
> -		}
>  		bio = bio_alloc(bdev, BIO_MAX_VECS,
>  				REQ_OP_WRITE | wbc_to_write_flags(wbc),
>  				GFP_NOFS);
> -- 
> 2.39.0
> 

Makes sense,

Reviewed-by: Dan Williams <dan.j.williams@intel.com>


* RE: [PATCH 2/7] mm: remove the swap_readpage return value
  2023-01-25 13:34 ` [PATCH 2/7] mm: remove the swap_readpage return value Christoph Hellwig
  2023-01-25 15:58   ` Keith Busch
@ 2023-01-25 18:00   ` Dan Williams
  1 sibling, 0 replies; 17+ messages in thread
From: Dan Williams @ 2023-01-25 18:00 UTC (permalink / raw)
  To: Christoph Hellwig, Jens Axboe, Minchan Kim, Sergey Senozhatsky,
	Dan Williams, Vishal Verma, Dave Jiang, Ira Weiny
  Cc: Andrew Morton, linux-block, nvdimm, linux-fsdevel, linux-mm

Christoph Hellwig wrote:
> swap_readpage always returns 0, and no caller checks the return value.
> 
> Signed-off-by: Christoph Hellwig <hch@lst.de>
> ---
>  mm/page_io.c | 16 +++++-----------
>  mm/swap.h    |  7 +++----
>  2 files changed, 8 insertions(+), 15 deletions(-)
> 
> diff --git a/mm/page_io.c b/mm/page_io.c
> index 3a5f921b932e82..6f7166fdc4b2bb 100644
> --- a/mm/page_io.c
> +++ b/mm/page_io.c
> @@ -445,11 +445,9 @@ static void swap_readpage_fs(struct page *page,
>  		*plug = sio;
>  }
>  
> -int swap_readpage(struct page *page, bool synchronous,
> -		  struct swap_iocb **plug)
> +void swap_readpage(struct page *page, bool synchronous, struct swap_iocb **plug)
>  {
>  	struct bio *bio;
> -	int ret = 0;
>  	struct swap_info_struct *sis = page_swap_info(page);
>  	bool workingset = PageWorkingset(page);
>  	unsigned long pflags;
> @@ -481,15 +479,12 @@ int swap_readpage(struct page *page, bool synchronous,
>  		goto out;
>  	}
>  
> -	if (sis->flags & SWP_SYNCHRONOUS_IO) {
> -		ret = bdev_read_page(sis->bdev, swap_page_sector(page), page);
> -		if (!ret) {
> -			count_vm_event(PSWPIN);
> -			goto out;
> -		}
> +	if ((sis->flags & SWP_SYNCHRONOUS_IO) &&
> +	    !bdev_read_page(sis->bdev, swap_page_sector(page), page)) {
> +		count_vm_event(PSWPIN);
> +		goto out;
>  	}
>  
> -	ret = 0;
>  	bio = bio_alloc(sis->bdev, 1, REQ_OP_READ, GFP_KERNEL);
>  	bio->bi_iter.bi_sector = swap_page_sector(page);
>  	bio->bi_end_io = end_swap_bio_read;
> @@ -521,7 +516,6 @@ int swap_readpage(struct page *page, bool synchronous,
>  		psi_memstall_leave(&pflags);
>  	}
>  	delayacct_swapin_end();
> -	return ret;
>  }
>  
>  void __swap_read_unplug(struct swap_iocb *sio)
> diff --git a/mm/swap.h b/mm/swap.h
> index f78065c8ef524b..f5eb5069d28c2e 100644
> --- a/mm/swap.h
> +++ b/mm/swap.h
> @@ -8,8 +8,7 @@
>  /* linux/mm/page_io.c */
>  int sio_pool_init(void);
>  struct swap_iocb;
> -int swap_readpage(struct page *page, bool do_poll,
> -		  struct swap_iocb **plug);
> +void swap_readpage(struct page *page, bool do_poll, struct swap_iocb **plug);
>  void __swap_read_unplug(struct swap_iocb *plug);
>  static inline void swap_read_unplug(struct swap_iocb *plug)
>  {
> @@ -64,8 +63,8 @@ static inline unsigned int folio_swap_flags(struct folio *folio)
>  }
>  #else /* CONFIG_SWAP */
>  struct swap_iocb;
> -static inline int swap_readpage(struct page *page, bool do_poll,
> -				struct swap_iocb **plug)
> +static inline void swap_readpage(struct page *page, bool do_poll,
> +		struct swap_iocb **plug)
>  {
>  	return 0;
>  }
> -- 
> 2.39.0
> 

Looks correct,

Reviewed-by: Dan Williams <dan.j.williams@intel.com>


* RE: [PATCH 3/7] mm: factor out a swap_readpage_bdev helper
  2023-01-25 13:34 ` [PATCH 3/7] mm: factor out a swap_readpage_bdev helper Christoph Hellwig
@ 2023-01-25 18:30   ` Dan Williams
  0 siblings, 0 replies; 17+ messages in thread
From: Dan Williams @ 2023-01-25 18:30 UTC (permalink / raw)
  To: Christoph Hellwig, Jens Axboe, Minchan Kim, Sergey Senozhatsky,
	Dan Williams, Vishal Verma, Dave Jiang, Ira Weiny
  Cc: Andrew Morton, linux-block, nvdimm, linux-fsdevel, linux-mm

Christoph Hellwig wrote:
> Split the block device case from swap_readpage into a separate helper,
> following the abstraction for file based swap and frontswap.
> 
> Signed-off-by: Christoph Hellwig <hch@lst.de>
> ---
>  mm/page_io.c | 68 +++++++++++++++++++++++++++-------------------------
>  1 file changed, 35 insertions(+), 33 deletions(-)
> 
> diff --git a/mm/page_io.c b/mm/page_io.c
> index 6f7166fdc4b2bb..ce0b3638094f85 100644
> --- a/mm/page_io.c
> +++ b/mm/page_io.c
> @@ -445,44 +445,15 @@ static void swap_readpage_fs(struct page *page,
>  		*plug = sio;
>  }
>  
> -void swap_readpage(struct page *page, bool synchronous, struct swap_iocb **plug)
> +static void swap_readpage_bdev(struct page *page, bool synchronous,
> +		struct swap_info_struct *sis)
>  {
>  	struct bio *bio;
> -	struct swap_info_struct *sis = page_swap_info(page);
> -	bool workingset = PageWorkingset(page);
> -	unsigned long pflags;
> -	bool in_thrashing;
> -
> -	VM_BUG_ON_PAGE(!PageSwapCache(page) && !synchronous, page);
> -	VM_BUG_ON_PAGE(!PageLocked(page), page);
> -	VM_BUG_ON_PAGE(PageUptodate(page), page);
> -
> -	/*
> -	 * Count submission time as memory stall and delay. When the device
> -	 * is congested, or the submitting cgroup IO-throttled, submission
> -	 * can be a significant part of overall IO time.
> -	 */
> -	if (workingset) {
> -		delayacct_thrashing_start(&in_thrashing);
> -		psi_memstall_enter(&pflags);
> -	}
> -	delayacct_swapin_start();
> -
> -	if (frontswap_load(page) == 0) {
> -		SetPageUptodate(page);
> -		unlock_page(page);
> -		goto out;
> -	}
> -
> -	if (data_race(sis->flags & SWP_FS_OPS)) {
> -		swap_readpage_fs(page, plug);
> -		goto out;
> -	}
>  
>  	if ((sis->flags & SWP_SYNCHRONOUS_IO) &&
>  	    !bdev_read_page(sis->bdev, swap_page_sector(page), page)) {
>  		count_vm_event(PSWPIN);
> -		goto out;
> +		return;
>  	}
>  
>  	bio = bio_alloc(sis->bdev, 1, REQ_OP_READ, GFP_KERNEL);
> @@ -509,8 +480,39 @@ void swap_readpage(struct page *page, bool synchronous, struct swap_iocb **plug)
>  	}
>  	__set_current_state(TASK_RUNNING);
>  	bio_put(bio);
> +}
> +
> +void swap_readpage(struct page *page, bool synchronous, struct swap_iocb **plug)
> +{
> +	struct swap_info_struct *sis = page_swap_info(page);
> +	bool workingset = PageWorkingset(page);
> +	unsigned long pflags;
> +	bool in_thrashing;
> +
> +	VM_BUG_ON_PAGE(!PageSwapCache(page) && !synchronous, page);
> +	VM_BUG_ON_PAGE(!PageLocked(page), page);
> +	VM_BUG_ON_PAGE(PageUptodate(page), page);
> +
> +	/*
> +	 * Count submission time as memory stall and delay. When the device
> +	 * is congested, or the submitting cgroup IO-throttled, submission
> +	 * can be a significant part of overall IO time.
> +	 */
> +	if (workingset) {
> +		delayacct_thrashing_start(&in_thrashing);
> +		psi_memstall_enter(&pflags);
> +	}
> +	delayacct_swapin_start();
> +
> +	if (frontswap_load(page) == 0) {
> +		SetPageUptodate(page);
> +		unlock_page(page);
> +	} else if (data_race(sis->flags & SWP_FS_OPS)) {
> +		swap_readpage_fs(page, plug);
> +	} else {
> +		swap_readpage_bdev(page, synchronous, sis);
> +	}
>  
> -out:
>  	if (workingset) {
>  		delayacct_thrashing_end(&in_thrashing);
>  		psi_memstall_leave(&pflags);

Looks good, passes tests,

Reviewed-by: Dan Williams <dan.j.williams@intel.com>


* RE: [PATCH 7/7] block: remove ->rw_page
  2023-01-25 13:34 ` [PATCH 7/7] block: remove ->rw_page Christoph Hellwig
  2023-01-25 16:28   ` Keith Busch
@ 2023-01-25 18:38   ` Dan Williams
  1 sibling, 0 replies; 17+ messages in thread
From: Dan Williams @ 2023-01-25 18:38 UTC (permalink / raw)
  To: Christoph Hellwig, Jens Axboe, Minchan Kim, Sergey Senozhatsky,
	Dan Williams, Vishal Verma, Dave Jiang, Ira Weiny
  Cc: Andrew Morton, linux-block, nvdimm, linux-fsdevel, linux-mm

Christoph Hellwig wrote:
> The ->rw_page method is a special-purpose bypass of the usual bio
> handling path.  It is limited to single-page, synchronous reads and
> writes, and it causes a lot of extra code in the drivers, the callers
> and the block layer.
> 
> The only remaining user is the MM swap code.  Switch that swap code to
> simply submit a single-vec on-stack bio and synchronously wait on it,
> based on a newly added QUEUE_FLAG_SYNCHRONOUS flag set by the drivers
> that currently implement ->rw_page.  While this touches one extra cache
> line and executes extra code, it simplifies the block layer and the
> drivers and ensures that all features are properly supported by all
> drivers; e.g. right now ->rw_page bypasses cgroup writeback entirely.
> 
> Signed-off-by: Christoph Hellwig <hch@lst.de>
> ---
>  block/bdev.c                  | 78 -----------------------------------
>  drivers/block/brd.c           | 15 +------
>  drivers/block/zram/zram_drv.c | 61 +--------------------------
>  drivers/nvdimm/btt.c          | 16 +------
>  drivers/nvdimm/pmem.c         | 24 +----------
>  include/linux/blkdev.h        | 12 +++---
>  mm/page_io.c                  | 53 ++++++++++++++----------
>  mm/swapfile.c                 |  2 +-
>  8 files changed, 44 insertions(+), 217 deletions(-)
> 
[..]
> diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
> index 89f51d68c68ad6..1bffe8f44ae9a8 100644
> --- a/include/linux/blkdev.h
> +++ b/include/linux/blkdev.h
> @@ -555,6 +555,7 @@ struct request_queue {
>  #define QUEUE_FLAG_IO_STAT	7	/* do disk/partitions IO accounting */
>  #define QUEUE_FLAG_NOXMERGES	9	/* No extended merges */
>  #define QUEUE_FLAG_ADD_RANDOM	10	/* Contributes to random pool */
> +#define QUEUE_FLAG_SYNCHRONOUS	11	/* always complets in submit context */

s/complets/completes/

>  #define QUEUE_FLAG_SAME_FORCE	12	/* force complete on same CPU */
>  #define QUEUE_FLAG_INIT_DONE	14	/* queue is initialized */
>  #define QUEUE_FLAG_STABLE_WRITES 15	/* don't modify blks until WB is done */
> @@ -1252,6 +1253,12 @@ static inline bool bdev_nonrot(struct block_device *bdev)
>  	return blk_queue_nonrot(bdev_get_queue(bdev));
>  }
>  

Other than that, this looks good and passes regression:

Reviewed-by: Dan Williams <dan.j.williams@intel.com>


* Re: [PATCH 7/7] block: remove ->rw_page
  2023-01-25 16:28   ` Keith Busch
@ 2023-01-26  5:30     ` Christoph Hellwig
  0 siblings, 0 replies; 17+ messages in thread
From: Christoph Hellwig @ 2023-01-26  5:30 UTC (permalink / raw)
  To: Keith Busch
  Cc: Christoph Hellwig, Jens Axboe, Minchan Kim, Sergey Senozhatsky,
	Dan Williams, Vishal Verma, Dave Jiang, Ira Weiny, Andrew Morton,
	linux-block, nvdimm, linux-fsdevel, linux-mm

On Wed, Jan 25, 2023 at 09:28:33AM -0700, Keith Busch wrote:
> On Wed, Jan 25, 2023 at 02:34:36PM +0100, Christoph Hellwig wrote:
> > @@ -363,8 +384,10 @@ void __swap_writepage(struct page *page, struct writeback_control *wbc)
> >  	 */
> >  	if (data_race(sis->flags & SWP_FS_OPS))
> >  		swap_writepage_fs(page, wbc);
> > +	else if (sis->flags & SWP_SYNCHRONOUS_IO)
> > +		swap_writepage_bdev_sync(page, wbc, sis);
> 
> For an additional cleanup, it looks okay to remove the SWP_SYNCHRONOUS_IO flag
> entirely and just check bdev_synchronous(sis->bdev)) directly instead.

The swap code relatively consistently maps bdev flags to SWP_* flags,
including SWP_STABLE_WRITES, SWAP_FLAG_DISCARD and the somewhat misnamed
SWP_SOLIDSTATE.   So if we want to change that it's probably a separate
series.
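
For reference, with this series applied the swapon() side of that
mapping looks roughly like this (condensed from mm/swapfile.c; the
SWP_SOLIDSTATE branch also sets up the per-cpu cluster state):

	if (p->bdev && bdev_stable_writes(p->bdev))
		p->flags |= SWP_STABLE_WRITES;

	if (p->bdev && bdev_synchronous(p->bdev))
		p->flags |= SWP_SYNCHRONOUS_IO;

	if (p->bdev && bdev_nonrot(p->bdev))
		p->flags |= SWP_SOLIDSTATE;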


* Re: [PATCH 2/7] mm: remove the swap_readpage return value
  2023-01-25 15:58   ` Keith Busch
@ 2023-01-26  5:30     ` Christoph Hellwig
  0 siblings, 0 replies; 17+ messages in thread
From: Christoph Hellwig @ 2023-01-26  5:30 UTC (permalink / raw)
  To: Keith Busch
  Cc: Christoph Hellwig, Jens Axboe, Minchan Kim, Sergey Senozhatsky,
	Dan Williams, Vishal Verma, Dave Jiang, Ira Weiny, Andrew Morton,
	linux-block, nvdimm, linux-fsdevel, linux-mm

On Wed, Jan 25, 2023 at 08:58:31AM -0700, Keith Busch wrote:
> On Wed, Jan 25, 2023 at 02:34:31PM +0100, Christoph Hellwig wrote:
> > -static inline int swap_readpage(struct page *page, bool do_poll,
> > -				struct swap_iocb **plug)
> > +static inline void swap_readpage(struct page *page, bool do_poll,
> > +		struct swap_iocb **plug)
> >  {
> >  	return 0;
> >  }
> 
> Need to remove the 'return 0'.

Yes.

