All of lore.kernel.org
* [PATCH 0/3] remove rw_page() from brd, pmem and btt
@ 2017-07-28 16:56 ` Ross Zwisler
  0 siblings, 0 replies; 54+ messages in thread
From: Ross Zwisler @ 2017-07-28 16:56 UTC (permalink / raw)
  To: Andrew Morton, linux-kernel
  Cc: Jens Axboe, Jerome Marchand, linux-nvdimm, Dave Chinner,
	Matthew Wilcox, Christoph Hellwig, Minchan Kim, Jan Kara,
	karam . lee, seungho1.park, Nitin Gupta

Dan Williams and Christoph Hellwig have recently expressed doubt about
whether the rw_page() interface made sense for synchronous memory drivers
[1][2].  It's unclear whether this interface has any performance benefit
for these drivers, but as we continue to fix bugs it is clear that it does
have a maintenance burden.  This series removes the rw_page()
implementations in brd, pmem and btt to relieve this burden.
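
For reference, the hook in question is the ->rw_page() member of struct
block_device_operations; callers reach it through bdev_read_page() and
bdev_write_page() in fs/block_dev.c, which fall back to normal bio submission
when a driver does not implement it.  The prototype as of v4.13 (the "..."
below elides unrelated members):

struct block_device_operations {
	...
	int (*rw_page)(struct block_device *, sector_t, struct page *, bool);
	...
};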

The last existing user of the rw_page interface is the zram driver, and
according to the changelog for the patch that added zram_rw_page() that
driver does see a clear performance gain:

  I implemented the feature in zram and tested it.  The test bed was the G2,
  an LG Electronics mobile device, which has an msm8974 processor and 2GB of
  memory.

  With a memory allocation test program consuming memory, the system
  generates swap.

  Operating time of swap_write_page() was measured.

  --------------------------------------------------
  |             |   operating time   | improvement |
  |             |  (20 runs average) |             |
  --------------------------------------------------
  |with patch   |    1061.15 us      |    +2.4%    |
  --------------------------------------------------
  |without patch|    1087.35 us      |             |
  --------------------------------------------------

  Each test's result set (with paged I/O, with BIO) shows a normal
  distribution and equal variance, so the two values are valid to compare.
  I can say the operation with paged I/O (without BIO) is 2.4% faster, at a
  95% confidence level.

These patches have passed ext4 and XFS xfstests regression testing with
a memory-mode pmem driver (without DAX), with pmem + btt, and with brd.

These patches apply cleanly to the current v4.13-rc2 based linux/master.

[1] https://lists.01.org/pipermail/linux-nvdimm/2017-July/011389.html
[2] https://www.mail-archive.com/linux-block@vger.kernel.org/msg11170.html

Ross Zwisler (3):
  btt: remove btt_rw_page()
  pmem: remove pmem_rw_page()
  brd: remove brd_rw_page()

 drivers/block/brd.c   | 10 ----------
 drivers/nvdimm/btt.c  | 15 ---------------
 drivers/nvdimm/pmem.c | 21 ---------------------
 3 files changed, 46 deletions(-)

-- 
2.9.4

^ permalink raw reply	[flat|nested] 54+ messages in thread

* [PATCH 1/3] btt: remove btt_rw_page()
  2017-07-28 16:56 ` Ross Zwisler
@ 2017-07-28 16:56   ` Ross Zwisler
  -1 siblings, 0 replies; 54+ messages in thread
From: Ross Zwisler @ 2017-07-28 16:56 UTC (permalink / raw)
  To: Andrew Morton, linux-kernel
  Cc: Jens Axboe, Jerome Marchand, linux-nvdimm, Dave Chinner,
	Matthew Wilcox, Christoph Hellwig, Minchan Kim, Jan Kara,
	karam . lee, seungho1.park, Nitin Gupta

The rw_page() interface doesn't provide a clear performance benefit for the
BTT and has had a nonzero maintenance burden, so remove it.

Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com>
Suggested-by: Dan Williams <dan.j.williams@intel.com>
Suggested-by: Christoph Hellwig <hch@infradead.org>
Cc: Matthew Wilcox <willy@infradead.org>
---
 drivers/nvdimm/btt.c | 15 ---------------
 1 file changed, 15 deletions(-)

diff --git a/drivers/nvdimm/btt.c b/drivers/nvdimm/btt.c
index 14323fa..e10d330 100644
--- a/drivers/nvdimm/btt.c
+++ b/drivers/nvdimm/btt.c
@@ -1236,20 +1236,6 @@ static blk_qc_t btt_make_request(struct request_queue *q, struct bio *bio)
 	return BLK_QC_T_NONE;
 }
 
-static int btt_rw_page(struct block_device *bdev, sector_t sector,
-		struct page *page, bool is_write)
-{
-	struct btt *btt = bdev->bd_disk->private_data;
-	int rc;
-
-	rc = btt_do_bvec(btt, NULL, page, PAGE_SIZE, 0, is_write, sector);
-	if (rc == 0)
-		page_endio(page, is_write, 0);
-
-	return rc;
-}
-
-
 static int btt_getgeo(struct block_device *bd, struct hd_geometry *geo)
 {
 	/* some standard values */
@@ -1261,7 +1247,6 @@ static int btt_getgeo(struct block_device *bd, struct hd_geometry *geo)
 
 static const struct block_device_operations btt_fops = {
 	.owner =		THIS_MODULE,
-	.rw_page =		btt_rw_page,
 	.getgeo =		btt_getgeo,
 	.revalidate_disk =	nvdimm_revalidate_disk,
 };
-- 
2.9.4

^ permalink raw reply related	[flat|nested] 54+ messages in thread

* [PATCH 2/3] pmem: remove pmem_rw_page()
  2017-07-28 16:56 ` Ross Zwisler
@ 2017-07-28 16:56   ` Ross Zwisler
  -1 siblings, 0 replies; 54+ messages in thread
From: Ross Zwisler @ 2017-07-28 16:56 UTC (permalink / raw)
  To: Andrew Morton, linux-kernel
  Cc: Jens Axboe, Jerome Marchand, linux-nvdimm, Dave Chinner,
	Matthew Wilcox, Christoph Hellwig, Minchan Kim, Jan Kara,
	karam . lee, seungho1.park, Nitin Gupta

The rw_page() interface doesn't provide a clear performance benefit for
PMEM and has had a nonzero maintenance burden, so remove it.

Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com>
Suggested-by: Dan Williams <dan.j.williams@intel.com>
Suggested-by: Christoph Hellwig <hch@infradead.org>
Cc: Matthew Wilcox <willy@infradead.org>
---
 drivers/nvdimm/pmem.c | 21 ---------------------
 1 file changed, 21 deletions(-)

diff --git a/drivers/nvdimm/pmem.c b/drivers/nvdimm/pmem.c
index f7099ada..f23c82d 100644
--- a/drivers/nvdimm/pmem.c
+++ b/drivers/nvdimm/pmem.c
@@ -182,26 +182,6 @@ static blk_qc_t pmem_make_request(struct request_queue *q, struct bio *bio)
 	return BLK_QC_T_NONE;
 }
 
-static int pmem_rw_page(struct block_device *bdev, sector_t sector,
-		       struct page *page, bool is_write)
-{
-	struct pmem_device *pmem = bdev->bd_queue->queuedata;
-	blk_status_t rc;
-
-	rc = pmem_do_bvec(pmem, page, PAGE_SIZE, 0, is_write, sector);
-
-	/*
-	 * The ->rw_page interface is subtle and tricky.  The core
-	 * retries on any error, so we can only invoke page_endio() in
-	 * the successful completion case.  Otherwise, we'll see crashes
-	 * caused by double completion.
-	 */
-	if (rc == 0)
-		page_endio(page, is_write, 0);
-
-	return blk_status_to_errno(rc);
-}
-
 /* see "strong" declaration in tools/testing/nvdimm/pmem-dax.c */
 __weak long __pmem_direct_access(struct pmem_device *pmem, pgoff_t pgoff,
 		long nr_pages, void **kaddr, pfn_t *pfn)
@@ -225,7 +205,6 @@ __weak long __pmem_direct_access(struct pmem_device *pmem, pgoff_t pgoff,
 
 static const struct block_device_operations pmem_fops = {
 	.owner =		THIS_MODULE,
-	.rw_page =		pmem_rw_page,
 	.revalidate_disk =	nvdimm_revalidate_disk,
 };
 
-- 
2.9.4

^ permalink raw reply related	[flat|nested] 54+ messages in thread

* [PATCH 3/3] brd: remove brd_rw_page()
  2017-07-28 16:56 ` Ross Zwisler
@ 2017-07-28 16:56   ` Ross Zwisler
  -1 siblings, 0 replies; 54+ messages in thread
From: Ross Zwisler @ 2017-07-28 16:56 UTC (permalink / raw)
  To: Andrew Morton, linux-kernel
  Cc: Jens Axboe, Jerome Marchand, linux-nvdimm, Dave Chinner,
	Matthew Wilcox, Christoph Hellwig, Minchan Kim, Jan Kara,
	karam . lee, seungho1.park, Nitin Gupta

The rw_page() interface doesn't provide a clear performance benefit for
BRD and has had a nonzero maintenance burden, so remove it.

Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com>
Suggested-by: Dan Williams <dan.j.williams@intel.com>
Suggested-by: Christoph Hellwig <hch@infradead.org>
Cc: Matthew Wilcox <willy@infradead.org>
---
 drivers/block/brd.c | 10 ----------
 1 file changed, 10 deletions(-)

diff --git a/drivers/block/brd.c b/drivers/block/brd.c
index 104b71c..29325058 100644
--- a/drivers/block/brd.c
+++ b/drivers/block/brd.c
@@ -322,15 +322,6 @@ static blk_qc_t brd_make_request(struct request_queue *q, struct bio *bio)
 	return BLK_QC_T_NONE;
 }
 
-static int brd_rw_page(struct block_device *bdev, sector_t sector,
-		       struct page *page, bool is_write)
-{
-	struct brd_device *brd = bdev->bd_disk->private_data;
-	int err = brd_do_bvec(brd, page, PAGE_SIZE, 0, is_write, sector);
-	page_endio(page, is_write, err);
-	return err;
-}
-
 #ifdef CONFIG_BLK_DEV_RAM_DAX
 static long __brd_direct_access(struct brd_device *brd, pgoff_t pgoff,
 		long nr_pages, void **kaddr, pfn_t *pfn)
@@ -370,7 +361,6 @@ static const struct dax_operations brd_dax_ops = {
 
 static const struct block_device_operations brd_fops = {
 	.owner =		THIS_MODULE,
-	.rw_page =		brd_rw_page,
 };
 
 /*
-- 
2.9.4

^ permalink raw reply related	[flat|nested] 54+ messages in thread

* Re: [PATCH 0/3] remove rw_page() from brd, pmem and btt
  2017-07-28 16:56 ` Ross Zwisler
@ 2017-07-28 17:31   ` Matthew Wilcox
  -1 siblings, 0 replies; 54+ messages in thread
From: Matthew Wilcox @ 2017-07-28 17:31 UTC (permalink / raw)
  To: Ross Zwisler
  Cc: Jens Axboe, Jerome Marchand, linux-nvdimm, Dave Chinner,
	linux-kernel, Christoph Hellwig, Minchan Kim, Jan Kara,
	Andrew Morton, karam . lee, seungho1.park, Nitin Gupta

On Fri, Jul 28, 2017 at 10:56:01AM -0600, Ross Zwisler wrote:
> Dan Williams and Christoph Hellwig have recently expressed doubt about
> whether the rw_page() interface made sense for synchronous memory drivers
> [1][2].  It's unclear whether this interface has any performance benefit
> for these drivers, but as we continue to fix bugs it is clear that it does
> have a maintenance burden.  This series removes the rw_page()
> implementations in brd, pmem and btt to relieve this burden.

Why don't you measure whether it has performance benefits?  I don't
understand why zram would see performance benefits and not other drivers.
If it's going to be removed, then the whole interface should be removed,
not just have the implementations removed from some drivers.

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [PATCH 0/3] remove rw_page() from brd, pmem and btt
  2017-07-28 17:31   ` Matthew Wilcox
@ 2017-07-28 21:21     ` Andrew Morton
  -1 siblings, 0 replies; 54+ messages in thread
From: Andrew Morton @ 2017-07-28 21:21 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: Jens Axboe, Jerome Marchand, linux-nvdimm, Dave Chinner,
	linux-kernel, Christoph Hellwig, Minchan Kim, Jan Kara,
	karam . lee, seungho1.park, Nitin Gupta

On Fri, 28 Jul 2017 10:31:43 -0700 Matthew Wilcox <willy@infradead.org> wrote:

> On Fri, Jul 28, 2017 at 10:56:01AM -0600, Ross Zwisler wrote:
> > Dan Williams and Christoph Hellwig have recently expressed doubt about
> > whether the rw_page() interface made sense for synchronous memory drivers
> > [1][2].  It's unclear whether this interface has any performance benefit
> > for these drivers, but as we continue to fix bugs it is clear that it does
> > have a maintenance burden.  This series removes the rw_page()
> > implementations in brd, pmem and btt to relieve this burden.
> 
> Why don't you measure whether it has performance benefits?  I don't
> understand why zram would see performance benefits and not other drivers.
> If it's going to be removed, then the whole interface should be removed,
> not just have the implementations removed from some drivers.

Yes please.  Minchan, could you please take a look sometime?

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [PATCH 0/3] remove rw_page() from brd, pmem and btt
  2017-07-28 21:21     ` Andrew Morton
@ 2017-07-30 22:16       ` Minchan Kim
  -1 siblings, 0 replies; 54+ messages in thread
From: Minchan Kim @ 2017-07-30 22:16 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Jens Axboe, Jerome Marchand, linux-nvdimm, Dave Chinner,
	linux-kernel, Matthew Wilcox, Christoph Hellwig, Jan Kara,
	karam . lee, seungho1.park, Nitin Gupta

Hi Andrew,

On Fri, Jul 28, 2017 at 02:21:23PM -0700, Andrew Morton wrote:
> On Fri, 28 Jul 2017 10:31:43 -0700 Matthew Wilcox <willy@infradead.org> wrote:
> 
> > On Fri, Jul 28, 2017 at 10:56:01AM -0600, Ross Zwisler wrote:
> > > Dan Williams and Christoph Hellwig have recently expressed doubt about
> > > whether the rw_page() interface made sense for synchronous memory drivers
> > > [1][2].  It's unclear whether this interface has any performance benefit
> > > for these drivers, but as we continue to fix bugs it is clear that it does
> > > have a maintenance burden.  This series removes the rw_page()
> > > implementations in brd, pmem and btt to relieve this burden.
> > 
> > Why don't you measure whether it has performance benefits?  I don't
> > understand why zram would see performance benefits and not other drivers.
> > If it's going to be removed, then the whole interface should be removed,
> > not just have the implementations removed from some drivers.
> 
> Yes please.  Minchan, could you please take a look sometime?

rw_page's gain is that it avoids dynamic allocation in the swap path,
as well as the performance gain from avoiding bio allocation.
That matters most in memory pressure situations.

I guess it comes from the bio_alloc mempool. zram-swap usually runs
under high memory pressure, so the mempool can be exhausted easily.
That means waiting on the mempool and repeated allocations add
overhead.

Actually, although Karam reported the gain as 2.4% at the time,
I got a report from the production team that the gain in corner cases
(e.g., smooth animation playback) was much higher than
expected.
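
For reference, the allocation in question is roughly the following (cf.
get_swap_bio() in v4.13's mm/page_io.c; abridged sketch, not verbatim).
Every swap bio comes from bio_alloc(), i.e. from the shared fs_bio_set
mempool:

static struct bio *get_swap_bio(gfp_t gfp_flags,
				struct page *page, bio_end_io_t end_io)
{
	struct bio *bio;

	/* can block waiting on the shared mempool under memory pressure */
	bio = bio_alloc(gfp_flags, 1);
	if (bio) {
		bio->bi_iter.bi_sector = map_swap_page(page, &bio->bi_bdev);
		bio->bi_iter.bi_sector <<= PAGE_SHIFT - 9;
		bio->bi_end_io = end_io;
		bio_add_page(bio, page, PAGE_SIZE, 0);
	}
	return bio;
}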

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [PATCH 0/3] remove rw_page() from brd, pmem and btt
  2017-07-30 22:16       ` Minchan Kim
@ 2017-07-30 22:38         ` Minchan Kim
  -1 siblings, 0 replies; 54+ messages in thread
From: Minchan Kim @ 2017-07-30 22:38 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Jens Axboe, Jerome Marchand, linux-nvdimm, Dave Chinner,
	linux-kernel, Matthew Wilcox, Christoph Hellwig, Jan Kara,
	karam . lee, seungho1.park, Nitin Gupta

On Mon, Jul 31, 2017 at 07:16:59AM +0900, Minchan Kim wrote:
> Hi Andrew,
> 
> On Fri, Jul 28, 2017 at 02:21:23PM -0700, Andrew Morton wrote:
> > On Fri, 28 Jul 2017 10:31:43 -0700 Matthew Wilcox <willy@infradead.org> wrote:
> > 
> > > On Fri, Jul 28, 2017 at 10:56:01AM -0600, Ross Zwisler wrote:
> > > > Dan Williams and Christoph Hellwig have recently expressed doubt about
> > > > whether the rw_page() interface made sense for synchronous memory drivers
> > > > [1][2].  It's unclear whether this interface has any performance benefit
> > > > for these drivers, but as we continue to fix bugs it is clear that it does
> > > > have a maintenance burden.  This series removes the rw_page()
> > > > implementations in brd, pmem and btt to relieve this burden.
> > > 
> > > Why don't you measure whether it has performance benefits?  I don't
> > > understand why zram would see performance benefits and not other drivers.
> > > If it's going to be removed, then the whole interface should be removed,
> > > not just have the implementations removed from some drivers.
> > 
> > Yes please.  Minchan, could you please take a look sometime?
> 
> rw_page's gain is reducing of dynamic allocation in swap path
> as well as performance gain thorugh avoiding bio allocation.
> And it would be important in memory pressure situation.
> 
> I guess it comes from bio_alloc mempool. Usually, zram-swap works
> in high memory pressure so mempool would be exahusted easily.
> It means that mempool wait and repeated alloc would consume the
> overhead.
> 
> Actually, at that time although Karam reported the gain is 2.4%,
> I got a report from production team that the gain in corner case
> (e.g., animation playing is smooth) would be much higher than
> expected.

One idea is to create a bioset only for swap, not shared with the FS,
so bio allocation for swap doesn't have to wait for bios to be returned
to the mempool from the FS side, which does slow NAND IO.

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [PATCH 0/3] remove rw_page() from brd, pmem and btt
  2017-07-30 22:16       ` Minchan Kim
@ 2017-07-31  7:17         ` Christoph Hellwig
  -1 siblings, 0 replies; 54+ messages in thread
From: Christoph Hellwig @ 2017-07-31  7:17 UTC (permalink / raw)
  To: Minchan Kim
  Cc: Jens Axboe, Jerome Marchand, linux-nvdimm, Dave Chinner,
	linux-kernel, Matthew Wilcox, Christoph Hellwig, seungho1.park,
	Jan Kara, Andrew Morton, karam . lee, Nitin Gupta

On Mon, Jul 31, 2017 at 07:16:59AM +0900, Minchan Kim wrote:
> rw_page's gain is reducing of dynamic allocation in swap path
> as well as performance gain thorugh avoiding bio allocation.
> And it would be important in memory pressure situation.

There is no need for any dynamic allocation when using the bio
path.  Take a look at __blkdev_direct_IO_simple for an example
that doesn't do any allocations.
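
To make that concrete, here is a minimal sketch of a synchronous single-page
read built on an on-stack bio and biovec, in the spirit of
__blkdev_direct_IO_simple (sync_read_page() and its context are illustrative
only, not existing kernel code):

/*
 * Illustrative sketch: read one page synchronously without touching the
 * bio mempools.  The bio and its biovec live on the stack, and
 * submit_bio_wait() blocks until the request completes.
 */
static int sync_read_page(struct block_device *bdev, sector_t sector,
			  struct page *page)
{
	struct bio bio;
	struct bio_vec bvec;

	bio_init(&bio, &bvec, 1);
	bio.bi_bdev = bdev;
	bio.bi_iter.bi_sector = sector;
	bio_set_op_attrs(&bio, REQ_OP_READ, 0);
	bio_add_page(&bio, page, PAGE_SIZE, 0);

	return submit_bio_wait(&bio);
}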

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [PATCH 0/3] remove rw_page() from brd, pmem and btt
  2017-07-31  7:17         ` Christoph Hellwig
@ 2017-07-31  7:36           ` Minchan Kim
  -1 siblings, 0 replies; 54+ messages in thread
From: Minchan Kim @ 2017-07-31  7:36 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Jens Axboe, Jerome Marchand, linux-nvdimm, Dave Chinner,
	linux-kernel, Matthew Wilcox, seungho1.park, Jan Kara,
	Andrew Morton, karam . lee, Nitin Gupta

On Mon, Jul 31, 2017 at 09:17:07AM +0200, Christoph Hellwig wrote:
> On Mon, Jul 31, 2017 at 07:16:59AM +0900, Minchan Kim wrote:
> > rw_page's gain is reducing of dynamic allocation in swap path
> > as well as performance gain thorugh avoiding bio allocation.
> > And it would be important in memory pressure situation.
> 
> There is no need for any dynamic allocation when using the bio
> path.  Take a look at __blkdev_direct_IO_simple for an example
> that doesn't do any allocations.

Do you suggest defining a special flag (e.g., SWP_INMEMORY) in
swap_info_struct for in-memory swap, set either manually at swapon time
or automatically from something like the bdi/queue capabilities?
And then, depending on that swap_info_struct flag, using an on-stack bio
instead of dynamic allocation if the swap device is in-memory?
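
For the automatic variant, a minimal sketch of the swapon-time check, assuming
a capability/flag pair along the lines of BDI_CAP_SYNC / SWP_SYNC_IO (see the
patch later in this thread) and mirroring how SWP_STABLE_WRITES is derived
from BDI_CAP_STABLE_WRITES in mm/swapfile.c:

	/*
	 * Hypothetical sketch, not a tested patch: mark the swap device
	 * as synchronous at swapon time based on its backing_dev_info
	 * capabilities.
	 */
	if (bdi_cap_sync_io_required(inode_to_bdi(inode)))
		p->flags |= SWP_SYNC_IO;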

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [PATCH 0/3] remove rw_page() from brd, pmem and btt
  2017-07-31  7:36           ` Minchan Kim
@ 2017-07-31  7:42             ` Christoph Hellwig
  -1 siblings, 0 replies; 54+ messages in thread
From: Christoph Hellwig @ 2017-07-31  7:42 UTC (permalink / raw)
  To: Minchan Kim
  Cc: Jens Axboe, Jerome Marchand, linux-nvdimm, Dave Chinner,
	linux-kernel, Matthew Wilcox, seungho1.park, Jan Kara,
	Andrew Morton, karam . lee, Christoph Hellwig, Nitin Gupta

On Mon, Jul 31, 2017 at 04:36:47PM +0900, Minchan Kim wrote:
> Do you suggest define something special flag(e.g., SWP_INMEMORY)
> for in-memory swap to swap_info_struct when swapon time manually
> or from bdi_queue_someting automatically?
> And depending the flag of swap_info_struct, use the onstack bio
> instead of dynamic allocation if the swap device is in-memory?

Currently swap always just does I/O on a single page as far
as I can tell, so it can always just use an on-stack bio and
biovec.

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [PATCH 0/3] remove rw_page() from brd, pmem and btt
  2017-07-31  7:42             ` Christoph Hellwig
@ 2017-07-31  7:44               ` Christoph Hellwig
  -1 siblings, 0 replies; 54+ messages in thread
From: Christoph Hellwig @ 2017-07-31  7:44 UTC (permalink / raw)
  To: Minchan Kim
  Cc: Jens Axboe, Jerome Marchand, linux-nvdimm, Dave Chinner,
	linux-kernel, Matthew Wilcox, seungho1.park, Jan Kara,
	Andrew Morton, karam . lee, Christoph Hellwig, Nitin Gupta

On Mon, Jul 31, 2017 at 09:42:06AM +0200, Christoph Hellwig wrote:
> On Mon, Jul 31, 2017 at 04:36:47PM +0900, Minchan Kim wrote:
> > Do you suggest define something special flag(e.g., SWP_INMEMORY)
> > for in-memory swap to swap_info_struct when swapon time manually
> > or from bdi_queue_someting automatically?
> > And depending the flag of swap_info_struct, use the onstack bio
> > instead of dynamic allocation if the swap device is in-memory?
> 
> Currently swap always just does I/O on a single page as far
> as I can tell, so it can always just use an on-stack bio and
> biovec.

That's for synchronous I/O, aka reads of course.  For writes you'll
need to do a dynamic allocation if they are asynchronous.  But yes,
if we want to force certain devices to be synchronous we'll need
a flag for that.

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [PATCH 0/3] remove rw_page() from brd, pmem and btt
  2017-07-31  7:44               ` Christoph Hellwig
@ 2017-08-01  6:23                 ` Minchan Kim
  -1 siblings, 0 replies; 54+ messages in thread
From: Minchan Kim @ 2017-08-01  6:23 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Jens Axboe, Jerome Marchand, linux-nvdimm, Dave Chinner,
	linux-kernel, Matthew Wilcox, seungho1.park, Jan Kara,
	Andrew Morton, karam . lee, Nitin Gupta

On Mon, Jul 31, 2017 at 09:44:04AM +0200, Christoph Hellwig wrote:
> On Mon, Jul 31, 2017 at 09:42:06AM +0200, Christoph Hellwig wrote:
> > On Mon, Jul 31, 2017 at 04:36:47PM +0900, Minchan Kim wrote:
> > > Do you suggest define something special flag(e.g., SWP_INMEMORY)
> > > for in-memory swap to swap_info_struct when swapon time manually
> > > or from bdi_queue_someting automatically?
> > > And depending the flag of swap_info_struct, use the onstack bio
> > > instead of dynamic allocation if the swap device is in-memory?
> > 
> > Currently swap always just does I/O on a single page as far
> > as I can tell, so it can always just use an on-stack bio and
> > biovec.
> 
> That's for synchronous I/O, aka reads of course.  For writes you'll
> need to do a dynamic allocation if they are asynchronous.  But yes,
> if we want to force certain devices to be synchronous we'll need
> a flag for that.

Okay, I will look into that.
Thanks for the suggestion, Christoph.

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [PATCH 0/3] remove rw_page() from brd, pmem and btt
  2017-07-28 17:31   ` Matthew Wilcox
@ 2017-08-02 22:13     ` Ross Zwisler
  -1 siblings, 0 replies; 54+ messages in thread
From: Ross Zwisler @ 2017-08-02 22:13 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: Jens Axboe, Jerome Marchand, linux-nvdimm, Dave Chinner,
	linux-kernel, Christoph Hellwig, Minchan Kim, seungho1.park,
	Jan Kara, karam . lee, Andrew Morton, Nitin Gupta

On Fri, Jul 28, 2017 at 10:31:43AM -0700, Matthew Wilcox wrote:
> On Fri, Jul 28, 2017 at 10:56:01AM -0600, Ross Zwisler wrote:
> > Dan Williams and Christoph Hellwig have recently expressed doubt about
> > whether the rw_page() interface made sense for synchronous memory drivers
> > [1][2].  It's unclear whether this interface has any performance benefit
> > for these drivers, but as we continue to fix bugs it is clear that it does
> > have a maintenance burden.  This series removes the rw_page()
> > implementations in brd, pmem and btt to relieve this burden.
> 
> Why don't you measure whether it has performance benefits?  I don't
> understand why zram would see performance benefits and not other drivers.
> If it's going to be removed, then the whole interface should be removed,
> not just have the implementations removed from some drivers.

Okay, I've run a bunch of performance tests with the PMEM and BTT entry
points for rw_page() in a swap workload, and in all cases I do see an
improvement over the code with rw_page() removed.  Here are the results
from my random lab box:

  Average latency of swap_writepage()
+------+------------+---------+-------------+
|      | no rw_page | rw_page | Improvement |
+-------------------------------------------+
| PMEM |  5.0 us    |  4.7 us |     6%      |
+-------------------------------------------+
|  BTT |  6.8 us    |  6.1 us |    10%      |
+------+------------+---------+-------------+

  Average latency of swap_readpage()
+------+------------+---------+-------------+
|      | no rw_page | rw_page | Improvement |
+-------------------------------------------+
| PMEM |  3.3 us    |  2.9 us |    12%      |
+-------------------------------------------+
|  BTT |  3.7 us    |  3.4 us |     8%      |
+------+------------+---------+-------------+

The workload was pmbench, a memory benchmark, run on a system where I had
severely restricted the amount of memory in the system with the 'mem' kernel
command line parameter.  The benchmark was set up to test more memory than I
allowed the OS to have so it spilled over into swap.

The PMEM or BTT device was set up as my swap device, and during the test I got
a few hundred thousand samples of each of swap_writepage() and
swap_readpage().  The PMEM/BTT device was just memory reserved with the
memmap kernel command line parameter.

Thanks, Matthew, for asking for performance data.  It looks like removing this
code would have been a mistake.

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [PATCH 0/3] remove rw_page() from brd, pmem and btt
  2017-08-02 22:13     ` Ross Zwisler
@ 2017-08-03  0:13       ` Minchan Kim
  -1 siblings, 0 replies; 54+ messages in thread
From: Minchan Kim @ 2017-08-03  0:13 UTC (permalink / raw)
  To: Ross Zwisler
  Cc: Jens Axboe, Jerome Marchand, linux-nvdimm, Dave Chinner,
	linux-kernel, Matthew Wilcox, Christoph Hellwig, Jan Kara,
	Andrew Morton, karam . lee, seungho1.park, Nitin Gupta

Hi Ross,

On Wed, Aug 02, 2017 at 04:13:59PM -0600, Ross Zwisler wrote:
> On Fri, Jul 28, 2017 at 10:31:43AM -0700, Matthew Wilcox wrote:
> > On Fri, Jul 28, 2017 at 10:56:01AM -0600, Ross Zwisler wrote:
> > > Dan Williams and Christoph Hellwig have recently expressed doubt about
> > > whether the rw_page() interface made sense for synchronous memory drivers
> > > [1][2].  It's unclear whether this interface has any performance benefit
> > > for these drivers, but as we continue to fix bugs it is clear that it does
> > > have a maintenance burden.  This series removes the rw_page()
> > > implementations in brd, pmem and btt to relieve this burden.
> > 
> > Why don't you measure whether it has performance benefits?  I don't
> > understand why zram would see performance benefits and not other drivers.
> > If it's going to be removed, then the whole interface should be removed,
> > not just have the implementations removed from some drivers.
> 
> Okay, I've run a bunch of performance tests with the PMEM and with BTT entry
> points for rw_pages() in a swap workload, and in all cases I do see an
> improvement over the code when rw_pages() is removed.  Here are the results
> from my random lab box:
> 
>   Average latency of swap_writepage()
> +------+------------+---------+-------------+
> |      | no rw_page | rw_page | Improvement |
> +-------------------------------------------+
> | PMEM |  5.0 us    |  4.7 us |     6%      |
> +-------------------------------------------+
> |  BTT |  6.8 us    |  6.1 us |    10%      |
> +------+------------+---------+-------------+
> 
>   Average latency of swap_readpage()
> +------+------------+---------+-------------+
> |      | no rw_page | rw_page | Improvement |
> +-------------------------------------------+
> | PMEM |  3.3 us    |  2.9 us |    12%      |
> +-------------------------------------------+
> |  BTT |  3.7 us    |  3.4 us |     8%      |
> +------+------------+---------+-------------+
> 
> The workload was pmbench, a memory benchmark, run on a system where I had
> severely restricted the amount of memory in the system with the 'mem' kernel
> command line parameter.  The benchmark was set up to test more memory than I
> allowed the OS to have so it spilled over into swap.
> 
> The PMEM or BTT device was set up as my swap device, and during the test I got
> a few hundred thousand samples of each of swap_writepage() and
> swap_writepage().  The PMEM/BTT device was just memory reserved with the
> memmap kernel command line parameter.
> 
> Thanks, Matthew, for asking for performance data.  It looks like removing this
> code would have been a mistake.

At Christoph Hellwig's suggestion, I made a quick patch which does swap IO
without dynamic bio allocation.  It's not a formal patch worth sending to
mainline yet, but I believe it's enough to test the improvement.

Could you test the patchset on pmem and btt without rw_page?

For the patch to work, block drivers need to declare themselves synchronous
IO devices via BDI_CAP_SYNC, but if that's hard, you can just make every swap
IO take the synchronous path by removing the

	if (!(sis->flags & SWP_SYNC_IO))

check in swap_[read|write]page.

The patchset is based on 4.13-rc3.


diff --git a/drivers/block/zram/zram_drv.c b/drivers/block/zram/zram_drv.c
index 856d5dc02451..b1c5e9bf3ad5 100644
--- a/drivers/block/zram/zram_drv.c
+++ b/drivers/block/zram/zram_drv.c
@@ -125,9 +125,9 @@ static inline bool is_partial_io(struct bio_vec *bvec)
 static void zram_revalidate_disk(struct zram *zram)
 {
 	revalidate_disk(zram->disk);
-	/* revalidate_disk reset the BDI_CAP_STABLE_WRITES so set again */
+	/* revalidate_disk reset the BDI capability so set again */
 	zram->disk->queue->backing_dev_info->capabilities |=
-		BDI_CAP_STABLE_WRITES;
+		(BDI_CAP_STABLE_WRITES|BDI_CAP_SYNC);
 }
 
 /*
@@ -1096,7 +1096,7 @@ static int zram_open(struct block_device *bdev, fmode_t mode)
 static const struct block_device_operations zram_devops = {
 	.open = zram_open,
 	.swap_slot_free_notify = zram_slot_free_notify,
-	.rw_page = zram_rw_page,
+	// .rw_page = zram_rw_page,
 	.owner = THIS_MODULE
 };
 
diff --git a/include/linux/backing-dev.h b/include/linux/backing-dev.h
index 854e1bdd0b2a..05eee145d964 100644
--- a/include/linux/backing-dev.h
+++ b/include/linux/backing-dev.h
@@ -130,6 +130,7 @@ int bdi_set_max_ratio(struct backing_dev_info *bdi, unsigned int max_ratio);
 #define BDI_CAP_STABLE_WRITES	0x00000008
 #define BDI_CAP_STRICTLIMIT	0x00000010
 #define BDI_CAP_CGROUP_WRITEBACK 0x00000020
+#define BDI_CAP_SYNC		0x00000040
 
 #define BDI_CAP_NO_ACCT_AND_WRITEBACK \
 	(BDI_CAP_NO_WRITEBACK | BDI_CAP_NO_ACCT_DIRTY | BDI_CAP_NO_ACCT_WB)
@@ -177,6 +178,11 @@ long wait_iff_congested(struct pglist_data *pgdat, int sync, long timeout);
 int pdflush_proc_obsolete(struct ctl_table *table, int write,
 		void __user *buffer, size_t *lenp, loff_t *ppos);
 
+static inline bool bdi_cap_sync_io_required(struct backing_dev_info *bdi)
+{
+	return bdi->capabilities & BDI_CAP_SYNC;
+}
+
 static inline bool bdi_cap_stable_pages_required(struct backing_dev_info *bdi)
 {
 	return bdi->capabilities & BDI_CAP_STABLE_WRITES;
diff --git a/include/linux/swap.h b/include/linux/swap.h
index d83d28e53e62..86457dbfd300 100644
--- a/include/linux/swap.h
+++ b/include/linux/swap.h
@@ -152,8 +152,9 @@ enum {
 	SWP_AREA_DISCARD = (1 << 8),	/* single-time swap area discards */
 	SWP_PAGE_DISCARD = (1 << 9),	/* freed swap page-cluster discards */
 	SWP_STABLE_WRITES = (1 << 10),	/* no overwrite PG_writeback pages */
+	SWP_SYNC_IO	= (1 << 11),
 					/* add others here before... */
-	SWP_SCANNING	= (1 << 11),	/* refcount in scan_swap_map */
+	SWP_SCANNING	= (1 << 12),	/* refcount in scan_swap_map */
 };
 
 #define SWAP_CLUSTER_MAX 32UL
diff --git a/mm/page_io.c b/mm/page_io.c
index b6c4ac388209..2c85e5182364 100644
--- a/mm/page_io.c
+++ b/mm/page_io.c
@@ -263,7 +263,6 @@ static sector_t swap_page_sector(struct page *page)
 int __swap_writepage(struct page *page, struct writeback_control *wbc,
 		bio_end_io_t end_write_func)
 {
-	struct bio *bio;
 	int ret;
 	struct swap_info_struct *sis = page_swap_info(page);
 
@@ -316,25 +315,44 @@ int __swap_writepage(struct page *page, struct writeback_control *wbc,
 	}
 
 	ret = 0;
-	bio = get_swap_bio(GFP_NOIO, page, end_write_func);
-	if (bio == NULL) {
-		set_page_dirty(page);
+	count_vm_event(PSWPOUT);
+
+	if (!(sis->flags & SWP_SYNC_IO)) {
+		struct bio *bio;
+
+		bio = get_swap_bio(GFP_NOIO, page, end_write_func);
+		if (bio == NULL) {
+			set_page_dirty(page);
+			unlock_page(page);
+			ret = -ENOMEM;
+			goto out;
+		}
+		bio->bi_opf = REQ_OP_WRITE | wbc_to_write_flags(wbc);
+		set_page_writeback(page);
 		unlock_page(page);
-		ret = -ENOMEM;
-		goto out;
+		submit_bio(bio);
+	} else {
+		struct bio bio;
+		struct bio_vec bvec;
+
+		bio_init(&bio, &bvec, 1);
+
+		bio.bi_iter.bi_sector = map_swap_page(page, &bio.bi_bdev);
+		bio.bi_iter.bi_sector <<= PAGE_SHIFT - 9;
+		bio.bi_end_io = end_write_func;
+		bio_add_page(&bio, page, PAGE_SIZE, 0);
+		bio.bi_opf = REQ_OP_WRITE | wbc_to_write_flags(wbc);
+		bio_get(&bio);
+		set_page_writeback(page);
+		unlock_page(page);
+		submit_bio(&bio);
 	}
-	bio->bi_opf = REQ_OP_WRITE | wbc_to_write_flags(wbc);
-	count_vm_event(PSWPOUT);
-	set_page_writeback(page);
-	unlock_page(page);
-	submit_bio(bio);
 out:
 	return ret;
 }
 
 int swap_readpage(struct page *page, bool do_poll)
 {
-	struct bio *bio;
 	int ret = 0;
 	struct swap_info_struct *sis = page_swap_info(page);
 	blk_qc_t qc;
@@ -371,29 +389,49 @@ int swap_readpage(struct page *page, bool do_poll)
 	}
 
 	ret = 0;
-	bio = get_swap_bio(GFP_KERNEL, page, end_swap_bio_read);
-	if (bio == NULL) {
-		unlock_page(page);
-		ret = -ENOMEM;
-		goto out;
-	}
-	bdev = bio->bi_bdev;
-	bio->bi_private = current;
-	bio_set_op_attrs(bio, REQ_OP_READ, 0);
-	count_vm_event(PSWPIN);
-	bio_get(bio);
-	qc = submit_bio(bio);
-	while (do_poll) {
-		set_current_state(TASK_UNINTERRUPTIBLE);
-		if (!READ_ONCE(bio->bi_private))
-			break;
-
-		if (!blk_mq_poll(bdev_get_queue(bdev), qc))
-			break;
+	if (!(sis->flags & SWP_SYNC_IO)) {
+		struct bio *bio;
+
+		bio = get_swap_bio(GFP_KERNEL, page, end_swap_bio_read);
+		if (bio == NULL) {
+			unlock_page(page);
+			ret = -ENOMEM;
+			goto out;
+		}
+		bdev = bio->bi_bdev;
+		bio->bi_private = current;
+		bio_set_op_attrs(bio, REQ_OP_READ, 0);
+		bio_get(bio);
+		qc = submit_bio(bio);
+		while (do_poll) {
+			set_current_state(TASK_UNINTERRUPTIBLE);
+			if (!READ_ONCE(bio->bi_private))
+				break;
+
+			if (!blk_mq_poll(bdev_get_queue(bdev), qc))
+				break;
+		}
+		__set_current_state(TASK_RUNNING);
+		bio_put(bio);
+	} else {
+		struct bio bio;
+		struct bio_vec bvec;
+
+		bio_init(&bio, &bvec, 1);
+
+		bio.bi_iter.bi_sector = map_swap_page(page, &bio.bi_bdev);
+		bio.bi_iter.bi_sector <<= PAGE_SHIFT - 9;
+		bio.bi_end_io = end_swap_bio_read;
+		bio_add_page(&bio, page, PAGE_SIZE, 0);
+		bio.bi_private = current;
+		BUG_ON(bio.bi_iter.bi_size != PAGE_SIZE);
+		bio_set_op_attrs(&bio, REQ_OP_READ, 0);
+		/* end_swap_bio_read calls bio_put unconditionally */
+		bio_get(&bio);
+		submit_bio(&bio);
 	}
-	__set_current_state(TASK_RUNNING);
-	bio_put(bio);
 
+	count_vm_event(PSWPIN);
 out:
 	return ret;
 }
diff --git a/mm/swapfile.c b/mm/swapfile.c
index 6ba4aab2db0b..855d50eeeaf9 100644
--- a/mm/swapfile.c
+++ b/mm/swapfile.c
@@ -2931,6 +2931,9 @@ SYSCALL_DEFINE2(swapon, const char __user *, specialfile, int, swap_flags)
 	if (bdi_cap_stable_pages_required(inode_to_bdi(inode)))
 		p->flags |= SWP_STABLE_WRITES;
 
+	if (bdi_cap_sync_io_required(inode_to_bdi(inode)))
+		p->flags |= SWP_SYNC_IO;
+
 	if (p->bdev && blk_queue_nonrot(bdev_get_queue(p->bdev))) {
 		int cpu;
 		unsigned long ci, nr_cluster;
_______________________________________________
Linux-nvdimm mailing list
Linux-nvdimm@lists.01.org
https://lists.01.org/mailman/listinfo/linux-nvdimm

^ permalink raw reply related	[flat|nested] 54+ messages in thread

* Re: [PATCH 0/3] remove rw_page() from brd, pmem and btt
  2017-08-03  0:13       ` Minchan Kim
@ 2017-08-03  0:34         ` Dan Williams
  -1 siblings, 0 replies; 54+ messages in thread
From: Dan Williams @ 2017-08-03  0:34 UTC (permalink / raw)
  To: Minchan Kim
  Cc: Jens Axboe, Jerome Marchand, linux-nvdimm, Dave Chinner,
	linux-kernel, Matthew Wilcox, Christoph Hellwig, Chen, Tim C,
	Huang, Ying

[ adding Tim and Ying who have also been looking at swap optimization
and rw_page interactions ]

On Wed, Aug 2, 2017 at 5:13 PM, Minchan Kim <minchan@kernel.org> wrote:
> Hi Ross,
>
> On Wed, Aug 02, 2017 at 04:13:59PM -0600, Ross Zwisler wrote:
>> On Fri, Jul 28, 2017 at 10:31:43AM -0700, Matthew Wilcox wrote:
>> > On Fri, Jul 28, 2017 at 10:56:01AM -0600, Ross Zwisler wrote:
>> > > Dan Williams and Christoph Hellwig have recently expressed doubt about
>> > > whether the rw_page() interface made sense for synchronous memory drivers
>> > > [1][2].  It's unclear whether this interface has any performance benefit
>> > > for these drivers, but as we continue to fix bugs it is clear that it does
>> > > have a maintenance burden.  This series removes the rw_page()
>> > > implementations in brd, pmem and btt to relieve this burden.
>> >
>> > Why don't you measure whether it has performance benefits?  I don't
>> > understand why zram would see performance benefits and not other drivers.
>> > If it's going to be removed, then the whole interface should be removed,
>> > not just have the implementations removed from some drivers.
>>
>> Okay, I've run a bunch of performance tests with the PMEM and BTT entry
>> points for rw_page() in a swap workload, and in all cases I do see an
>> improvement over the code with rw_page() removed.  Here are the results
>> from my random lab box:
>>
>>   Average latency of swap_writepage()
>> +------+------------+---------+-------------+
>> |      | no rw_page | rw_page | Improvement |
>> +-------------------------------------------+
>> | PMEM |  5.0 us    |  4.7 us |     6%      |
>> +-------------------------------------------+
>> |  BTT |  6.8 us    |  6.1 us |    10%      |
>> +------+------------+---------+-------------+
>>
>>   Average latency of swap_readpage()
>> +------+------------+---------+-------------+
>> |      | no rw_page | rw_page | Improvement |
>> +-------------------------------------------+
>> | PMEM |  3.3 us    |  2.9 us |    12%      |
>> +-------------------------------------------+
>> |  BTT |  3.7 us    |  3.4 us |     8%      |
>> +------+------------+---------+-------------+
>>
>> The workload was pmbench, a memory benchmark, run on a system where I had
>> severely restricted the amount of memory in the system with the 'mem' kernel
>> command line parameter.  The benchmark was set up to test more memory than I
>> allowed the OS to have so it spilled over into swap.
>>
>> The PMEM or BTT device was set up as my swap device, and during the test I got
>> a few hundred thousand samples of each of swap_writepage() and
>> swap_readpage().  The PMEM/BTT device was just memory reserved with the
>> memmap kernel command line parameter.
>>
>> Thanks, Matthew, for asking for performance data.  It looks like removing this
>> code would have been a mistake.
>
> At Christoph Hellwig's suggestion, I made a quick patch that does swap IO
> without dynamic bio allocation. It is not yet a formal patch worth sending
> mainline, but I believe it is enough to test the improvement.
>
> Could you test this patchset on pmem and btt without rw_page()?
>
> For the patch to take effect, a block driver needs to declare that it is a
> synchronous IO device via BDI_CAP_SYNC. If that is inconvenient, you can
> instead force every swap IO down the synchronous path by removing the
>
> if (!(sis->flags & SWP_SYNC_IO)) check in swap_[read|write]page.
>
> The patchset is based on 4.13-rc3.
>
>
> diff --git a/drivers/block/zram/zram_drv.c b/drivers/block/zram/zram_drv.c
> index 856d5dc02451..b1c5e9bf3ad5 100644
> --- a/drivers/block/zram/zram_drv.c
> +++ b/drivers/block/zram/zram_drv.c
> @@ -125,9 +125,9 @@ static inline bool is_partial_io(struct bio_vec *bvec)
>  static void zram_revalidate_disk(struct zram *zram)
>  {
>         revalidate_disk(zram->disk);
> -       /* revalidate_disk reset the BDI_CAP_STABLE_WRITES so set again */
> +       /* revalidate_disk reset the BDI capability so set again */
>         zram->disk->queue->backing_dev_info->capabilities |=
> -               BDI_CAP_STABLE_WRITES;
> +               (BDI_CAP_STABLE_WRITES|BDI_CAP_SYNC);
>  }
>
>  /*
> @@ -1096,7 +1096,7 @@ static int zram_open(struct block_device *bdev, fmode_t mode)
>  static const struct block_device_operations zram_devops = {
>         .open = zram_open,
>         .swap_slot_free_notify = zram_slot_free_notify,
> -       .rw_page = zram_rw_page,
> +       // .rw_page = zram_rw_page,
>         .owner = THIS_MODULE
>  };
>
> diff --git a/include/linux/backing-dev.h b/include/linux/backing-dev.h
> index 854e1bdd0b2a..05eee145d964 100644
> --- a/include/linux/backing-dev.h
> +++ b/include/linux/backing-dev.h
> @@ -130,6 +130,7 @@ int bdi_set_max_ratio(struct backing_dev_info *bdi, unsigned int max_ratio);
>  #define BDI_CAP_STABLE_WRITES  0x00000008
>  #define BDI_CAP_STRICTLIMIT    0x00000010
>  #define BDI_CAP_CGROUP_WRITEBACK 0x00000020
> +#define BDI_CAP_SYNC           0x00000040
>
>  #define BDI_CAP_NO_ACCT_AND_WRITEBACK \
>         (BDI_CAP_NO_WRITEBACK | BDI_CAP_NO_ACCT_DIRTY | BDI_CAP_NO_ACCT_WB)
> @@ -177,6 +178,11 @@ long wait_iff_congested(struct pglist_data *pgdat, int sync, long timeout);
>  int pdflush_proc_obsolete(struct ctl_table *table, int write,
>                 void __user *buffer, size_t *lenp, loff_t *ppos);
>
> +static inline bool bdi_cap_sync_io_required(struct backing_dev_info *bdi)
> +{
> +       return bdi->capabilities & BDI_CAP_SYNC;
> +}
> +
>  static inline bool bdi_cap_stable_pages_required(struct backing_dev_info *bdi)
>  {
>         return bdi->capabilities & BDI_CAP_STABLE_WRITES;
> diff --git a/include/linux/swap.h b/include/linux/swap.h
> index d83d28e53e62..86457dbfd300 100644
> --- a/include/linux/swap.h
> +++ b/include/linux/swap.h
> @@ -152,8 +152,9 @@ enum {
>         SWP_AREA_DISCARD = (1 << 8),    /* single-time swap area discards */
>         SWP_PAGE_DISCARD = (1 << 9),    /* freed swap page-cluster discards */
>         SWP_STABLE_WRITES = (1 << 10),  /* no overwrite PG_writeback pages */
> +       SWP_SYNC_IO     = (1 << 11),
>                                         /* add others here before... */
> -       SWP_SCANNING    = (1 << 11),    /* refcount in scan_swap_map */
> +       SWP_SCANNING    = (1 << 12),    /* refcount in scan_swap_map */
>  };
>
>  #define SWAP_CLUSTER_MAX 32UL
> diff --git a/mm/page_io.c b/mm/page_io.c
> index b6c4ac388209..2c85e5182364 100644
> --- a/mm/page_io.c
> +++ b/mm/page_io.c
> @@ -263,7 +263,6 @@ static sector_t swap_page_sector(struct page *page)
>  int __swap_writepage(struct page *page, struct writeback_control *wbc,
>                 bio_end_io_t end_write_func)
>  {
> -       struct bio *bio;
>         int ret;
>         struct swap_info_struct *sis = page_swap_info(page);
>
> @@ -316,25 +315,44 @@ int __swap_writepage(struct page *page, struct writeback_control *wbc,
>         }
>
>         ret = 0;
> -       bio = get_swap_bio(GFP_NOIO, page, end_write_func);
> -       if (bio == NULL) {
> -               set_page_dirty(page);
> +       count_vm_event(PSWPOUT);
> +
> +       if (!(sis->flags & SWP_SYNC_IO)) {
> +               struct bio *bio;
> +
> +               bio = get_swap_bio(GFP_NOIO, page, end_write_func);
> +               if (bio == NULL) {
> +                       set_page_dirty(page);
> +                       unlock_page(page);
> +                       ret = -ENOMEM;
> +                       goto out;
> +               }
> +               bio->bi_opf = REQ_OP_WRITE | wbc_to_write_flags(wbc);
> +               set_page_writeback(page);
>                 unlock_page(page);
> -               ret = -ENOMEM;
> -               goto out;
> +               submit_bio(bio);
> +       } else {
> +               struct bio bio;
> +               struct bio_vec bvec;
> +
> +               bio_init(&bio, &bvec, 1);
> +
> +               bio.bi_iter.bi_sector = map_swap_page(page, &bio.bi_bdev);
> +               bio.bi_iter.bi_sector <<= PAGE_SHIFT - 9;
> +               bio.bi_end_io = end_write_func;
> +               bio_add_page(&bio, page, PAGE_SIZE, 0);
> +               bio.bi_opf = REQ_OP_WRITE | wbc_to_write_flags(wbc);
> +               bio_get(&bio);
> +               set_page_writeback(page);
> +               unlock_page(page);
> +               submit_bio(&bio);
>         }
> -       bio->bi_opf = REQ_OP_WRITE | wbc_to_write_flags(wbc);
> -       count_vm_event(PSWPOUT);
> -       set_page_writeback(page);
> -       unlock_page(page);
> -       submit_bio(bio);
>  out:
>         return ret;
>  }
>
>  int swap_readpage(struct page *page, bool do_poll)
>  {
> -       struct bio *bio;
>         int ret = 0;
>         struct swap_info_struct *sis = page_swap_info(page);
>         blk_qc_t qc;
> @@ -371,29 +389,49 @@ int swap_readpage(struct page *page, bool do_poll)
>         }
>
>         ret = 0;
> -       bio = get_swap_bio(GFP_KERNEL, page, end_swap_bio_read);
> -       if (bio == NULL) {
> -               unlock_page(page);
> -               ret = -ENOMEM;
> -               goto out;
> -       }
> -       bdev = bio->bi_bdev;
> -       bio->bi_private = current;
> -       bio_set_op_attrs(bio, REQ_OP_READ, 0);
> -       count_vm_event(PSWPIN);
> -       bio_get(bio);
> -       qc = submit_bio(bio);
> -       while (do_poll) {
> -               set_current_state(TASK_UNINTERRUPTIBLE);
> -               if (!READ_ONCE(bio->bi_private))
> -                       break;
> -
> -               if (!blk_mq_poll(bdev_get_queue(bdev), qc))
> -                       break;
> +       if (!(sis->flags & SWP_SYNC_IO)) {
> +               struct bio *bio;
> +
> +               bio = get_swap_bio(GFP_KERNEL, page, end_swap_bio_read);
> +               if (bio == NULL) {
> +                       unlock_page(page);
> +                       ret = -ENOMEM;
> +                       goto out;
> +               }
> +               bdev = bio->bi_bdev;
> +               bio->bi_private = current;
> +               bio_set_op_attrs(bio, REQ_OP_READ, 0);
> +               bio_get(bio);
> +               qc = submit_bio(bio);
> +               while (do_poll) {
> +                       set_current_state(TASK_UNINTERRUPTIBLE);
> +                       if (!READ_ONCE(bio->bi_private))
> +                               break;
> +
> +                       if (!blk_mq_poll(bdev_get_queue(bdev), qc))
> +                               break;
> +               }
> +               __set_current_state(TASK_RUNNING);
> +               bio_put(bio);
> +       } else {
> +               struct bio bio;
> +               struct bio_vec bvec;
> +
> +               bio_init(&bio, &bvec, 1);
> +
> +               bio.bi_iter.bi_sector = map_swap_page(page, &bio.bi_bdev);
> +               bio.bi_iter.bi_sector <<= PAGE_SHIFT - 9;
> +               bio.bi_end_io = end_swap_bio_read;
> +               bio_add_page(&bio, page, PAGE_SIZE, 0);
> +               bio.bi_private = current;
> +               BUG_ON(bio.bi_iter.bi_size != PAGE_SIZE);
> +               bio_set_op_attrs(&bio, REQ_OP_READ, 0);
> +               /* end_swap_bio_read calls bio_put unconditionally */
> +               bio_get(&bio);
> +               submit_bio(&bio);
>         }
> -       __set_current_state(TASK_RUNNING);
> -       bio_put(bio);
>
> +       count_vm_event(PSWPIN);
>  out:
>         return ret;
>  }
> diff --git a/mm/swapfile.c b/mm/swapfile.c
> index 6ba4aab2db0b..855d50eeeaf9 100644
> --- a/mm/swapfile.c
> +++ b/mm/swapfile.c
> @@ -2931,6 +2931,9 @@ SYSCALL_DEFINE2(swapon, const char __user *, specialfile, int, swap_flags)
>         if (bdi_cap_stable_pages_required(inode_to_bdi(inode)))
>                 p->flags |= SWP_STABLE_WRITES;
>
> +       if (bdi_cap_sync_io_required(inode_to_bdi(inode)))
> +               p->flags |= SWP_SYNC_IO;
> +
>         if (p->bdev && blk_queue_nonrot(bdev_get_queue(p->bdev))) {
>                 int cpu;
>                 unsigned long ci, nr_cluster;
_______________________________________________
Linux-nvdimm mailing list
Linux-nvdimm@lists.01.org
https://lists.01.org/mailman/listinfo/linux-nvdimm

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [PATCH 0/3] remove rw_page() from brd, pmem and btt
  2017-08-03  0:13       ` Minchan Kim
@ 2017-08-03  8:05         ` Christoph Hellwig
  -1 siblings, 0 replies; 54+ messages in thread
From: Christoph Hellwig @ 2017-08-03  8:05 UTC (permalink / raw)
  To: Minchan Kim
  Cc: Jens Axboe, Jerome Marchand, linux-nvdimm, Dave Chinner,
	linux-kernel, Matthew Wilcox, Christoph Hellwig, seungho1.park,
	Jan Kara, karam . lee, Andrew Morton, Nitin Gupta

FYI, for the read side we should use the on-stack bio unconditionally,
as it will always be a win (or not show up at all).
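
(A rough sketch of that idea, reusing the field and helper names from the
4.13-era code quoted earlier; this is not a tested patch, and the simple
sleeping wait below gives up the blk_mq_poll() polling that swap_readpage()
currently does:)

    /* Sketch only: always build the swap-in bio on the stack and wait for
     * completion synchronously, so no bio is allocated on the read path. */
    struct bio bio;
    struct bio_vec bvec;
    int error;

    bio_init(&bio, &bvec, 1);
    bio.bi_iter.bi_sector = map_swap_page(page, &bio.bi_bdev);
    bio.bi_iter.bi_sector <<= PAGE_SHIFT - 9;
    bio_add_page(&bio, page, PAGE_SIZE, 0);
    bio_set_op_attrs(&bio, REQ_OP_READ, 0);
    /* submit_bio_wait() installs its own completion handler and sleeps
     * until the request finishes, so the on-stack bio cannot go out of
     * scope while the device is still using it. */
    error = submit_bio_wait(&bio);
    if (!error)
            SetPageUptodate(page);
    unlock_page(page);
    count_vm_event(PSWPIN);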
_______________________________________________
Linux-nvdimm mailing list
Linux-nvdimm@lists.01.org
https://lists.01.org/mailman/listinfo/linux-nvdimm

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [PATCH 1/3] btt: remove btt_rw_page()
  2017-07-28 16:56   ` Ross Zwisler
@ 2017-08-03 16:15     ` kbuild test robot
  -1 siblings, 0 replies; 54+ messages in thread
From: kbuild test robot @ 2017-08-03 16:15 UTC (permalink / raw)
  To: Ross Zwisler
  Cc: Jens Axboe, Jerome Marchand, linux-nvdimm, Dave Chinner,
	linux-kernel, Matthew Wilcox, Christoph Hellwig, Minchan Kim,
	kbuild-all, seungho1.park, Jan Kara, Andrew Morton, karam . lee,
	Nitin Gupta

Hi Ross,

[auto build test WARNING on linux-nvdimm/libnvdimm-for-next]
[also build test WARNING on v4.13-rc3]
[cannot apply to next-20170803]
[if your patch is applied to the wrong git tree, please drop us a note to help improve the system]

url:    https://github.com/0day-ci/linux/commits/Ross-Zwisler/btt-remove-btt_rw_page/20170729-165642
base:   https://git.kernel.org/pub/scm/linux/kernel/git/nvdimm/nvdimm.git libnvdimm-for-next
config: i386-randconfig-h0-08032208 (attached as .config)
compiler: gcc-6 (Debian 6.2.0-3) 6.2.0 20160901
reproduce:
        # save the attached .config to linux build tree
        make ARCH=i386 

Note: it may well be a false-positive warning. FWIW you are at least aware of it now.
http://gcc.gnu.org/wiki/Better_Uninitialized_Warnings

All warnings (new ones prefixed by >>):

   In file included from drivers/nvdimm/btt.c:27:0:
   drivers/nvdimm/btt.c: In function 'btt_make_request':
>> drivers/nvdimm/nd.h:407:2: warning: 'start' may be used uninitialized in this function [-Wmaybe-uninitialized]
     generic_end_io_acct(bio_data_dir(bio), &disk->part0, start);
     ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
   drivers/nvdimm/btt.c:1202:16: note: 'start' was declared here
     unsigned long start;
                   ^~~~~

vim +/start +407 drivers/nvdimm/nd.h

cd03412a5 Dan Williams 2016-03-11  340  
3d88002e4 Dan Williams 2015-05-31  341  struct nd_region *to_nd_region(struct device *dev);
3d88002e4 Dan Williams 2015-05-31  342  int nd_region_to_nstype(struct nd_region *nd_region);
3d88002e4 Dan Williams 2015-05-31  343  int nd_region_register_namespaces(struct nd_region *nd_region, int *err);
c12c48ce8 Dan Williams 2017-06-04  344  u64 nd_region_interleave_set_cookie(struct nd_region *nd_region,
c12c48ce8 Dan Williams 2017-06-04  345  		struct nd_namespace_index *nsindex);
86ef58a4e Dan Williams 2017-02-28  346  u64 nd_region_interleave_set_altcookie(struct nd_region *nd_region);
3d88002e4 Dan Williams 2015-05-31  347  void nvdimm_bus_lock(struct device *dev);
3d88002e4 Dan Williams 2015-05-31  348  void nvdimm_bus_unlock(struct device *dev);
3d88002e4 Dan Williams 2015-05-31  349  bool is_nvdimm_bus_locked(struct device *dev);
581388209 Dan Williams 2015-06-23  350  int nvdimm_revalidate_disk(struct gendisk *disk);
bf9bccc14 Dan Williams 2015-06-17  351  void nvdimm_drvdata_release(struct kref *kref);
bf9bccc14 Dan Williams 2015-06-17  352  void put_ndd(struct nvdimm_drvdata *ndd);
4a826c83d Dan Williams 2015-06-09  353  int nd_label_reserve_dpa(struct nvdimm_drvdata *ndd);
4a826c83d Dan Williams 2015-06-09  354  void nvdimm_free_dpa(struct nvdimm_drvdata *ndd, struct resource *res);
4a826c83d Dan Williams 2015-06-09  355  struct resource *nvdimm_allocate_dpa(struct nvdimm_drvdata *ndd,
4a826c83d Dan Williams 2015-06-09  356  		struct nd_label_id *label_id, resource_size_t start,
4a826c83d Dan Williams 2015-06-09  357  		resource_size_t n);
8c2f7e865 Dan Williams 2015-06-25  358  resource_size_t nvdimm_namespace_capacity(struct nd_namespace_common *ndns);
8c2f7e865 Dan Williams 2015-06-25  359  struct nd_namespace_common *nvdimm_namespace_common_probe(struct device *dev);
5212e11fd Vishal Verma 2015-06-25  360  int nvdimm_namespace_attach_btt(struct nd_namespace_common *ndns);
298f2bc5d Dan Williams 2016-03-15  361  int nvdimm_namespace_detach_btt(struct nd_btt *nd_btt);
5212e11fd Vishal Verma 2015-06-25  362  const char *nvdimm_namespace_disk_name(struct nd_namespace_common *ndns,
5212e11fd Vishal Verma 2015-06-25  363  		char *name);
f979b13c3 Dan Williams 2017-06-04  364  unsigned int pmem_sector_size(struct nd_namespace_common *ndns);
a39018029 Dan Williams 2016-04-07  365  void nvdimm_badblocks_populate(struct nd_region *nd_region,
a39018029 Dan Williams 2016-04-07  366  		struct badblocks *bb, const struct resource *res);
200c79da8 Dan Williams 2016-03-22  367  #if IS_ENABLED(CONFIG_ND_CLAIM)
ac515c084 Dan Williams 2016-03-22  368  struct vmem_altmap *nvdimm_setup_pfn(struct nd_pfn *nd_pfn,
ac515c084 Dan Williams 2016-03-22  369  		struct resource *res, struct vmem_altmap *altmap);
200c79da8 Dan Williams 2016-03-22  370  int devm_nsio_enable(struct device *dev, struct nd_namespace_io *nsio);
200c79da8 Dan Williams 2016-03-22  371  void devm_nsio_disable(struct device *dev, struct nd_namespace_io *nsio);
200c79da8 Dan Williams 2016-03-22  372  #else
ac515c084 Dan Williams 2016-03-22  373  static inline struct vmem_altmap *nvdimm_setup_pfn(struct nd_pfn *nd_pfn,
ac515c084 Dan Williams 2016-03-22  374  		struct resource *res, struct vmem_altmap *altmap)
ac515c084 Dan Williams 2016-03-22  375  {
ac515c084 Dan Williams 2016-03-22  376  	return ERR_PTR(-ENXIO);
ac515c084 Dan Williams 2016-03-22  377  }
200c79da8 Dan Williams 2016-03-22  378  static inline int devm_nsio_enable(struct device *dev,
200c79da8 Dan Williams 2016-03-22  379  		struct nd_namespace_io *nsio)
200c79da8 Dan Williams 2016-03-22  380  {
200c79da8 Dan Williams 2016-03-22  381  	return -ENXIO;
200c79da8 Dan Williams 2016-03-22  382  }
200c79da8 Dan Williams 2016-03-22  383  static inline void devm_nsio_disable(struct device *dev,
200c79da8 Dan Williams 2016-03-22  384  		struct nd_namespace_io *nsio)
200c79da8 Dan Williams 2016-03-22  385  {
200c79da8 Dan Williams 2016-03-22  386  }
200c79da8 Dan Williams 2016-03-22  387  #endif
047fc8a1f Ross Zwisler 2015-06-25  388  int nd_blk_region_init(struct nd_region *nd_region);
e5ae3b252 Dan Williams 2016-06-07  389  int nd_region_activate(struct nd_region *nd_region);
f0dc089ce Dan Williams 2015-05-16  390  void __nd_iostat_start(struct bio *bio, unsigned long *start);
f0dc089ce Dan Williams 2015-05-16  391  static inline bool nd_iostat_start(struct bio *bio, unsigned long *start)
f0dc089ce Dan Williams 2015-05-16  392  {
f0dc089ce Dan Williams 2015-05-16  393  	struct gendisk *disk = bio->bi_bdev->bd_disk;
f0dc089ce Dan Williams 2015-05-16  394  
f0dc089ce Dan Williams 2015-05-16  395  	if (!blk_queue_io_stat(disk->queue))
f0dc089ce Dan Williams 2015-05-16  396  		return false;
f0dc089ce Dan Williams 2015-05-16  397  
8d7c22ac0 Toshi Kani   2016-10-19  398  	*start = jiffies;
8d7c22ac0 Toshi Kani   2016-10-19  399  	generic_start_io_acct(bio_data_dir(bio),
8d7c22ac0 Toshi Kani   2016-10-19  400  			      bio_sectors(bio), &disk->part0);
f0dc089ce Dan Williams 2015-05-16  401  	return true;
f0dc089ce Dan Williams 2015-05-16  402  }
8d7c22ac0 Toshi Kani   2016-10-19  403  static inline void nd_iostat_end(struct bio *bio, unsigned long start)
8d7c22ac0 Toshi Kani   2016-10-19  404  {
8d7c22ac0 Toshi Kani   2016-10-19  405  	struct gendisk *disk = bio->bi_bdev->bd_disk;
8d7c22ac0 Toshi Kani   2016-10-19  406  
8d7c22ac0 Toshi Kani   2016-10-19 @407  	generic_end_io_acct(bio_data_dir(bio), &disk->part0, start);

:::::: The code at line 407 was first introduced by commit
:::::: 8d7c22ac0c036978a072b7e13c607b5402c474e0 libnvdimm: use generic iostat interfaces

:::::: TO: Toshi Kani <toshi.kani@hpe.com>
:::::: CC: Dan Williams <dan.j.williams@intel.com>
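
The excerpt above shows why gcc complains: nd_iostat_start() only writes
*start when blk_queue_io_stat() is true, so the compiler cannot prove that
the later nd_iostat_end() call reads an initialized value. A common way to
quiet this kind of false positive (illustrative only, not a change taken
from this thread) is to initialize the variable at its declaration:

    /* Illustrative pattern: the nd_iostat_end() call is already guarded by
     * the return value of nd_iostat_start(), so the warning is a false
     * positive; initializing 'start' merely silences -Wmaybe-uninitialized. */
    unsigned long start = 0;
    bool do_acct;

    do_acct = nd_iostat_start(bio, &start);
    /* ... build and submit the I/O ... */
    if (do_acct)
            nd_iostat_end(bio, start);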

---
0-DAY kernel test infrastructure                Open Source Technology Center
https://lists.01.org/pipermail/kbuild-all                   Intel Corporation
_______________________________________________
Linux-nvdimm mailing list
Linux-nvdimm@lists.01.org
https://lists.01.org/mailman/listinfo/linux-nvdimm

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [PATCH 0/3] remove rw_page() from brd, pmem and btt
  2017-08-03  0:13       ` Minchan Kim
@ 2017-08-03 21:13         ` Ross Zwisler
  -1 siblings, 0 replies; 54+ messages in thread
From: Ross Zwisler @ 2017-08-03 21:13 UTC (permalink / raw)
  To: Minchan Kim
  Cc: Jens Axboe, Jerome Marchand, linux-nvdimm, Dave Chinner,
	linux-kernel, Matthew Wilcox, Christoph Hellwig, seungho1.park,
	Jan Kara, karam . lee, Andrew Morton, Nitin Gupta

On Thu, Aug 03, 2017 at 09:13:15AM +0900, Minchan Kim wrote:
> Hi Ross,
> 
> On Wed, Aug 02, 2017 at 04:13:59PM -0600, Ross Zwisler wrote:
> > On Fri, Jul 28, 2017 at 10:31:43AM -0700, Matthew Wilcox wrote:
> > > On Fri, Jul 28, 2017 at 10:56:01AM -0600, Ross Zwisler wrote:
> > > > Dan Williams and Christoph Hellwig have recently expressed doubt about
> > > > whether the rw_page() interface made sense for synchronous memory drivers
> > > > [1][2].  It's unclear whether this interface has any performance benefit
> > > > for these drivers, but as we continue to fix bugs it is clear that it does
> > > > have a maintenance burden.  This series removes the rw_page()
> > > > implementations in brd, pmem and btt to relieve this burden.
> > > 
> > > Why don't you measure whether it has performance benefits?  I don't
> > > understand why zram would see performance benefits and not other drivers.
> > > If it's going to be removed, then the whole interface should be removed,
> > > not just have the implementations removed from some drivers.
> > 
> > Okay, I've run a bunch of performance tests with the PMEM and with BTT entry
> > points for rw_pages() in a swap workload, and in all cases I do see an
> > improvement over the code when rw_pages() is removed.  Here are the results
> > from my random lab box:
> > 
> >   Average latency of swap_writepage()
> > +------+------------+---------+-------------+
> > |      | no rw_page | rw_page | Improvement |
> > +-------------------------------------------+
> > | PMEM |  5.0 us    |  4.7 us |     6%      |
> > +-------------------------------------------+
> > |  BTT |  6.8 us    |  6.1 us |    10%      |
> > +------+------------+---------+-------------+
> > 
> >   Average latency of swap_readpage()
> > +------+------------+---------+-------------+
> > |      | no rw_page | rw_page | Improvement |
> > +-------------------------------------------+
> > | PMEM |  3.3 us    |  2.9 us |    12%      |
> > +-------------------------------------------+
> > |  BTT |  3.7 us    |  3.4 us |     8%      |
> > +------+------------+---------+-------------+
> > 
> > The workload was pmbench, a memory benchmark, run on a system where I had
> > severely restricted the amount of memory in the system with the 'mem' kernel
> > command line parameter.  The benchmark was set up to test more memory than I
> > allowed the OS to have so it spilled over into swap.
> > 
> > The PMEM or BTT device was set up as my swap device, and during the test I got
> > a few hundred thousand samples of each of swap_writepage() and
> > swap_writepage().  The PMEM/BTT device was just memory reserved with the
> > memmap kernel command line parameter.
> > 
> > Thanks, Matthew, for asking for performance data.  It looks like removing this
> > code would have been a mistake.
> 
> By suggestion of Christoph Hellwig, I made a quick patch which does IO without
> dynamic bio allocation for swap IO. Actually, it's not formal patch to be
> worth to send mainline yet but I believe it's enough to test the improvement.
> 
> Could you test patchset on pmem and btt without rw_page?
> 
> For working the patch, block drivers need to declare it's synchronous IO
> device via BDI_CAP_SYNC but if it's hard, you can just make every swap IO
> comes from (sis->flags & SWP_SYNC_IO) with removing condition check
> 
> if (!(sis->flags & SWP_SYNC_IO)) in swap_[read|write]page.
> 
> Patchset is based on 4.13-rc3.
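(As an aside, the driver-side half of that is tiny; roughly the following,
sketched here with the BDI_CAP_SYNC name from your description rather than
taken from your patch:)

/*
 * Sketch, not taken from Minchan's patchset: a synchronous driver
 * advertises the capability on its backing_dev_info, and the swap
 * code then sets SWP_SYNC_IO for swap devices sitting on it.
 */
static void pmem_advertise_sync_io(struct request_queue *q)
{
	q->backing_dev_info->capabilities |= BDI_CAP_SYNC;
}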

Thanks for the patch, here are the updated results from my test box:

 Average latency of swap_writepage()
+------+------------+---------+---------+
|      | no rw_page | minchan | rw_page |
+----------------------------------------
| PMEM |  5.0 us    | 4.98 us |  4.7 us |
+----------------------------------------
|  BTT |  6.8 us    | 6.3 us  |  6.1 us |
+------+------------+---------+---------+
  				   
 Average latency of swap_readpage()
+------+------------+---------+---------+
|      | no rw_page | minchan | rw_page |
+----------------------------------------
| PMEM |  3.3 us    | 3.27 us |  2.9 us |
+----------------------------------------
|  BTT |  3.7 us    | 3.44 us |  3.4 us |
+------+------------+---------+---------+

I've added another digit in precision in some cases to help differentiate the
various results.

In all cases your patches did perform better than with the regularly allocated
BIO, but again for all cases the rw_page() path was the fastest, even if only
marginally.

- Ross
_______________________________________________
Linux-nvdimm mailing list
Linux-nvdimm@lists.01.org
https://lists.01.org/mailman/listinfo/linux-nvdimm

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [PATCH 0/3] remove rw_page() from brd, pmem and btt
@ 2017-08-03 21:13         ` Ross Zwisler
  0 siblings, 0 replies; 54+ messages in thread
From: Ross Zwisler @ 2017-08-03 21:13 UTC (permalink / raw)
  To: Minchan Kim
  Cc: Ross Zwisler, Matthew Wilcox, Andrew Morton, linux-kernel,
	karam . lee, Jerome Marchand, Nitin Gupta, seungho1.park,
	Christoph Hellwig, Dan Williams, Dave Chinner, Jan Kara,
	Jens Axboe, Vishal Verma, linux-nvdimm

On Thu, Aug 03, 2017 at 09:13:15AM +0900, Minchan Kim wrote:
> Hi Ross,
> 
> On Wed, Aug 02, 2017 at 04:13:59PM -0600, Ross Zwisler wrote:
> > On Fri, Jul 28, 2017 at 10:31:43AM -0700, Matthew Wilcox wrote:
> > > On Fri, Jul 28, 2017 at 10:56:01AM -0600, Ross Zwisler wrote:
> > > > Dan Williams and Christoph Hellwig have recently expressed doubt about
> > > > whether the rw_page() interface made sense for synchronous memory drivers
> > > > [1][2].  It's unclear whether this interface has any performance benefit
> > > > for these drivers, but as we continue to fix bugs it is clear that it does
> > > > have a maintenance burden.  This series removes the rw_page()
> > > > implementations in brd, pmem and btt to relieve this burden.
> > > 
> > > Why don't you measure whether it has performance benefits?  I don't
> > > understand why zram would see performance benefits and not other drivers.
> > > If it's going to be removed, then the whole interface should be removed,
> > > not just have the implementations removed from some drivers.
> > 
> > Okay, I've run a bunch of performance tests with the PMEM and with BTT entry
> > points for rw_pages() in a swap workload, and in all cases I do see an
> > improvement over the code when rw_pages() is removed.  Here are the results
> > from my random lab box:
> > 
> >   Average latency of swap_writepage()
> > +------+------------+---------+-------------+
> > |      | no rw_page | rw_page | Improvement |
> > +-------------------------------------------+
> > | PMEM |  5.0 us    |  4.7 us |     6%      |
> > +-------------------------------------------+
> > |  BTT |  6.8 us    |  6.1 us |    10%      |
> > +------+------------+---------+-------------+
> > 
> >   Average latency of swap_readpage()
> > +------+------------+---------+-------------+
> > |      | no rw_page | rw_page | Improvement |
> > +-------------------------------------------+
> > | PMEM |  3.3 us    |  2.9 us |    12%      |
> > +-------------------------------------------+
> > |  BTT |  3.7 us    |  3.4 us |     8%      |
> > +------+------------+---------+-------------+
> > 
> > The workload was pmbench, a memory benchmark, run on a system where I had
> > severely restricted the amount of memory in the system with the 'mem' kernel
> > command line parameter.  The benchmark was set up to test more memory than I
> > allowed the OS to have so it spilled over into swap.
> > 
> > The PMEM or BTT device was set up as my swap device, and during the test I got
> > a few hundred thousand samples of each of swap_writepage() and
> > swap_writepage().  The PMEM/BTT device was just memory reserved with the
> > memmap kernel command line parameter.
> > 
> > Thanks, Matthew, for asking for performance data.  It looks like removing this
> > code would have been a mistake.
> 
> By suggestion of Christoph Hellwig, I made a quick patch which does IO without
> dynamic bio allocation for swap IO. Actually, it's not formal patch to be
> worth to send mainline yet but I believe it's enough to test the improvement.
> 
> Could you test patchset on pmem and btt without rw_page?
> 
> For working the patch, block drivers need to declare it's synchronous IO
> device via BDI_CAP_SYNC but if it's hard, you can just make every swap IO
> comes from (sis->flags & SWP_SYNC_IO) with removing condition check
> 
> if (!(sis->flags & SWP_SYNC_IO)) in swap_[read|write]page.
> 
> Patchset is based on 4.13-rc3.

Thanks for the patch, here are the updated results from my test box:

 Average latency of swap_writepage()
+------+------------+---------+---------+
|      | no rw_page | minchan | rw_page |
+----------------------------------------
| PMEM |  5.0 us    | 4.98 us |  4.7 us |
+----------------------------------------
|  BTT |  6.8 us    | 6.3 us  |  6.1 us |
+------+------------+---------+---------+
  				   
 Average latency of swap_readpage()
+------+------------+---------+---------+
|      | no rw_page | minchan | rw_page |
+----------------------------------------
| PMEM |  3.3 us    | 3.27 us |  2.9 us |
+----------------------------------------
|  BTT |  3.7 us    | 3.44 us |  3.4 us |
+------+------------+---------+---------+

I've added another digit in precision in some cases to help differentiate the
various results.

In all cases your patches did perform better than with the regularly allocated
BIO, but again for all cases the rw_page() path was the fastest, even if only
marginally.

- Ross

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [PATCH 0/3] remove rw_page() from brd, pmem and btt
  2017-08-03 21:13         ` Ross Zwisler
@ 2017-08-03 21:17           ` Jens Axboe
  -1 siblings, 0 replies; 54+ messages in thread
From: Jens Axboe @ 2017-08-03 21:17 UTC (permalink / raw)
  To: Ross Zwisler, Minchan Kim
  Cc: Jerome Marchand, linux-nvdimm, Dave Chinner, linux-kernel,
	Matthew Wilcox, Christoph Hellwig, Jan Kara, Andrew Morton,
	karam . lee, seungho1.park, Nitin Gupta

On 08/03/2017 03:13 PM, Ross Zwisler wrote:
> On Thu, Aug 03, 2017 at 09:13:15AM +0900, Minchan Kim wrote:
>> Hi Ross,
>>
>> On Wed, Aug 02, 2017 at 04:13:59PM -0600, Ross Zwisler wrote:
>>> On Fri, Jul 28, 2017 at 10:31:43AM -0700, Matthew Wilcox wrote:
>>>> On Fri, Jul 28, 2017 at 10:56:01AM -0600, Ross Zwisler wrote:
>>>>> Dan Williams and Christoph Hellwig have recently expressed doubt about
>>>>> whether the rw_page() interface made sense for synchronous memory drivers
>>>>> [1][2].  It's unclear whether this interface has any performance benefit
>>>>> for these drivers, but as we continue to fix bugs it is clear that it does
>>>>> have a maintenance burden.  This series removes the rw_page()
>>>>> implementations in brd, pmem and btt to relieve this burden.
>>>>
>>>> Why don't you measure whether it has performance benefits?  I don't
>>>> understand why zram would see performance benefits and not other drivers.
>>>> If it's going to be removed, then the whole interface should be removed,
>>>> not just have the implementations removed from some drivers.
>>>
>>> Okay, I've run a bunch of performance tests with the PMEM and with BTT entry
>>> points for rw_pages() in a swap workload, and in all cases I do see an
>>> improvement over the code when rw_pages() is removed.  Here are the results
>>> from my random lab box:
>>>
>>>   Average latency of swap_writepage()
>>> +------+------------+---------+-------------+
>>> |      | no rw_page | rw_page | Improvement |
>>> +-------------------------------------------+
>>> | PMEM |  5.0 us    |  4.7 us |     6%      |
>>> +-------------------------------------------+
>>> |  BTT |  6.8 us    |  6.1 us |    10%      |
>>> +------+------------+---------+-------------+
>>>
>>>   Average latency of swap_readpage()
>>> +------+------------+---------+-------------+
>>> |      | no rw_page | rw_page | Improvement |
>>> +-------------------------------------------+
>>> | PMEM |  3.3 us    |  2.9 us |    12%      |
>>> +-------------------------------------------+
>>> |  BTT |  3.7 us    |  3.4 us |     8%      |
>>> +------+------------+---------+-------------+
>>>
>>> The workload was pmbench, a memory benchmark, run on a system where I had
>>> severely restricted the amount of memory in the system with the 'mem' kernel
>>> command line parameter.  The benchmark was set up to test more memory than I
>>> allowed the OS to have so it spilled over into swap.
>>>
>>> The PMEM or BTT device was set up as my swap device, and during the test I got
>>> a few hundred thousand samples of each of swap_writepage() and
>>> swap_writepage().  The PMEM/BTT device was just memory reserved with the
>>> memmap kernel command line parameter.
>>>
>>> Thanks, Matthew, for asking for performance data.  It looks like removing this
>>> code would have been a mistake.
>>
>> By suggestion of Christoph Hellwig, I made a quick patch which does IO without
>> dynamic bio allocation for swap IO. Actually, it's not formal patch to be
>> worth to send mainline yet but I believe it's enough to test the improvement.
>>
>> Could you test patchset on pmem and btt without rw_page?
>>
>> For working the patch, block drivers need to declare it's synchronous IO
>> device via BDI_CAP_SYNC but if it's hard, you can just make every swap IO
>> comes from (sis->flags & SWP_SYNC_IO) with removing condition check
>>
>> if (!(sis->flags & SWP_SYNC_IO)) in swap_[read|write]page.
>>
>> Patchset is based on 4.13-rc3.
> 
> Thanks for the patch, here are the updated results from my test box:
> 
>  Average latency of swap_writepage()
> +------+------------+---------+---------+
> |      | no rw_page | minchan | rw_page |
> +----------------------------------------
> | PMEM |  5.0 us    | 4.98 us |  4.7 us |
> +----------------------------------------
> |  BTT |  6.8 us    | 6.3 us  |  6.1 us |
> +------+------------+---------+---------+
>   				   
>  Average latency of swap_readpage()
> +------+------------+---------+---------+
> |      | no rw_page | minchan | rw_page |
> +----------------------------------------
> | PMEM |  3.3 us    | 3.27 us |  2.9 us |
> +----------------------------------------
> |  BTT |  3.7 us    | 3.44 us |  3.4 us |
> +------+------------+---------+---------+
> 
> I've added another digit in precision in some cases to help differentiate the
> various results.
> 
> In all cases your patches did perform better than with the regularly allocated
> BIO, but again for all cases the rw_page() path was the fastest, even if only
> marginally.

IMHO, the win needs to be pretty substantial to justify keeping a
parallel read/write path in the kernel. The recent work of making
O_DIRECT faster is exactly the same as what Minchan did here for sync
IO. I would greatly prefer one fast path, instead of one fast and one
that's just a little faster for some things. It's much better to get
everyone behind one path/stack, and make that as fast as it can be.

-- 
Jens Axboe

_______________________________________________
Linux-nvdimm mailing list
Linux-nvdimm@lists.01.org
https://lists.01.org/mailman/listinfo/linux-nvdimm

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [PATCH 0/3] remove rw_page() from brd, pmem and btt
@ 2017-08-03 21:17           ` Jens Axboe
  0 siblings, 0 replies; 54+ messages in thread
From: Jens Axboe @ 2017-08-03 21:17 UTC (permalink / raw)
  To: Ross Zwisler, Minchan Kim
  Cc: Matthew Wilcox, Andrew Morton, linux-kernel, karam . lee,
	Jerome Marchand, Nitin Gupta, seungho1.park, Christoph Hellwig,
	Dan Williams, Dave Chinner, Jan Kara, Vishal Verma, linux-nvdimm

On 08/03/2017 03:13 PM, Ross Zwisler wrote:
> On Thu, Aug 03, 2017 at 09:13:15AM +0900, Minchan Kim wrote:
>> Hi Ross,
>>
>> On Wed, Aug 02, 2017 at 04:13:59PM -0600, Ross Zwisler wrote:
>>> On Fri, Jul 28, 2017 at 10:31:43AM -0700, Matthew Wilcox wrote:
>>>> On Fri, Jul 28, 2017 at 10:56:01AM -0600, Ross Zwisler wrote:
>>>>> Dan Williams and Christoph Hellwig have recently expressed doubt about
>>>>> whether the rw_page() interface made sense for synchronous memory drivers
>>>>> [1][2].  It's unclear whether this interface has any performance benefit
>>>>> for these drivers, but as we continue to fix bugs it is clear that it does
>>>>> have a maintenance burden.  This series removes the rw_page()
>>>>> implementations in brd, pmem and btt to relieve this burden.
>>>>
>>>> Why don't you measure whether it has performance benefits?  I don't
>>>> understand why zram would see performance benefits and not other drivers.
>>>> If it's going to be removed, then the whole interface should be removed,
>>>> not just have the implementations removed from some drivers.
>>>
>>> Okay, I've run a bunch of performance tests with the PMEM and with BTT entry
>>> points for rw_pages() in a swap workload, and in all cases I do see an
>>> improvement over the code when rw_pages() is removed.  Here are the results
>>> from my random lab box:
>>>
>>>   Average latency of swap_writepage()
>>> +------+------------+---------+-------------+
>>> |      | no rw_page | rw_page | Improvement |
>>> +-------------------------------------------+
>>> | PMEM |  5.0 us    |  4.7 us |     6%      |
>>> +-------------------------------------------+
>>> |  BTT |  6.8 us    |  6.1 us |    10%      |
>>> +------+------------+---------+-------------+
>>>
>>>   Average latency of swap_readpage()
>>> +------+------------+---------+-------------+
>>> |      | no rw_page | rw_page | Improvement |
>>> +-------------------------------------------+
>>> | PMEM |  3.3 us    |  2.9 us |    12%      |
>>> +-------------------------------------------+
>>> |  BTT |  3.7 us    |  3.4 us |     8%      |
>>> +------+------------+---------+-------------+
>>>
>>> The workload was pmbench, a memory benchmark, run on a system where I had
>>> severely restricted the amount of memory in the system with the 'mem' kernel
>>> command line parameter.  The benchmark was set up to test more memory than I
>>> allowed the OS to have so it spilled over into swap.
>>>
>>> The PMEM or BTT device was set up as my swap device, and during the test I got
>>> a few hundred thousand samples of each of swap_writepage() and
>>> swap_writepage().  The PMEM/BTT device was just memory reserved with the
>>> memmap kernel command line parameter.
>>>
>>> Thanks, Matthew, for asking for performance data.  It looks like removing this
>>> code would have been a mistake.
>>
>> By suggestion of Christoph Hellwig, I made a quick patch which does IO without
>> dynamic bio allocation for swap IO. Actually, it's not formal patch to be
>> worth to send mainline yet but I believe it's enough to test the improvement.
>>
>> Could you test patchset on pmem and btt without rw_page?
>>
>> For working the patch, block drivers need to declare it's synchronous IO
>> device via BDI_CAP_SYNC but if it's hard, you can just make every swap IO
>> comes from (sis->flags & SWP_SYNC_IO) with removing condition check
>>
>> if (!(sis->flags & SWP_SYNC_IO)) in swap_[read|write]page.
>>
>> Patchset is based on 4.13-rc3.
> 
> Thanks for the patch, here are the updated results from my test box:
> 
>  Average latency of swap_writepage()
> +------+------------+---------+---------+
> |      | no rw_page | minchan | rw_page |
> +----------------------------------------
> | PMEM |  5.0 us    | 4.98 us |  4.7 us |
> +----------------------------------------
> |  BTT |  6.8 us    | 6.3 us  |  6.1 us |
> +------+------------+---------+---------+
>   				   
>  Average latency of swap_readpage()
> +------+------------+---------+---------+
> |      | no rw_page | minchan | rw_page |
> +----------------------------------------
> | PMEM |  3.3 us    | 3.27 us |  2.9 us |
> +----------------------------------------
> |  BTT |  3.7 us    | 3.44 us |  3.4 us |
> +------+------------+---------+---------+
> 
> I've added another digit in precision in some cases to help differentiate the
> various results.
> 
> In all cases your patches did perform better than with the regularly allocated
> BIO, but again for all cases the rw_page() path was the fastest, even if only
> marginally.

IMHO, the win needs to be pretty substantial to justify keeping a
parallel read/write path in the kernel. The recent work of making
O_DIRECT faster is exactly the same as what Minchan did here for sync
IO. I would greatly prefer one fast path, instead of one fast and one
that's just a little faster for some things. It's much better to get
everyone behind one path/stack, and make that as fast as it can be.

-- 
Jens Axboe

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [PATCH 0/3] remove rw_page() from brd, pmem and btt
  2017-08-03  8:05         ` Christoph Hellwig
@ 2017-08-04  0:57           ` Minchan Kim
  -1 siblings, 0 replies; 54+ messages in thread
From: Minchan Kim @ 2017-08-04  0:57 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Jens Axboe, Jerome Marchand, linux-nvdimm, Dave Chinner,
	linux-kernel, Matthew Wilcox, seungho1.park, Jan Kara,
	karam . lee, Andrew Morton, Nitin Gupta

On Thu, Aug 03, 2017 at 10:05:44AM +0200, Christoph Hellwig wrote:
> FYI, for the read side we should use the on-stack bio unconditionally,
> as it will always be a win (or not show up at all).

Think about readahead. Using an on-stack bio unconditionally to read the
pages around the faulted address would cause a latency peak. So I want to
use synchronous IO only if the device says "Hey, I'm synchronous".
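
Something like this, just as a sketch of the idea (not the actual patch;
SWP_SYNC_IO and BDI_CAP_SYNC are the names used in this thread, and the
helper name is made up):

/*
 * Sketch only, not the actual patch: take the on-stack-bio path only
 * when the swap device advertised itself as synchronous (SWP_SYNC_IO,
 * derived from BDI_CAP_SYNC), so readahead on ordinary block devices
 * keeps using async bios.
 */
static int swap_readpage_sync(struct swap_info_struct *sis, struct page *page)
{
	struct bio bio;
	struct bio_vec bvec;
	int ret;

	if (!(sis->flags & SWP_SYNC_IO))
		return -EOPNOTSUPP;	/* caller falls back to the async path */

	bio_init(&bio, &bvec, 1);
	bio.bi_bdev = sis->bdev;
	bio.bi_iter.bi_sector = swap_page_sector(page);
	bio_set_op_attrs(&bio, REQ_OP_READ, 0);
	bio_add_page(&bio, page, PAGE_SIZE, 0);

	ret = submit_bio_wait(&bio);
	if (!ret)
		SetPageUptodate(page);
	unlock_page(page);
	return ret;
}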
_______________________________________________
Linux-nvdimm mailing list
Linux-nvdimm@lists.01.org
https://lists.01.org/mailman/listinfo/linux-nvdimm

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [PATCH 0/3] remove rw_page() from brd, pmem and btt
@ 2017-08-04  0:57           ` Minchan Kim
  0 siblings, 0 replies; 54+ messages in thread
From: Minchan Kim @ 2017-08-04  0:57 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Ross Zwisler, Matthew Wilcox, Andrew Morton, linux-kernel,
	karam . lee, Jerome Marchand, Nitin Gupta, seungho1.park,
	Dan Williams, Dave Chinner, Jan Kara, Jens Axboe, Vishal Verma,
	linux-nvdimm

On Thu, Aug 03, 2017 at 10:05:44AM +0200, Christoph Hellwig wrote:
> FYI, for the read side we should use the on-stack bio unconditionally,
> as it will always be a win (or not show up at all).

Think about readahead. Using an on-stack bio unconditionally to read the
pages around the faulted address would cause a latency peak. So I want to
use synchronous IO only if the device says "Hey, I'm synchronous".

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [PATCH 0/3] remove rw_page() from brd, pmem and btt
  2017-08-03 21:13         ` Ross Zwisler
@ 2017-08-04  3:54           ` Minchan Kim
  -1 siblings, 0 replies; 54+ messages in thread
From: Minchan Kim @ 2017-08-04  3:54 UTC (permalink / raw)
  To: Ross Zwisler
  Cc: Jens Axboe, Jerome Marchand, linux-nvdimm, Dave Chinner,
	linux-kernel, Matthew Wilcox, Christoph Hellwig, Jan Kara,
	Andrew Morton, karam . lee, seungho1.park, Nitin Gupta

On Thu, Aug 03, 2017 at 03:13:35PM -0600, Ross Zwisler wrote:
> On Thu, Aug 03, 2017 at 09:13:15AM +0900, Minchan Kim wrote:
> > Hi Ross,
> > 
> > On Wed, Aug 02, 2017 at 04:13:59PM -0600, Ross Zwisler wrote:
> > > On Fri, Jul 28, 2017 at 10:31:43AM -0700, Matthew Wilcox wrote:
> > > > On Fri, Jul 28, 2017 at 10:56:01AM -0600, Ross Zwisler wrote:
> > > > > Dan Williams and Christoph Hellwig have recently expressed doubt about
> > > > > whether the rw_page() interface made sense for synchronous memory drivers
> > > > > [1][2].  It's unclear whether this interface has any performance benefit
> > > > > for these drivers, but as we continue to fix bugs it is clear that it does
> > > > > have a maintenance burden.  This series removes the rw_page()
> > > > > implementations in brd, pmem and btt to relieve this burden.
> > > > 
> > > > Why don't you measure whether it has performance benefits?  I don't
> > > > understand why zram would see performance benefits and not other drivers.
> > > > If it's going to be removed, then the whole interface should be removed,
> > > > not just have the implementations removed from some drivers.
> > > 
> > > Okay, I've run a bunch of performance tests with the PMEM and with BTT entry
> > > points for rw_pages() in a swap workload, and in all cases I do see an
> > > improvement over the code when rw_pages() is removed.  Here are the results
> > > from my random lab box:
> > > 
> > >   Average latency of swap_writepage()
> > > +------+------------+---------+-------------+
> > > |      | no rw_page | rw_page | Improvement |
> > > +-------------------------------------------+
> > > | PMEM |  5.0 us    |  4.7 us |     6%      |
> > > +-------------------------------------------+
> > > |  BTT |  6.8 us    |  6.1 us |    10%      |
> > > +------+------------+---------+-------------+
> > > 
> > >   Average latency of swap_readpage()
> > > +------+------------+---------+-------------+
> > > |      | no rw_page | rw_page | Improvement |
> > > +-------------------------------------------+
> > > | PMEM |  3.3 us    |  2.9 us |    12%      |
> > > +-------------------------------------------+
> > > |  BTT |  3.7 us    |  3.4 us |     8%      |
> > > +------+------------+---------+-------------+
> > > 
> > > The workload was pmbench, a memory benchmark, run on a system where I had
> > > severely restricted the amount of memory in the system with the 'mem' kernel
> > > command line parameter.  The benchmark was set up to test more memory than I
> > > allowed the OS to have so it spilled over into swap.
> > > 
> > > The PMEM or BTT device was set up as my swap device, and during the test I got
> > > a few hundred thousand samples of each of swap_writepage() and
> > > swap_writepage().  The PMEM/BTT device was just memory reserved with the
> > > memmap kernel command line parameter.
> > > 
> > > Thanks, Matthew, for asking for performance data.  It looks like removing this
> > > code would have been a mistake.
> > 
> > By suggestion of Christoph Hellwig, I made a quick patch which does IO without
> > dynamic bio allocation for swap IO. Actually, it's not formal patch to be
> > worth to send mainline yet but I believe it's enough to test the improvement.
> > 
> > Could you test patchset on pmem and btt without rw_page?
> > 
> > For working the patch, block drivers need to declare it's synchronous IO
> > device via BDI_CAP_SYNC but if it's hard, you can just make every swap IO
> > comes from (sis->flags & SWP_SYNC_IO) with removing condition check
> > 
> > if (!(sis->flags & SWP_SYNC_IO)) in swap_[read|write]page.
> > 
> > Patchset is based on 4.13-rc3.
> 
> Thanks for the patch, here are the updated results from my test box:
> 
>  Average latency of swap_writepage()
> +------+------------+---------+---------+
> |      | no rw_page | minchan | rw_page |
> +----------------------------------------
> | PMEM |  5.0 us    | 4.98 us |  4.7 us |
> +----------------------------------------
> |  BTT |  6.8 us    | 6.3 us  |  6.1 us |
> +------+------------+---------+---------+
>   				   
>  Average latency of swap_readpage()
> +------+------------+---------+---------+
> |      | no rw_page | minchan | rw_page |
> +----------------------------------------
> | PMEM |  3.3 us    | 3.27 us |  2.9 us |
> +----------------------------------------
> |  BTT |  3.7 us    | 3.44 us |  3.4 us |
> +------+------------+---------+---------+
> 
> I've added another digit in precision in some cases to help differentiate the
> various results.
> 
> In all cases your patches did perform better than with the regularly allocated
> BIO, but again for all cases the rw_page() path was the fastest, even if only
> marginally.

Thanks for the testing. Is your result within the noise level?

I cannot understand why PMEM doesn't see much gain while BTT is a significant
win (8%). I guess the no-rw_page BTT runs had more chances to wait on dynamic
bio allocation, and both my patch and the rw_page path reduced that
significantly. However, in the no-rw_page pmem runs there weren't many waits
on bio allocation because the device is so fast, so the difference comes
purely from the number of instructions executed. At a quick glance bio
init/submit is not trivial, so I do understand where the 12% enhancement
comes from, but I'm not sure it's a big enough difference in practice to be
worth the maintenance burden.
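
For reference, the two read paths being compared look roughly like this in
mm/page_io.c terms (an illustrative sketch, not the exact kernel code; the
helper name is made up, the other helpers are the v4.13 ones):

/*
 * Illustrative sketch, not the exact kernel code: the rw_page path
 * goes through bdev_read_page() and never touches a bio, while the
 * normal path allocates, initializes and submits one.
 */
static int swap_read_one_page(struct swap_info_struct *sis, struct page *page)
{
	struct bio *bio;

	/* rw_page path: no bio is built at all */
	if (!bdev_read_page(sis->bdev, swap_page_sector(page), page))
		return 0;

	/* bio path: the extra init/submit work discussed above */
	bio = get_swap_bio(GFP_KERNEL, page, end_swap_bio_read);
	if (!bio)
		return -ENOMEM;
	bio_set_op_attrs(bio, REQ_OP_READ, 0);
	submit_bio(bio);
	return 0;
}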
_______________________________________________
Linux-nvdimm mailing list
Linux-nvdimm@lists.01.org
https://lists.01.org/mailman/listinfo/linux-nvdimm

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [PATCH 0/3] remove rw_page() from brd, pmem and btt
@ 2017-08-04  3:54           ` Minchan Kim
  0 siblings, 0 replies; 54+ messages in thread
From: Minchan Kim @ 2017-08-04  3:54 UTC (permalink / raw)
  To: Ross Zwisler
  Cc: Matthew Wilcox, Andrew Morton, linux-kernel, karam . lee,
	Jerome Marchand, Nitin Gupta, seungho1.park, Christoph Hellwig,
	Dan Williams, Dave Chinner, Jan Kara, Jens Axboe, Vishal Verma,
	linux-nvdimm

On Thu, Aug 03, 2017 at 03:13:35PM -0600, Ross Zwisler wrote:
> On Thu, Aug 03, 2017 at 09:13:15AM +0900, Minchan Kim wrote:
> > Hi Ross,
> > 
> > On Wed, Aug 02, 2017 at 04:13:59PM -0600, Ross Zwisler wrote:
> > > On Fri, Jul 28, 2017 at 10:31:43AM -0700, Matthew Wilcox wrote:
> > > > On Fri, Jul 28, 2017 at 10:56:01AM -0600, Ross Zwisler wrote:
> > > > > Dan Williams and Christoph Hellwig have recently expressed doubt about
> > > > > whether the rw_page() interface made sense for synchronous memory drivers
> > > > > [1][2].  It's unclear whether this interface has any performance benefit
> > > > > for these drivers, but as we continue to fix bugs it is clear that it does
> > > > > have a maintenance burden.  This series removes the rw_page()
> > > > > implementations in brd, pmem and btt to relieve this burden.
> > > > 
> > > > Why don't you measure whether it has performance benefits?  I don't
> > > > understand why zram would see performance benefits and not other drivers.
> > > > If it's going to be removed, then the whole interface should be removed,
> > > > not just have the implementations removed from some drivers.
> > > 
> > > Okay, I've run a bunch of performance tests with the PMEM and with BTT entry
> > > points for rw_pages() in a swap workload, and in all cases I do see an
> > > improvement over the code when rw_pages() is removed.  Here are the results
> > > from my random lab box:
> > > 
> > >   Average latency of swap_writepage()
> > > +------+------------+---------+-------------+
> > > |      | no rw_page | rw_page | Improvement |
> > > +-------------------------------------------+
> > > | PMEM |  5.0 us    |  4.7 us |     6%      |
> > > +-------------------------------------------+
> > > |  BTT |  6.8 us    |  6.1 us |    10%      |
> > > +------+------------+---------+-------------+
> > > 
> > >   Average latency of swap_readpage()
> > > +------+------------+---------+-------------+
> > > |      | no rw_page | rw_page | Improvement |
> > > +-------------------------------------------+
> > > | PMEM |  3.3 us    |  2.9 us |    12%      |
> > > +-------------------------------------------+
> > > |  BTT |  3.7 us    |  3.4 us |     8%      |
> > > +------+------------+---------+-------------+
> > > 
> > > The workload was pmbench, a memory benchmark, run on a system where I had
> > > severely restricted the amount of memory in the system with the 'mem' kernel
> > > command line parameter.  The benchmark was set up to test more memory than I
> > > allowed the OS to have so it spilled over into swap.
> > > 
> > > The PMEM or BTT device was set up as my swap device, and during the test I got
> > > a few hundred thousand samples of each of swap_writepage() and
> > > swap_writepage().  The PMEM/BTT device was just memory reserved with the
> > > memmap kernel command line parameter.
> > > 
> > > Thanks, Matthew, for asking for performance data.  It looks like removing this
> > > code would have been a mistake.
> > 
> > By suggestion of Christoph Hellwig, I made a quick patch which does IO without
> > dynamic bio allocation for swap IO. Actually, it's not formal patch to be
> > worth to send mainline yet but I believe it's enough to test the improvement.
> > 
> > Could you test patchset on pmem and btt without rw_page?
> > 
> > For working the patch, block drivers need to declare it's synchronous IO
> > device via BDI_CAP_SYNC but if it's hard, you can just make every swap IO
> > comes from (sis->flags & SWP_SYNC_IO) with removing condition check
> > 
> > if (!(sis->flags & SWP_SYNC_IO)) in swap_[read|write]page.
> > 
> > Patchset is based on 4.13-rc3.
> 
> Thanks for the patch, here are the updated results from my test box:
> 
>  Average latency of swap_writepage()
> +------+------------+---------+---------+
> |      | no rw_page | minchan | rw_page |
> +----------------------------------------
> | PMEM |  5.0 us    | 4.98 us |  4.7 us |
> +----------------------------------------
> |  BTT |  6.8 us    | 6.3 us  |  6.1 us |
> +------+------------+---------+---------+
>   				   
>  Average latency of swap_readpage()
> +------+------------+---------+---------+
> |      | no rw_page | minchan | rw_page |
> +----------------------------------------
> | PMEM |  3.3 us    | 3.27 us |  2.9 us |
> +----------------------------------------
> |  BTT |  3.7 us    | 3.44 us |  3.4 us |
> +------+------------+---------+---------+
> 
> I've added another digit in precision in some cases to help differentiate the
> various results.
> 
> In all cases your patches did perform better than with the regularly allocated
> BIO, but again for all cases the rw_page() path was the fastest, even if only
> marginally.

Thanks for the testing. Is your result within the noise level?

I cannot understand why PMEM doesn't see much gain while BTT is a significant
win (8%). I guess the no-rw_page BTT runs had more chances to wait on dynamic
bio allocation, and both my patch and the rw_page path reduced that
significantly. However, in the no-rw_page pmem runs there weren't many waits
on bio allocation because the device is so fast, so the difference comes
purely from the number of instructions executed. At a quick glance bio
init/submit is not trivial, so I do understand where the 12% enhancement
comes from, but I'm not sure it's a big enough difference in practice to be
worth the maintenance burden.

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [PATCH 0/3] remove rw_page() from brd, pmem and btt
  2017-08-04  3:54           ` Minchan Kim
@ 2017-08-04  8:17             ` Minchan Kim
  -1 siblings, 0 replies; 54+ messages in thread
From: Minchan Kim @ 2017-08-04  8:17 UTC (permalink / raw)
  To: Ross Zwisler
  Cc: Jens Axboe, Jerome Marchand, linux-nvdimm, Dave Chinner,
	linux-kernel, Matthew Wilcox, Christoph Hellwig, Jan Kara,
	Andrew Morton, karam . lee, seungho1.park, Nitin Gupta

On Fri, Aug 04, 2017 at 12:54:41PM +0900, Minchan Kim wrote:
> On Thu, Aug 03, 2017 at 03:13:35PM -0600, Ross Zwisler wrote:
> > On Thu, Aug 03, 2017 at 09:13:15AM +0900, Minchan Kim wrote:
> > > Hi Ross,
> > > 
> > > On Wed, Aug 02, 2017 at 04:13:59PM -0600, Ross Zwisler wrote:
> > > > On Fri, Jul 28, 2017 at 10:31:43AM -0700, Matthew Wilcox wrote:
> > > > > On Fri, Jul 28, 2017 at 10:56:01AM -0600, Ross Zwisler wrote:
> > > > > > Dan Williams and Christoph Hellwig have recently expressed doubt about
> > > > > > whether the rw_page() interface made sense for synchronous memory drivers
> > > > > > [1][2].  It's unclear whether this interface has any performance benefit
> > > > > > for these drivers, but as we continue to fix bugs it is clear that it does
> > > > > > have a maintenance burden.  This series removes the rw_page()
> > > > > > implementations in brd, pmem and btt to relieve this burden.
> > > > > 
> > > > > Why don't you measure whether it has performance benefits?  I don't
> > > > > understand why zram would see performance benefits and not other drivers.
> > > > > If it's going to be removed, then the whole interface should be removed,
> > > > > not just have the implementations removed from some drivers.
> > > > 
> > > > Okay, I've run a bunch of performance tests with the PMEM and with BTT entry
> > > > points for rw_pages() in a swap workload, and in all cases I do see an
> > > > improvement over the code when rw_pages() is removed.  Here are the results
> > > > from my random lab box:
> > > > 
> > > >   Average latency of swap_writepage()
> > > > +------+------------+---------+-------------+
> > > > |      | no rw_page | rw_page | Improvement |
> > > > +-------------------------------------------+
> > > > | PMEM |  5.0 us    |  4.7 us |     6%      |
> > > > +-------------------------------------------+
> > > > |  BTT |  6.8 us    |  6.1 us |    10%      |
> > > > +------+------------+---------+-------------+
> > > > 
> > > >   Average latency of swap_readpage()
> > > > +------+------------+---------+-------------+
> > > > |      | no rw_page | rw_page | Improvement |
> > > > +-------------------------------------------+
> > > > | PMEM |  3.3 us    |  2.9 us |    12%      |
> > > > +-------------------------------------------+
> > > > |  BTT |  3.7 us    |  3.4 us |     8%      |
> > > > +------+------------+---------+-------------+
> > > > 
> > > > The workload was pmbench, a memory benchmark, run on a system where I had
> > > > severely restricted the amount of memory in the system with the 'mem' kernel
> > > > command line parameter.  The benchmark was set up to test more memory than I
> > > > allowed the OS to have so it spilled over into swap.
> > > > 
> > > > The PMEM or BTT device was set up as my swap device, and during the test I got
> > > > a few hundred thousand samples of each of swap_writepage() and
> > > > swap_writepage().  The PMEM/BTT device was just memory reserved with the
> > > > memmap kernel command line parameter.
> > > > 
> > > > Thanks, Matthew, for asking for performance data.  It looks like removing this
> > > > code would have been a mistake.
> > > 
> > > By suggestion of Christoph Hellwig, I made a quick patch which does IO without
> > > dynamic bio allocation for swap IO. Actually, it's not formal patch to be
> > > worth to send mainline yet but I believe it's enough to test the improvement.
> > > 
> > > Could you test patchset on pmem and btt without rw_page?
> > > 
> > > For working the patch, block drivers need to declare it's synchronous IO
> > > device via BDI_CAP_SYNC but if it's hard, you can just make every swap IO
> > > comes from (sis->flags & SWP_SYNC_IO) with removing condition check
> > > 
> > > if (!(sis->flags & SWP_SYNC_IO)) in swap_[read|write]page.
> > > 
> > > Patchset is based on 4.13-rc3.
> > 
> > Thanks for the patch, here are the updated results from my test box:
> > 
> >  Average latency of swap_writepage()
> > +------+------------+---------+---------+
> > |      | no rw_page | minchan | rw_page |
> > +----------------------------------------
> > | PMEM |  5.0 us    | 4.98 us |  4.7 us |
> > +----------------------------------------
> > |  BTT |  6.8 us    | 6.3 us  |  6.1 us |
> > +------+------------+---------+---------+
> >   				   
> >  Average latency of swap_readpage()
> > +------+------------+---------+---------+
> > |      | no rw_page | minchan | rw_page |
> > +----------------------------------------
> > | PMEM |  3.3 us    | 3.27 us |  2.9 us |
> > +----------------------------------------
> > |  BTT |  3.7 us    | 3.44 us |  3.4 us |
> > +------+------------+---------+---------+
> > 
> > I've added another digit in precision in some cases to help differentiate the
> > various results.
> > 
> > In all cases your patches did perform better than with the regularly allocated
> > BIO, but again for all cases the rw_page() path was the fastest, even if only
> > marginally.
> 
> Thanks for the testing. Your testing number is within noise level?
> 
> I cannot understand why PMEM doesn't have enough gain while BTT is significant
> win(8%). I guess no rw_page with BTT testing had more chances to wait bio dynamic
> allocation and mine and rw_page testing reduced it significantly. However,
> in no rw_page with pmem, there wasn't many cases to wait bio allocations due
> to the device is so fast so the number comes from purely the number of
> instructions has done. At a quick glance of bio init/submit, it's not trivial
> so indeed, i understand where the 12% enhancement comes from but I'm not sure
> it's really big difference in real practice at the cost of maintaince burden.

I tested pmbench 10 times on my local machine (4 cores) with zram-swap.
On my machine the on-stack bio is even faster than rw_page. Unbelievable.

I guess it's really hard to get stable results under severe memory pressure.
The numbers are probably within the noise level (see the stddev below), so I
think it's hard to conclude that rw_page is much faster than the on-stack bio.

rw_page
avg     5.54us
stddev  8.89%
max     6.02us
min     4.20us

onstack bio
avg     5.27us
stddev  13.03%
max     5.96us
min     3.55us
_______________________________________________
Linux-nvdimm mailing list
Linux-nvdimm@lists.01.org
https://lists.01.org/mailman/listinfo/linux-nvdimm

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [PATCH 0/3] remove rw_page() from brd, pmem and btt
@ 2017-08-04  8:17             ` Minchan Kim
  0 siblings, 0 replies; 54+ messages in thread
From: Minchan Kim @ 2017-08-04  8:17 UTC (permalink / raw)
  To: Ross Zwisler
  Cc: Matthew Wilcox, Andrew Morton, linux-kernel, karam . lee,
	Jerome Marchand, Nitin Gupta, seungho1.park, Christoph Hellwig,
	Dan Williams, Dave Chinner, Jan Kara, Jens Axboe, Vishal Verma,
	linux-nvdimm

On Fri, Aug 04, 2017 at 12:54:41PM +0900, Minchan Kim wrote:
> On Thu, Aug 03, 2017 at 03:13:35PM -0600, Ross Zwisler wrote:
> > On Thu, Aug 03, 2017 at 09:13:15AM +0900, Minchan Kim wrote:
> > > Hi Ross,
> > > 
> > > On Wed, Aug 02, 2017 at 04:13:59PM -0600, Ross Zwisler wrote:
> > > > On Fri, Jul 28, 2017 at 10:31:43AM -0700, Matthew Wilcox wrote:
> > > > > On Fri, Jul 28, 2017 at 10:56:01AM -0600, Ross Zwisler wrote:
> > > > > > Dan Williams and Christoph Hellwig have recently expressed doubt about
> > > > > > whether the rw_page() interface made sense for synchronous memory drivers
> > > > > > [1][2].  It's unclear whether this interface has any performance benefit
> > > > > > for these drivers, but as we continue to fix bugs it is clear that it does
> > > > > > have a maintenance burden.  This series removes the rw_page()
> > > > > > implementations in brd, pmem and btt to relieve this burden.
> > > > > 
> > > > > Why don't you measure whether it has performance benefits?  I don't
> > > > > understand why zram would see performance benefits and not other drivers.
> > > > > If it's going to be removed, then the whole interface should be removed,
> > > > > not just have the implementations removed from some drivers.
> > > > 
> > > > Okay, I've run a bunch of performance tests with the PMEM and with BTT entry
> > > > points for rw_pages() in a swap workload, and in all cases I do see an
> > > > improvement over the code when rw_pages() is removed.  Here are the results
> > > > from my random lab box:
> > > > 
> > > >   Average latency of swap_writepage()
> > > > +------+------------+---------+-------------+
> > > > |      | no rw_page | rw_page | Improvement |
> > > > +-------------------------------------------+
> > > > | PMEM |  5.0 us    |  4.7 us |     6%      |
> > > > +-------------------------------------------+
> > > > |  BTT |  6.8 us    |  6.1 us |    10%      |
> > > > +------+------------+---------+-------------+
> > > > 
> > > >   Average latency of swap_readpage()
> > > > +------+------------+---------+-------------+
> > > > |      | no rw_page | rw_page | Improvement |
> > > > +-------------------------------------------+
> > > > | PMEM |  3.3 us    |  2.9 us |    12%      |
> > > > +-------------------------------------------+
> > > > |  BTT |  3.7 us    |  3.4 us |     8%      |
> > > > +------+------------+---------+-------------+
> > > > 
> > > > The workload was pmbench, a memory benchmark, run on a system where I had
> > > > severely restricted the amount of memory in the system with the 'mem' kernel
> > > > command line parameter.  The benchmark was set up to test more memory than I
> > > > allowed the OS to have so it spilled over into swap.
> > > > 
> > > > The PMEM or BTT device was set up as my swap device, and during the test I got
> > > > a few hundred thousand samples of each of swap_writepage() and
> > > > swap_writepage().  The PMEM/BTT device was just memory reserved with the
> > > > memmap kernel command line parameter.
> > > > 
> > > > Thanks, Matthew, for asking for performance data.  It looks like removing this
> > > > code would have been a mistake.
> > > 
> > > By suggestion of Christoph Hellwig, I made a quick patch which does IO without
> > > dynamic bio allocation for swap IO. Actually, it's not formal patch to be
> > > worth to send mainline yet but I believe it's enough to test the improvement.
> > > 
> > > Could you test patchset on pmem and btt without rw_page?
> > > 
> > > For working the patch, block drivers need to declare it's synchronous IO
> > > device via BDI_CAP_SYNC but if it's hard, you can just make every swap IO
> > > comes from (sis->flags & SWP_SYNC_IO) with removing condition check
> > > 
> > > if (!(sis->flags & SWP_SYNC_IO)) in swap_[read|write]page.
> > > 
> > > Patchset is based on 4.13-rc3.
> > 
> > Thanks for the patch, here are the updated results from my test box:
> > 
> >  Average latency of swap_writepage()
> > +------+------------+---------+---------+
> > |      | no rw_page | minchan | rw_page |
> > +----------------------------------------
> > | PMEM |  5.0 us    | 4.98 us |  4.7 us |
> > +----------------------------------------
> > |  BTT |  6.8 us    | 6.3 us  |  6.1 us |
> > +------+------------+---------+---------+
> >   				   
> >  Average latency of swap_readpage()
> > +------+------------+---------+---------+
> > |      | no rw_page | minchan | rw_page |
> > +----------------------------------------
> > | PMEM |  3.3 us    | 3.27 us |  2.9 us |
> > +----------------------------------------
> > |  BTT |  3.7 us    | 3.44 us |  3.4 us |
> > +------+------------+---------+---------+
> > 
> > I've added another digit in precision in some cases to help differentiate the
> > various results.
> > 
> > In all cases your patches did perform better than with the regularly allocated
> > BIO, but again for all cases the rw_page() path was the fastest, even if only
> > marginally.
> 
> Thanks for the testing. Your testing number is within noise level?
> 
> I cannot understand why PMEM doesn't have enough gain while BTT is significant
> win(8%). I guess no rw_page with BTT testing had more chances to wait bio dynamic
> allocation and mine and rw_page testing reduced it significantly. However,
> in no rw_page with pmem, there wasn't many cases to wait bio allocations due
> to the device is so fast so the number comes from purely the number of
> instructions has done. At a quick glance of bio init/submit, it's not trivial
> so indeed, i understand where the 12% enhancement comes from but I'm not sure
> it's really big difference in real practice at the cost of maintaince burden.

I tested pmbench 10 times on my local machine (4 cores) with zram-swap.
On my machine the on-stack bio is even faster than rw_page. Unbelievable.

I guess it's really hard to get stable results under severe memory pressure.
The numbers are probably within the noise level (see the stddev below), so I
think it's hard to conclude that rw_page is much faster than the on-stack bio.

rw_page
avg     5.54us
stddev  8.89%
max     6.02us
min     4.20us

onstack bio
avg     5.27us
stddev  13.03%
max     5.96us
min     3.55us

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [PATCH 0/3] remove rw_page() from brd, pmem and btt
  2017-08-04  8:17             ` Minchan Kim
@ 2017-08-04 18:01               ` Dan Williams
  -1 siblings, 0 replies; 54+ messages in thread
From: Dan Williams @ 2017-08-04 18:01 UTC (permalink / raw)
  To: Minchan Kim
  Cc: Jens Axboe, Jerome Marchand, linux-nvdimm, Dave Chinner,
	linux-kernel, Matthew Wilcox, Christoph Hellwig, seungho1.park,
	Jan Kara, karam . lee, Andrew Morton, Nitin Gupta

[ adding Dave who is working on a blk-mq + dma offload version of the
pmem driver ]

On Fri, Aug 4, 2017 at 1:17 AM, Minchan Kim <minchan@kernel.org> wrote:
> On Fri, Aug 04, 2017 at 12:54:41PM +0900, Minchan Kim wrote:
[..]
>> Thanks for the testing. Your testing number is within noise level?
>>
>> I cannot understand why PMEM doesn't have enough gain while BTT is significant
>> win(8%). I guess no rw_page with BTT testing had more chances to wait bio dynamic
>> allocation and mine and rw_page testing reduced it significantly. However,
>> in no rw_page with pmem, there wasn't many cases to wait bio allocations due
>> to the device is so fast so the number comes from purely the number of
>> instructions has done. At a quick glance of bio init/submit, it's not trivial
>> so indeed, i understand where the 12% enhancement comes from but I'm not sure
>> it's really big difference in real practice at the cost of maintaince burden.
>
> I tested pmbench 10 times in my local machine(4 core) with zram-swap.
> In my machine, even, on-stack bio is faster than rw_page. Unbelievable.
>
> I guess it's really hard to get stable result in severe memory pressure.
> It would be a result within noise level(see below stddev).
> So, I think it's hard to conclude rw_page is far faster than onstack-bio.
>
> rw_page
> avg     5.54us
> stddev  8.89%
> max     6.02us
> min     4.20us
>
> onstack bio
> avg     5.27us
> stddev  13.03%
> max     5.96us
> min     3.55us

The maintenance burden of having alternative submission paths is
significant, especially as we consider the pmem driver using more
services of the core block layer. Ideally, I'd want to complete the
rw_page removal work before we look at the blk-mq + dma offload
reworks.

The change to introduce BDI_CAP_SYNC is interesting because we might
have a use for switching between dma offload and cpu copy based on
whether the I/O is synchronous or otherwise hinted to be a low-latency
request. Right now the dma offload patches are using "bio_segments() >
1" as the gate for selecting offload vs cpu copy, which seems
inadequate.
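
Concretely, the kind of per-bio gate I mean, as a rough sketch (hypothetical
helper, not code from the offload patches):

/*
 * Rough sketch, not from the dma offload patches: choose between dma
 * offload and cpu copy per bio.  The current gate is just
 * bio_segments() > 1; a sync / low-latency hint could refine it.
 */
static bool pmem_use_dma_offload(struct bio *bio)
{
	/* keep latency-sensitive IO on the cpu copy path */
	if (bio->bi_opf & REQ_SYNC)
		return false;

	/* current heuristic: only offload multi-segment bios */
	return bio_segments(bio) > 1;
}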
_______________________________________________
Linux-nvdimm mailing list
Linux-nvdimm@lists.01.org
https://lists.01.org/mailman/listinfo/linux-nvdimm

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [PATCH 0/3] remove rw_page() from brd, pmem and btt
@ 2017-08-04 18:01               ` Dan Williams
  0 siblings, 0 replies; 54+ messages in thread
From: Dan Williams @ 2017-08-04 18:01 UTC (permalink / raw)
  To: Minchan Kim
  Cc: Ross Zwisler, Matthew Wilcox, Andrew Morton, linux-kernel,
	karam . lee, Jerome Marchand, Nitin Gupta, seungho1.park,
	Christoph Hellwig, Dave Chinner, Jan Kara, Jens Axboe,
	Vishal Verma, linux-nvdimm, Dave Jiang

[ adding Dave who is working on a blk-mq + dma offload version of the
pmem driver ]

On Fri, Aug 4, 2017 at 1:17 AM, Minchan Kim <minchan@kernel.org> wrote:
> On Fri, Aug 04, 2017 at 12:54:41PM +0900, Minchan Kim wrote:
[..]
>> Thanks for the testing. Your testing number is within noise level?
>>
>> I cannot understand why PMEM doesn't have enough gain while BTT is significant
>> win(8%). I guess no rw_page with BTT testing had more chances to wait bio dynamic
>> allocation and mine and rw_page testing reduced it significantly. However,
>> in no rw_page with pmem, there wasn't many cases to wait bio allocations due
>> to the device is so fast so the number comes from purely the number of
>> instructions has done. At a quick glance of bio init/submit, it's not trivial
>> so indeed, i understand where the 12% enhancement comes from but I'm not sure
>> it's really big difference in real practice at the cost of maintaince burden.
>
> I tested pmbench 10 times in my local machine(4 core) with zram-swap.
> In my machine, even, on-stack bio is faster than rw_page. Unbelievable.
>
> I guess it's really hard to get stable result in severe memory pressure.
> It would be a result within noise level(see below stddev).
> So, I think it's hard to conclude rw_page is far faster than onstack-bio.
>
> rw_page
> avg     5.54us
> stddev  8.89%
> max     6.02us
> min     4.20us
>
> onstack bio
> avg     5.27us
> stddev  13.03%
> max     5.96us
> min     3.55us

The maintenance burden of having alternative submission paths is
significant, especially as we consider the pmem driver using more
services of the core block layer. Ideally, I'd want to complete the
rw_page removal work before we look at the blk-mq + dma offload
reworks.

The change to introduce BDI_CAP_SYNC is interesting because we might
have a use for switching between dma offload and cpu copy based on
whether the I/O is synchronous or otherwise hinted to be a low-latency
request. Right now the dma offload patches are using "bio_segments() >
1" as the gate for selecting offload vs cpu copy, which seems
inadequate.

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [PATCH 0/3] remove rw_page() from brd, pmem and btt
  2017-08-04 18:01               ` Dan Williams
@ 2017-08-04 18:21                 ` Ross Zwisler
  -1 siblings, 0 replies; 54+ messages in thread
From: Ross Zwisler @ 2017-08-04 18:21 UTC (permalink / raw)
  To: Dan Williams
  Cc: Jens Axboe, Jerome Marchand, linux-nvdimm, Dave Chinner,
	linux-kernel, Matthew Wilcox, Christoph Hellwig, Minchan Kim,
	seungho1.park, Jan Kara, karam . lee, Andrew Morton, Nitin Gupta

On Fri, Aug 04, 2017 at 11:01:08AM -0700, Dan Williams wrote:
> [ adding Dave who is working on a blk-mq + dma offload version of the
> pmem driver ]
> 
> On Fri, Aug 4, 2017 at 1:17 AM, Minchan Kim <minchan@kernel.org> wrote:
> > On Fri, Aug 04, 2017 at 12:54:41PM +0900, Minchan Kim wrote:
> [..]
> >> Thanks for the testing. Your testing number is within noise level?
> >>
> >> I cannot understand why PMEM doesn't have enough gain while BTT is significant
> >> win(8%). I guess no rw_page with BTT testing had more chances to wait bio dynamic
> >> allocation and mine and rw_page testing reduced it significantly. However,
> >> in no rw_page with pmem, there wasn't many cases to wait bio allocations due
> >> to the device is so fast so the number comes from purely the number of
> >> instructions has done. At a quick glance of bio init/submit, it's not trivial
> >> so indeed, i understand where the 12% enhancement comes from but I'm not sure
> >> it's really big difference in real practice at the cost of maintaince burden.
> >
> > I tested pmbench 10 times in my local machine(4 core) with zram-swap.
> > In my machine, even, on-stack bio is faster than rw_page. Unbelievable.
> >
> > I guess it's really hard to get stable result in severe memory pressure.
> > It would be a result within noise level(see below stddev).
> > So, I think it's hard to conclude rw_page is far faster than onstack-bio.
> >
> > rw_page
> > avg     5.54us
> > stddev  8.89%
> > max     6.02us
> > min     4.20us
> >
> > onstack bio
> > avg     5.27us
> > stddev  13.03%
> > max     5.96us
> > min     3.55us
> 
> The maintenance burden of having alternative submission paths is
> significant, especially as we consider the pmem driver using more
> services of the core block layer. Ideally, I'd want to complete the
> rw_page removal work before we look at the blk-mq + dma offload
> reworks.
> 
> The change to introduce BDI_CAP_SYNC is interesting because we might
> have a use for switching between dma offload and cpu copy based on
> whether the I/O is synchronous or otherwise hinted to be a low-latency
> request. Right now the dma offload patches are using "bio_segments() >
> 1" as the gate for selecting offload vs cpu copy, which seems
> inadequate.

Okay, so based on the feedback above and from Jens[1], it sounds like we want
to go forward with removing the rw_page() interface, and instead optimize the
regular I/O path via on-stack BIOs and dma offload, correct?

If so, I'll prepare patches that fully remove the rw_page() code, and let
Minchan and Dave work on their optimizations.

[1]: https://lkml.org/lkml/2017/8/3/803
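
For reference, a minimal sketch of what an on-stack bio submission for a
single page could look like with roughly v4.13-era block APIs (illustrative
only, not the actual on-stack-bio patch from this thread; read_page_sync()
is an invented helper name):

  /*
   * The bio and its single bio_vec live on the caller's stack, so no
   * bio_alloc()/mempool allocation can block under memory pressure.
   */
  #include <linux/bio.h>
  #include <linux/blkdev.h>

  static int read_page_sync(struct block_device *bdev, sector_t sector,
                            struct page *page)
  {
          struct bio_vec bvec;
          struct bio bio;

          bio_init(&bio, &bvec, 1);
          bio.bi_bdev = bdev;
          bio.bi_iter.bi_sector = sector;
          bio_set_op_attrs(&bio, REQ_OP_READ, REQ_SYNC);
          bio_add_page(&bio, page, PAGE_SIZE, 0);

          /* Submit and wait for completion without a dynamically allocated bio. */
          return submit_bio_wait(&bio);
  }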

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [PATCH 0/3] remove rw_page() from brd, pmem and btt
  2017-08-04 18:21                 ` Ross Zwisler
@ 2017-08-04 18:24                   ` Dan Williams
  -1 siblings, 0 replies; 54+ messages in thread
From: Dan Williams @ 2017-08-04 18:24 UTC (permalink / raw)
  To: Ross Zwisler
  Cc: Jens Axboe, Jerome Marchand, linux-nvdimm, Dave Chinner,
	linux-kernel, Matthew Wilcox, Christoph Hellwig, Minchan Kim,
	Jan Kara, Andrew Morton, karam . lee, seungho1.park, Nitin Gupta

On Fri, Aug 4, 2017 at 11:21 AM, Ross Zwisler
<ross.zwisler@linux.intel.com> wrote:
> On Fri, Aug 04, 2017 at 11:01:08AM -0700, Dan Williams wrote:
>> [ adding Dave who is working on a blk-mq + dma offload version of the
>> pmem driver ]
>>
>> On Fri, Aug 4, 2017 at 1:17 AM, Minchan Kim <minchan@kernel.org> wrote:
>> > On Fri, Aug 04, 2017 at 12:54:41PM +0900, Minchan Kim wrote:
>> [..]
>> >> Thanks for the testing. Are your test numbers within the noise level?
>> >>
>> >> I cannot understand why PMEM doesn't see much gain while BTT is a significant
>> >> win (8%). I guess the no-rw_page BTT test had more chances to wait for dynamic
>> >> bio allocation, and both my on-stack bio and the rw_page tests reduced that
>> >> significantly. However, in the no-rw_page pmem test there weren't many cases
>> >> waiting for bio allocation, because the device is so fast, so the number comes
>> >> purely from the number of instructions executed. At a quick glance at bio
>> >> init/submit, that path is not trivial, so I understand where the 12%
>> >> enhancement comes from, but I'm not sure it's a big enough difference in
>> >> practice to justify the maintenance burden.
>> >
>> > I ran pmbench 10 times on my local machine (4 cores) with zram-swap.
>> > On my machine the on-stack bio is even faster than rw_page. Unbelievable.
>> >
>> > I guess it's really hard to get stable results under severe memory pressure.
>> > The result is probably within the noise level (see the stddev below).
>> > So I think it's hard to conclude that rw_page is far faster than the on-stack bio.
>> >
>> > rw_page
>> > avg     5.54us
>> > stddev  8.89%
>> > max     6.02us
>> > min     4.20us
>> >
>> > onstack bio
>> > avg     5.27us
>> > stddev  13.03%
>> > max     5.96us
>> > min     3.55us
>>
>> The maintenance burden of having alternative submission paths is
>> significant, especially as we consider the pmem driver using more
>> services of the core block layer. Ideally, I'd want to complete the
>> rw_page removal work before we look at the blk-mq + dma offload
>> reworks.
>>
>> The change to introduce BDI_CAP_SYNC is interesting because we might
>> have a use for switching between dma offload and cpu copy based on
>> whether the I/O is synchronous or otherwise hinted to be a low-latency
>> request. Right now the dma offload patches are using "bio_segments() >
>> 1" as the gate for selecting offload vs cpu copy, which seems
>> inadequate.
>
> Okay, so based on the feedback above and from Jens[1], it sounds like we want
> to go forward with removing the rw_page() interface, and instead optimize the
> regular I/O path via on-stack BIOs and dma offload, correct?
>
> If so, I'll prepare patches that fully remove the rw_page() code, and let
> Minchan and Dave work on their optimizations.

I think the conversion to on-stack-bio should be done in the same
patchset that removes rw_page. We don't want to leave a known
performance regression while the on-stack-bio work is in-flight.

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [PATCH 0/3] remove rw_page() from brd, pmem and btt
  2017-08-04 18:24                   ` Dan Williams
@ 2017-08-07  8:23                     ` Minchan Kim
  -1 siblings, 0 replies; 54+ messages in thread
From: Minchan Kim @ 2017-08-07  8:23 UTC (permalink / raw)
  To: Dan Williams
  Cc: Jens Axboe, Jerome Marchand, linux-nvdimm, Dave Chinner,
	linux-kernel, Matthew Wilcox, Christoph Hellwig, seungho1.park,
	Jan Kara, karam . lee, Andrew Morton, Nitin Gupta

On Fri, Aug 04, 2017 at 11:24:49AM -0700, Dan Williams wrote:
> On Fri, Aug 4, 2017 at 11:21 AM, Ross Zwisler
> <ross.zwisler@linux.intel.com> wrote:
> > On Fri, Aug 04, 2017 at 11:01:08AM -0700, Dan Williams wrote:
> >> [ adding Dave who is working on a blk-mq + dma offload version of the
> >> pmem driver ]
> >>
> >> On Fri, Aug 4, 2017 at 1:17 AM, Minchan Kim <minchan@kernel.org> wrote:
> >> > On Fri, Aug 04, 2017 at 12:54:41PM +0900, Minchan Kim wrote:
> >> [..]
> >> >> Thanks for the testing. Are your test numbers within the noise level?
> >> >>
> >> >> I cannot understand why PMEM doesn't see much gain while BTT is a significant
> >> >> win (8%). I guess the no-rw_page BTT test had more chances to wait for dynamic
> >> >> bio allocation, and both my on-stack bio and the rw_page tests reduced that
> >> >> significantly. However, in the no-rw_page pmem test there weren't many cases
> >> >> waiting for bio allocation, because the device is so fast, so the number comes
> >> >> purely from the number of instructions executed. At a quick glance at bio
> >> >> init/submit, that path is not trivial, so I understand where the 12%
> >> >> enhancement comes from, but I'm not sure it's a big enough difference in
> >> >> practice to justify the maintenance burden.
> >> >
> >> > I ran pmbench 10 times on my local machine (4 cores) with zram-swap.
> >> > On my machine the on-stack bio is even faster than rw_page. Unbelievable.
> >> >
> >> > I guess it's really hard to get stable results under severe memory pressure.
> >> > The result is probably within the noise level (see the stddev below).
> >> > So I think it's hard to conclude that rw_page is far faster than the on-stack bio.
> >> >
> >> > rw_page
> >> > avg     5.54us
> >> > stddev  8.89%
> >> > max     6.02us
> >> > min     4.20us
> >> >
> >> > onstack bio
> >> > avg     5.27us
> >> > stddev  13.03%
> >> > max     5.96us
> >> > min     3.55us
> >>
> >> The maintenance burden of having alternative submission paths is
> >> significant, especially as we consider the pmem driver using more
> >> services of the core block layer. Ideally, I'd want to complete the
> >> rw_page removal work before we look at the blk-mq + dma offload
> >> reworks.
> >>
> >> The change to introduce BDI_CAP_SYNC is interesting because we might
> >> have a use for switching between dma offload and cpu copy based on
> >> whether the I/O is synchronous or otherwise hinted to be a low-latency
> >> request. Right now the dma offload patches are using "bio_segments() >
> >> 1" as the gate for selecting offload vs cpu copy, which seems
> >> inadequate.
> >
> > Okay, so based on the feedback above and from Jens[1], it sounds like we want
> > to go forward with removing the rw_page() interface, and instead optimize the
> > regular I/O path via on-stack BIOs and dma offload, correct?
> >
> > If so, I'll prepare patches that fully remove the rw_page() code, and let
> > Minchan and Dave work on their optimizations.
> 
> I think the conversion to on-stack-bio should be done in the same
> patchset that removes rw_page. We don't want to leave a known
> performance regression while the on-stack-bio work is in-flight.

Okay. It seems everyone agrees on the on-stack-bio approach.
I will send my formal patchset, including Ross's patches that
remove rw_page.

Thanks.

^ permalink raw reply	[flat|nested] 54+ messages in thread

end of thread, other threads:[~2017-08-07  8:23 UTC | newest]

Thread overview: 54+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2017-07-28 16:56 [PATCH 0/3] remove rw_page() from brd, pmem and btt Ross Zwisler
2017-07-28 16:56 ` Ross Zwisler
2017-07-28 16:56 ` [PATCH 1/3] btt: remove btt_rw_page() Ross Zwisler
2017-07-28 16:56   ` Ross Zwisler
2017-08-03 16:15   ` kbuild test robot
2017-08-03 16:15     ` kbuild test robot
2017-07-28 16:56 ` [PATCH 2/3] pmem: remove pmem_rw_page() Ross Zwisler
2017-07-28 16:56   ` Ross Zwisler
2017-07-28 16:56 ` [PATCH 3/3] brd: remove brd_rw_page() Ross Zwisler
2017-07-28 16:56   ` Ross Zwisler
2017-07-28 17:31 ` [PATCH 0/3] remove rw_page() from brd, pmem and btt Matthew Wilcox
2017-07-28 17:31   ` Matthew Wilcox
2017-07-28 21:21   ` Andrew Morton
2017-07-28 21:21     ` Andrew Morton
2017-07-30 22:16     ` Minchan Kim
2017-07-30 22:16       ` Minchan Kim
2017-07-30 22:38       ` Minchan Kim
2017-07-30 22:38         ` Minchan Kim
2017-07-31  7:17       ` Christoph Hellwig
2017-07-31  7:17         ` Christoph Hellwig
2017-07-31  7:36         ` Minchan Kim
2017-07-31  7:36           ` Minchan Kim
2017-07-31  7:42           ` Christoph Hellwig
2017-07-31  7:42             ` Christoph Hellwig
2017-07-31  7:44             ` Christoph Hellwig
2017-07-31  7:44               ` Christoph Hellwig
2017-08-01  6:23               ` Minchan Kim
2017-08-01  6:23                 ` Minchan Kim
2017-08-02 22:13   ` Ross Zwisler
2017-08-02 22:13     ` Ross Zwisler
2017-08-03  0:13     ` Minchan Kim
2017-08-03  0:13       ` Minchan Kim
2017-08-03  0:34       ` Dan Williams
2017-08-03  0:34         ` Dan Williams
2017-08-03  8:05       ` Christoph Hellwig
2017-08-03  8:05         ` Christoph Hellwig
2017-08-04  0:57         ` Minchan Kim
2017-08-04  0:57           ` Minchan Kim
2017-08-03 21:13       ` Ross Zwisler
2017-08-03 21:13         ` Ross Zwisler
2017-08-03 21:17         ` Jens Axboe
2017-08-03 21:17           ` Jens Axboe
2017-08-04  3:54         ` Minchan Kim
2017-08-04  3:54           ` Minchan Kim
2017-08-04  8:17           ` Minchan Kim
2017-08-04  8:17             ` Minchan Kim
2017-08-04 18:01             ` Dan Williams
2017-08-04 18:01               ` Dan Williams
2017-08-04 18:21               ` Ross Zwisler
2017-08-04 18:21                 ` Ross Zwisler
2017-08-04 18:24                 ` Dan Williams
2017-08-04 18:24                   ` Dan Williams
2017-08-07  8:23                   ` Minchan Kim
2017-08-07  8:23                     ` Minchan Kim
