* [PATCH 00/17] bcache patches for Linux v5.6
@ 2020-01-23 17:01 colyli
  2020-01-23 17:01 ` [PATCH 01/17] bcache: cached_dev_free needs to put the sb page colyli
                   ` (16 more replies)
  0 siblings, 17 replies; 24+ messages in thread
From: colyli @ 2020-01-23 17:01 UTC (permalink / raw)
  To: axboe; +Cc: linux-bcache, linux-block, Coly Li

From: Coly Li <colyli@suse.de>

Hi Jens,

Here are the bcache patches for Linux v5.6.
In this series, we have 5 contributors including me,
- Ben Dooks and Guoju Fang contribute minor changes.
- Liang Chen contributes a patch to fix a bcache super block page leak
  issue.
- Christoph Hellwig contributes a set of changes to improve the on-disk
  format for super block I/O; his patches plus Liang Chen's patch make
  bcache support kernel page sizes larger than 4KB. Now bcache can work
  on machines (e.g. ARM64) with 8KB or larger kernel pages.
- I fix some minor issues in Christoph's patches, and make bcache shrink
  its btree node cache more aggressively when system memory is under
  pressure. There are also two important fixes for performance
  regressions (Cc to Linux stable),
  - Avoid extra I/O caused by aggressively flushing dirty btree node
    pages in the bcache journal code.
  - Back to caching all readahead I/Os, which benefits small random I/Os.

There are still some patches under testing. Once they pass my testing,
I will submit them in the following v5.6-rc versions.

Please take these patches, and thank you in advance.

Coly Li
---

Ben Dooks (Codethink) (1):
  lib: crc64: include <linux/crc64.h> for 'crc64_be'

Christoph Hellwig (6):
  bcache: use a separate data structure for the on-disk super block
  bcache: rework error unwinding in register_bcache
  bcache: transfer the sb_page reference to register_{bdev,cache}
  bcache: return a pointer to the on-disk sb from read_super
  bcache: store a pointer to the on-disk sb in the cache and cached_dev
    structures
  bcache: use read_cache_page_gfp to read the superblock

Coly Li (8):
  bcache: properly initialize 'path' and 'err' in register_bcache()
  bcache: fix use-after-free in register_bcache()
  bcache: add code comments for state->pool in __btree_sort()
  bcache: avoid unnecessary btree nodes flushing in btree_flush_write()
  bcache: back to cache all readahead I/Os
  bcache: remove member accessed from struct btree
  bcache: reap c->btree_cache_freeable from the tail in bch_mca_scan()
  bcache: reap from tail of c->btree_cache in bch_mca_scan()

Guoju Fang (1):
  bcache: print written and keys in trace_bcache_btree_write

Liang Chen (1):
  bcache: cached_dev_free needs to put the sb page

 drivers/md/bcache/bcache.h    |   2 +
 drivers/md/bcache/bset.c      |   5 ++
 drivers/md/bcache/btree.c     |  24 ++++----
 drivers/md/bcache/btree.h     |   2 -
 drivers/md/bcache/journal.c   |  80 +++++++++++++++++++++++--
 drivers/md/bcache/request.c   |   9 ---
 drivers/md/bcache/super.c     | 136 ++++++++++++++++++++++--------------------
 include/trace/events/bcache.h |   3 +-
 include/uapi/linux/bcache.h   |  52 ++++++++++++++++
 lib/crc64.c                   |   1 +
 10 files changed, 218 insertions(+), 96 deletions(-)

-- 
2.16.4


^ permalink raw reply	[flat|nested] 24+ messages in thread

* [PATCH 01/17] bcache: cached_dev_free needs to put the sb page
  2020-01-23 17:01 [PATCH 00/17] bcache patches for Linux v5.6 colyli
@ 2020-01-23 17:01 ` colyli
  2020-01-23 17:01 ` [PATCH 02/17] bcache: use a separate data structure for the on-disk super block colyli
                   ` (15 subsequent siblings)
  16 siblings, 0 replies; 24+ messages in thread
From: colyli @ 2020-01-23 17:01 UTC (permalink / raw)
  To: axboe; +Cc: linux-bcache, linux-block, Liang Chen, Christoph Hellwig, Coly Li

From: Liang Chen <liangchen.linux@gmail.com>

As with the cache device, the superblock buffer page needs to be put
when freeing cached_dev.  Otherwise a page is leaked every time a
cached_dev is stopped.

Signed-off-by: Liang Chen <liangchen.linux@gmail.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Coly Li <colyli@suse.de>
---
 drivers/md/bcache/super.c | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/drivers/md/bcache/super.c b/drivers/md/bcache/super.c
index 77e9869345e7..a573ce1d85aa 100644
--- a/drivers/md/bcache/super.c
+++ b/drivers/md/bcache/super.c
@@ -1275,6 +1275,9 @@ static void cached_dev_free(struct closure *cl)
 
 	mutex_unlock(&bch_register_lock);
 
+	if (dc->sb_bio.bi_inline_vecs[0].bv_page)
+		put_page(bio_first_page_all(&dc->sb_bio));
+
 	if (!IS_ERR_OR_NULL(dc->bdev))
 		blkdev_put(dc->bdev, FMODE_READ|FMODE_WRITE|FMODE_EXCL);
 
-- 
2.16.4


^ permalink raw reply related	[flat|nested] 24+ messages in thread

* [PATCH 02/17] bcache: use a separate data structure for the on-disk super block
  2020-01-23 17:01 [PATCH 00/17] bcache patches for Linux v5.6 colyli
  2020-01-23 17:01 ` [PATCH 01/17] bcache: cached_dev_free needs to put the sb page colyli
@ 2020-01-23 17:01 ` colyli
  2020-01-23 17:01 ` [PATCH 03/17] bcache: rework error unwinding in register_bcache colyli
                   ` (14 subsequent siblings)
  16 siblings, 0 replies; 24+ messages in thread
From: colyli @ 2020-01-23 17:01 UTC (permalink / raw)
  To: axboe; +Cc: linux-bcache, linux-block, Christoph Hellwig, Coly Li

From: Christoph Hellwig <hch@lst.de>

Split out an on-disk version of struct cache_sb with the proper endianness
annotations.  This fixes a fair chunk of sparse warnings, but there are
some left due to the way the checksum is defined.
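
For reference, a minimal sketch (not part of the patch) of what the
separate on-disk struct lets sparse check: fields declared __le64 have
to go through the byte-order helpers before being used as host-endian
values,

	struct cache_sb_disk *s = read_sb_somehow();	/* hypothetical helper */
	__u64 offset;

	offset = le64_to_cpu(s->offset);	/* OK: explicit conversion */
	offset = s->offset;			/* sparse warns: restricted __le64 */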

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Coly Li <colyli@suse.de>
---
 drivers/md/bcache/super.c   |  6 +++---
 include/uapi/linux/bcache.h | 51 +++++++++++++++++++++++++++++++++++++++++++++
 2 files changed, 54 insertions(+), 3 deletions(-)

diff --git a/drivers/md/bcache/super.c b/drivers/md/bcache/super.c
index a573ce1d85aa..3045f27e0d67 100644
--- a/drivers/md/bcache/super.c
+++ b/drivers/md/bcache/super.c
@@ -63,14 +63,14 @@ static const char *read_super(struct cache_sb *sb, struct block_device *bdev,
 			      struct page **res)
 {
 	const char *err;
-	struct cache_sb *s;
+	struct cache_sb_disk *s;
 	struct buffer_head *bh = __bread(bdev, 1, SB_SIZE);
 	unsigned int i;
 
 	if (!bh)
 		return "IO error";
 
-	s = (struct cache_sb *) bh->b_data;
+	s = (struct cache_sb_disk *)bh->b_data;
 
 	sb->offset		= le64_to_cpu(s->offset);
 	sb->version		= le64_to_cpu(s->version);
@@ -209,7 +209,7 @@ static void write_bdev_super_endio(struct bio *bio)
 
 static void __write_super(struct cache_sb *sb, struct bio *bio)
 {
-	struct cache_sb *out = page_address(bio_first_page_all(bio));
+	struct cache_sb_disk *out = page_address(bio_first_page_all(bio));
 	unsigned int i;
 
 	bio->bi_iter.bi_sector	= SB_SECTOR;
diff --git a/include/uapi/linux/bcache.h b/include/uapi/linux/bcache.h
index 5d4f58e059fd..1d8b3a9fc080 100644
--- a/include/uapi/linux/bcache.h
+++ b/include/uapi/linux/bcache.h
@@ -156,6 +156,57 @@ static inline struct bkey *bkey_idx(const struct bkey *k, unsigned int nr_keys)
 
 #define BDEV_DATA_START_DEFAULT		16	/* sectors */
 
+struct cache_sb_disk {
+	__le64			csum;
+	__le64			offset;	/* sector where this sb was written */
+	__le64			version;
+
+	__u8			magic[16];
+
+	__u8			uuid[16];
+	union {
+		__u8		set_uuid[16];
+		__le64		set_magic;
+	};
+	__u8			label[SB_LABEL_SIZE];
+
+	__le64			flags;
+	__le64			seq;
+	__le64			pad[8];
+
+	union {
+	struct {
+		/* Cache devices */
+		__le64		nbuckets;	/* device size */
+
+		__le16		block_size;	/* sectors */
+		__le16		bucket_size;	/* sectors */
+
+		__le16		nr_in_set;
+		__le16		nr_this_dev;
+	};
+	struct {
+		/* Backing devices */
+		__le64		data_offset;
+
+		/*
+		 * block_size from the cache device section is still used by
+		 * backing devices, so don't add anything here until we fix
+		 * things to not need it for backing devices anymore
+		 */
+	};
+	};
+
+	__le32			last_mount;	/* time overflow in y2106 */
+
+	__le16			first_bucket;
+	union {
+		__le16		njournal_buckets;
+		__le16		keys;
+	};
+	__le64			d[SB_JOURNAL_BUCKETS];	/* journal buckets */
+};
+
 struct cache_sb {
 	__u64			csum;
 	__u64			offset;	/* sector where this sb was written */
-- 
2.16.4


^ permalink raw reply related	[flat|nested] 24+ messages in thread

* [PATCH 03/17] bcache: rework error unwinding in register_bcache
  2020-01-23 17:01 [PATCH 00/17] bcache patches for Linux v5.6 colyli
  2020-01-23 17:01 ` [PATCH 01/17] bcache: cached_dev_free needs to put the sb page colyli
  2020-01-23 17:01 ` [PATCH 02/17] bcache: use a separate data structure for the on-disk super block colyli
@ 2020-01-23 17:01 ` colyli
  2020-01-23 17:01 ` [PATCH 04/17] bcache: properly initialize 'path' and 'err' in register_bcache() colyli
                   ` (13 subsequent siblings)
  16 siblings, 0 replies; 24+ messages in thread
From: colyli @ 2020-01-23 17:01 UTC (permalink / raw)
  To: axboe; +Cc: linux-bcache, linux-block, Christoph Hellwig, Coly Li

From: Christoph Hellwig <hch@lst.de>

Split the successful and error return paths, and use one goto label for
each resource to unwind.  This also fixes some small errors like leaking
the module reference count in the reboot case (which seems entirely
harmless) and printing the wrong warning messages for early failures.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Coly Li <colyli@suse.de>
---
 drivers/md/bcache/super.c | 75 ++++++++++++++++++++++++++++-------------------
 1 file changed, 45 insertions(+), 30 deletions(-)

diff --git a/drivers/md/bcache/super.c b/drivers/md/bcache/super.c
index 3045f27e0d67..e8013e1b0a14 100644
--- a/drivers/md/bcache/super.c
+++ b/drivers/md/bcache/super.c
@@ -2375,29 +2375,33 @@ static bool bch_is_open(struct block_device *bdev)
 static ssize_t register_bcache(struct kobject *k, struct kobj_attribute *attr,
 			       const char *buffer, size_t size)
 {
-	ssize_t ret = -EINVAL;
-	const char *err = "cannot allocate memory";
-	char *path = NULL;
-	struct cache_sb *sb = NULL;
+	const char *err;
+	char *path;
+	struct cache_sb *sb;
 	struct block_device *bdev = NULL;
-	struct page *sb_page = NULL;
+	struct page *sb_page;
+	ssize_t ret;
 
+	ret = -EBUSY;
 	if (!try_module_get(THIS_MODULE))
-		return -EBUSY;
+		goto out;
 
 	/* For latest state of bcache_is_reboot */
 	smp_mb();
 	if (bcache_is_reboot)
-		return -EBUSY;
+		goto out_module_put;
 
+	ret = -ENOMEM;
+	err = "cannot allocate memory";
 	path = kstrndup(buffer, size, GFP_KERNEL);
 	if (!path)
-		goto err;
+		goto out_module_put;
 
 	sb = kmalloc(sizeof(struct cache_sb), GFP_KERNEL);
 	if (!sb)
-		goto err;
+		goto out_free_path;
 
+	ret = -EINVAL;
 	err = "failed to open device";
 	bdev = blkdev_get_by_path(strim(path),
 				  FMODE_READ|FMODE_WRITE|FMODE_EXCL,
@@ -2414,57 +2418,68 @@ static ssize_t register_bcache(struct kobject *k, struct kobj_attribute *attr,
 			if (!IS_ERR(bdev))
 				bdput(bdev);
 			if (attr == &ksysfs_register_quiet)
-				goto quiet_out;
+				goto done;
 		}
-		goto err;
+		goto out_free_sb;
 	}
 
 	err = "failed to set blocksize";
 	if (set_blocksize(bdev, 4096))
-		goto err_close;
+		goto out_blkdev_put;
 
 	err = read_super(sb, bdev, &sb_page);
 	if (err)
-		goto err_close;
+		goto out_blkdev_put;
 
 	err = "failed to register device";
 	if (SB_IS_BDEV(sb)) {
 		struct cached_dev *dc = kzalloc(sizeof(*dc), GFP_KERNEL);
 
 		if (!dc)
-			goto err_close;
+			goto out_put_sb_page;
 
 		mutex_lock(&bch_register_lock);
 		ret = register_bdev(sb, sb_page, bdev, dc);
 		mutex_unlock(&bch_register_lock);
 		/* blkdev_put() will be called in cached_dev_free() */
-		if (ret < 0)
-			goto err;
+		if (ret < 0) {
+			bdev = NULL;
+			goto out_put_sb_page;
+		}
 	} else {
 		struct cache *ca = kzalloc(sizeof(*ca), GFP_KERNEL);
 
 		if (!ca)
-			goto err_close;
+			goto out_put_sb_page;
 
 		/* blkdev_put() will be called in bch_cache_release() */
-		if (register_cache(sb, sb_page, bdev, ca) != 0)
-			goto err;
+		if (register_cache(sb, sb_page, bdev, ca) != 0) {
+			bdev = NULL;
+			goto out_put_sb_page;
+		}
 	}
-quiet_out:
-	ret = size;
-out:
-	if (sb_page)
-		put_page(sb_page);
+
+	put_page(sb_page);
+done:
 	kfree(sb);
 	kfree(path);
 	module_put(THIS_MODULE);
-	return ret;
-
-err_close:
-	blkdev_put(bdev, FMODE_READ|FMODE_WRITE|FMODE_EXCL);
-err:
+	return size;
+
+out_put_sb_page:
+	put_page(sb_page);
+out_blkdev_put:
+	if (bdev)
+		blkdev_put(bdev, FMODE_READ | FMODE_WRITE | FMODE_EXCL);
+out_free_sb:
+	kfree(sb);
+out_free_path:
+	kfree(path);
+out_module_put:
+	module_put(THIS_MODULE);
+out:
 	pr_info("error %s: %s", path, err);
-	goto out;
+	return ret;
 }
 
 
-- 
2.16.4


^ permalink raw reply related	[flat|nested] 24+ messages in thread

* [PATCH 04/17] bcache: properly initialize 'path' and 'err' in register_bcache()
  2020-01-23 17:01 [PATCH 00/17] bcache patches for Linux v5.6 colyli
                   ` (2 preceding siblings ...)
  2020-01-23 17:01 ` [PATCH 03/17] bcache: rework error unwinding in register_bcache colyli
@ 2020-01-23 17:01 ` colyli
  2020-01-23 17:01 ` [PATCH 05/17] bcache: fix use-after-free " colyli
                   ` (12 subsequent siblings)
  16 siblings, 0 replies; 24+ messages in thread
From: colyli @ 2020-01-23 17:01 UTC (permalink / raw)
  To: axboe; +Cc: linux-bcache, linux-block, Coly Li

From: Coly Li <colyli@suse.de>

Patch "bcache: rework error unwinding in register_bcache" from
Christoph Hellwig changes the local variables 'path' and 'err'
in undefined initial state. If the code in register_bcache() jumps
to label 'out:' or 'out_module_put:' by goto, these two variables
might be reference with undefined value by the following line,

	out_module_put:
	        module_put(THIS_MODULE);
	out:
	        pr_info("error %s: %s", path, err);
	        return ret;

Therefore this patch initializes these two local variables properly
in register_bcache() to avoid such an issue.

Signed-off-by: Coly Li <colyli@suse.de>
---
 drivers/md/bcache/super.c | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/drivers/md/bcache/super.c b/drivers/md/bcache/super.c
index e8013e1b0a14..ee7c87f38d0d 100644
--- a/drivers/md/bcache/super.c
+++ b/drivers/md/bcache/super.c
@@ -2376,18 +2376,20 @@ static ssize_t register_bcache(struct kobject *k, struct kobj_attribute *attr,
 			       const char *buffer, size_t size)
 {
 	const char *err;
-	char *path;
+	char *path = NULL;
 	struct cache_sb *sb;
 	struct block_device *bdev = NULL;
 	struct page *sb_page;
 	ssize_t ret;
 
 	ret = -EBUSY;
+	err = "failed to reference bcache module";
 	if (!try_module_get(THIS_MODULE))
 		goto out;
 
 	/* For latest state of bcache_is_reboot */
 	smp_mb();
+	err = "bcache is in reboot";
 	if (bcache_is_reboot)
 		goto out_module_put;
 
-- 
2.16.4


^ permalink raw reply related	[flat|nested] 24+ messages in thread

* [PATCH 05/17] bcache: fix use-after-free in register_bcache()
  2020-01-23 17:01 [PATCH 00/17] bcache patches for Linux v5.6 colyli
                   ` (3 preceding siblings ...)
  2020-01-23 17:01 ` [PATCH 04/17] bcache: properly initialize 'path' and 'err' in register_bcache() colyli
@ 2020-01-23 17:01 ` colyli
  2020-01-23 17:01 ` [PATCH 06/17] bcache: transfer the sb_page reference to register_{bdev,cache} colyli
                   ` (11 subsequent siblings)
  16 siblings, 0 replies; 24+ messages in thread
From: colyli @ 2020-01-23 17:01 UTC (permalink / raw)
  To: axboe; +Cc: linux-bcache, linux-block, Coly Li, Christoph Hellwig

From: Coly Li <colyli@suse.de>

The patch "bcache: rework error unwinding in register_bcache" introduces
a use-after-free regression in register_bcache(). Here are current code,
	2510 out_free_path:
	2511         kfree(path);
	2512 out_module_put:
	2513         module_put(THIS_MODULE);
	2514 out:
	2515         pr_info("error %s: %s", path, err);
	2516         return ret;
If an error happens and the above code path is executed, 'path' is
freed at line 2511 but still referenced at line 2515, and KASAN reports
a use-after-free error.

This patch changes line 2515 in the following way to fix the problem,
	2515         pr_info("error %s: %s", path?path:"", err);

Signed-off-by: Coly Li <colyli@suse.de>
Cc: Christoph Hellwig <hch@lst.de>
---
 drivers/md/bcache/super.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/drivers/md/bcache/super.c b/drivers/md/bcache/super.c
index ee7c87f38d0d..fad9c6cbee5e 100644
--- a/drivers/md/bcache/super.c
+++ b/drivers/md/bcache/super.c
@@ -2477,10 +2477,11 @@ static ssize_t register_bcache(struct kobject *k, struct kobj_attribute *attr,
 	kfree(sb);
 out_free_path:
 	kfree(path);
+	path = NULL;
 out_module_put:
 	module_put(THIS_MODULE);
 out:
-	pr_info("error %s: %s", path, err);
+	pr_info("error %s: %s", path?path:"", err);
 	return ret;
 }
 
-- 
2.16.4


^ permalink raw reply related	[flat|nested] 24+ messages in thread

* [PATCH 06/17] bcache: transfer the sb_page reference to register_{bdev,cache}
  2020-01-23 17:01 [PATCH 00/17] bcache patches for Linux v5.6 colyli
                   ` (4 preceding siblings ...)
  2020-01-23 17:01 ` [PATCH 05/17] bcache: fix use-after-free " colyli
@ 2020-01-23 17:01 ` colyli
  2020-01-23 17:01 ` [PATCH 07/17] bcache: return a pointer to the on-disk sb from read_super colyli
                   ` (10 subsequent siblings)
  16 siblings, 0 replies; 24+ messages in thread
From: colyli @ 2020-01-23 17:01 UTC (permalink / raw)
  To: axboe; +Cc: linux-bcache, linux-block, Christoph Hellwig, Coly Li

From: Christoph Hellwig <hch@lst.de>

Avoid an extra reference count roundtrip by transferring the sb_page
ownership to the lower level register helpers.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Coly Li <colyli@suse.de>
---
 drivers/md/bcache/super.c | 21 ++++++---------------
 1 file changed, 6 insertions(+), 15 deletions(-)

diff --git a/drivers/md/bcache/super.c b/drivers/md/bcache/super.c
index fad9c6cbee5e..cf2cafef381f 100644
--- a/drivers/md/bcache/super.c
+++ b/drivers/md/bcache/super.c
@@ -1368,8 +1368,6 @@ static int register_bdev(struct cache_sb *sb, struct page *sb_page,
 
 	bio_init(&dc->sb_bio, dc->sb_bio.bi_inline_vecs, 1);
 	bio_first_bvec_all(&dc->sb_bio)->bv_page = sb_page;
-	get_page(sb_page);
-
 
 	if (cached_dev_init(dc, sb->block_size << 9))
 		goto err;
@@ -2275,7 +2273,6 @@ static int register_cache(struct cache_sb *sb, struct page *sb_page,
 
 	bio_init(&ca->sb_bio, ca->sb_bio.bi_inline_vecs, 1);
 	bio_first_bvec_all(&ca->sb_bio)->bv_page = sb_page;
-	get_page(sb_page);
 
 	if (blk_queue_discard(bdev_get_queue(bdev)))
 		ca->discard = CACHE_DISCARD(&ca->sb);
@@ -2378,7 +2375,7 @@ static ssize_t register_bcache(struct kobject *k, struct kobj_attribute *attr,
 	const char *err;
 	char *path = NULL;
 	struct cache_sb *sb;
-	struct block_device *bdev = NULL;
+	struct block_device *bdev;
 	struct page *sb_page;
 	ssize_t ret;
 
@@ -2444,10 +2441,8 @@ static ssize_t register_bcache(struct kobject *k, struct kobj_attribute *attr,
 		ret = register_bdev(sb, sb_page, bdev, dc);
 		mutex_unlock(&bch_register_lock);
 		/* blkdev_put() will be called in cached_dev_free() */
-		if (ret < 0) {
-			bdev = NULL;
-			goto out_put_sb_page;
-		}
+		if (ret < 0)
+			goto out_free_sb;
 	} else {
 		struct cache *ca = kzalloc(sizeof(*ca), GFP_KERNEL);
 
@@ -2455,13 +2450,10 @@ static ssize_t register_bcache(struct kobject *k, struct kobj_attribute *attr,
 			goto out_put_sb_page;
 
 		/* blkdev_put() will be called in bch_cache_release() */
-		if (register_cache(sb, sb_page, bdev, ca) != 0) {
-			bdev = NULL;
-			goto out_put_sb_page;
-		}
+		if (register_cache(sb, sb_page, bdev, ca) != 0)
+			goto out_free_sb;
 	}
 
-	put_page(sb_page);
 done:
 	kfree(sb);
 	kfree(path);
@@ -2471,8 +2463,7 @@ static ssize_t register_bcache(struct kobject *k, struct kobj_attribute *attr,
 out_put_sb_page:
 	put_page(sb_page);
 out_blkdev_put:
-	if (bdev)
-		blkdev_put(bdev, FMODE_READ | FMODE_WRITE | FMODE_EXCL);
+	blkdev_put(bdev, FMODE_READ | FMODE_WRITE | FMODE_EXCL);
 out_free_sb:
 	kfree(sb);
 out_free_path:
-- 
2.16.4


^ permalink raw reply related	[flat|nested] 24+ messages in thread

* [PATCH 07/17] bcache: return a pointer to the on-disk sb from read_super
  2020-01-23 17:01 [PATCH 00/17] bcache patches for Linux v5.6 colyli
                   ` (5 preceding siblings ...)
  2020-01-23 17:01 ` [PATCH 06/17] bcache: transfer the sb_page reference to register_{bdev,cache} colyli
@ 2020-01-23 17:01 ` colyli
  2020-01-23 17:01 ` [PATCH 08/17] bcache: store a pointer to the on-disk sb in the cache and cached_dev structures colyli
                   ` (9 subsequent siblings)
  16 siblings, 0 replies; 24+ messages in thread
From: colyli @ 2020-01-23 17:01 UTC (permalink / raw)
  To: axboe; +Cc: linux-bcache, linux-block, Christoph Hellwig, Coly Li

From: Christoph Hellwig <hch@lst.de>

Returning the properly typed actual data structure instead of the
containing struct page will save the callers some work going
forward.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Coly Li <colyli@suse.de>
---
 drivers/md/bcache/super.c | 22 +++++++++++-----------
 1 file changed, 11 insertions(+), 11 deletions(-)

diff --git a/drivers/md/bcache/super.c b/drivers/md/bcache/super.c
index cf2cafef381f..b30e4c6011d2 100644
--- a/drivers/md/bcache/super.c
+++ b/drivers/md/bcache/super.c
@@ -60,7 +60,7 @@ struct workqueue_struct *bch_journal_wq;
 /* Superblock */
 
 static const char *read_super(struct cache_sb *sb, struct block_device *bdev,
-			      struct page **res)
+			      struct cache_sb_disk **res)
 {
 	const char *err;
 	struct cache_sb_disk *s;
@@ -191,7 +191,7 @@ static const char *read_super(struct cache_sb *sb, struct block_device *bdev,
 	err = NULL;
 
 	get_page(bh->b_page);
-	*res = bh->b_page;
+	*res = s;
 err:
 	put_bh(bh);
 	return err;
@@ -1353,7 +1353,7 @@ static int cached_dev_init(struct cached_dev *dc, unsigned int block_size)
 
 /* Cached device - bcache superblock */
 
-static int register_bdev(struct cache_sb *sb, struct page *sb_page,
+static int register_bdev(struct cache_sb *sb, struct cache_sb_disk *sb_disk,
 				 struct block_device *bdev,
 				 struct cached_dev *dc)
 {
@@ -1367,7 +1367,7 @@ static int register_bdev(struct cache_sb *sb, struct page *sb_page,
 	dc->bdev->bd_holder = dc;
 
 	bio_init(&dc->sb_bio, dc->sb_bio.bi_inline_vecs, 1);
-	bio_first_bvec_all(&dc->sb_bio)->bv_page = sb_page;
+	bio_first_bvec_all(&dc->sb_bio)->bv_page = virt_to_page(sb_disk);
 
 	if (cached_dev_init(dc, sb->block_size << 9))
 		goto err;
@@ -2260,7 +2260,7 @@ static int cache_alloc(struct cache *ca)
 	return ret;
 }
 
-static int register_cache(struct cache_sb *sb, struct page *sb_page,
+static int register_cache(struct cache_sb *sb, struct cache_sb_disk *sb_disk,
 				struct block_device *bdev, struct cache *ca)
 {
 	const char *err = NULL; /* must be set for any error case */
@@ -2272,7 +2272,7 @@ static int register_cache(struct cache_sb *sb, struct page *sb_page,
 	ca->bdev->bd_holder = ca;
 
 	bio_init(&ca->sb_bio, ca->sb_bio.bi_inline_vecs, 1);
-	bio_first_bvec_all(&ca->sb_bio)->bv_page = sb_page;
+	bio_first_bvec_all(&ca->sb_bio)->bv_page = virt_to_page(sb_disk);
 
 	if (blk_queue_discard(bdev_get_queue(bdev)))
 		ca->discard = CACHE_DISCARD(&ca->sb);
@@ -2375,8 +2375,8 @@ static ssize_t register_bcache(struct kobject *k, struct kobj_attribute *attr,
 	const char *err;
 	char *path = NULL;
 	struct cache_sb *sb;
+	struct cache_sb_disk *sb_disk;
 	struct block_device *bdev;
-	struct page *sb_page;
 	ssize_t ret;
 
 	ret = -EBUSY;
@@ -2426,7 +2426,7 @@ static ssize_t register_bcache(struct kobject *k, struct kobj_attribute *attr,
 	if (set_blocksize(bdev, 4096))
 		goto out_blkdev_put;
 
-	err = read_super(sb, bdev, &sb_page);
+	err = read_super(sb, bdev, &sb_disk);
 	if (err)
 		goto out_blkdev_put;
 
@@ -2438,7 +2438,7 @@ static ssize_t register_bcache(struct kobject *k, struct kobj_attribute *attr,
 			goto out_put_sb_page;
 
 		mutex_lock(&bch_register_lock);
-		ret = register_bdev(sb, sb_page, bdev, dc);
+		ret = register_bdev(sb, sb_disk, bdev, dc);
 		mutex_unlock(&bch_register_lock);
 		/* blkdev_put() will be called in cached_dev_free() */
 		if (ret < 0)
@@ -2450,7 +2450,7 @@ static ssize_t register_bcache(struct kobject *k, struct kobj_attribute *attr,
 			goto out_put_sb_page;
 
 		/* blkdev_put() will be called in bch_cache_release() */
-		if (register_cache(sb, sb_page, bdev, ca) != 0)
+		if (register_cache(sb, sb_disk, bdev, ca) != 0)
 			goto out_free_sb;
 	}
 
@@ -2461,7 +2461,7 @@ static ssize_t register_bcache(struct kobject *k, struct kobj_attribute *attr,
 	return size;
 
 out_put_sb_page:
-	put_page(sb_page);
+	put_page(virt_to_page(sb_disk));
 out_blkdev_put:
 	blkdev_put(bdev, FMODE_READ | FMODE_WRITE | FMODE_EXCL);
 out_free_sb:
-- 
2.16.4


^ permalink raw reply related	[flat|nested] 24+ messages in thread

* [PATCH 08/17] bcache: store a pointer to the on-disk sb in the cache and cached_dev structures
  2020-01-23 17:01 [PATCH 00/17] bcache patches for Linux v5.6 colyli
                   ` (6 preceding siblings ...)
  2020-01-23 17:01 ` [PATCH 07/17] bcache: return a pointer to the on-disk sb from read_super colyli
@ 2020-01-23 17:01 ` colyli
  2020-01-23 17:01 ` [PATCH 09/17] bcache: use read_cache_page_gfp to read the superblock colyli
                   ` (8 subsequent siblings)
  16 siblings, 0 replies; 24+ messages in thread
From: colyli @ 2020-01-23 17:01 UTC (permalink / raw)
  To: axboe; +Cc: linux-bcache, linux-block, Christoph Hellwig, Coly Li

From: Christoph Hellwig <hch@lst.de>

This allows the superblock bio to be built properly, including the
offset within the page, using the normal bio helpers.  This fixes
writing the superblock for page sizes larger than 4k, where the sb
write bio needs an offset in the bio_vec.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Coly Li <colyli@suse.de>
---
 drivers/md/bcache/bcache.h |  2 ++
 drivers/md/bcache/super.c  | 34 +++++++++++++++-------------------
 2 files changed, 17 insertions(+), 19 deletions(-)

diff --git a/drivers/md/bcache/bcache.h b/drivers/md/bcache/bcache.h
index 9198c1b480d9..adf26a21fcd1 100644
--- a/drivers/md/bcache/bcache.h
+++ b/drivers/md/bcache/bcache.h
@@ -301,6 +301,7 @@ struct cached_dev {
 	struct block_device	*bdev;
 
 	struct cache_sb		sb;
+	struct cache_sb_disk	*sb_disk;
 	struct bio		sb_bio;
 	struct bio_vec		sb_bv[1];
 	struct closure		sb_write;
@@ -403,6 +404,7 @@ enum alloc_reserve {
 struct cache {
 	struct cache_set	*set;
 	struct cache_sb		sb;
+	struct cache_sb_disk	*sb_disk;
 	struct bio		sb_bio;
 	struct bio_vec		sb_bv[1];
 
diff --git a/drivers/md/bcache/super.c b/drivers/md/bcache/super.c
index b30e4c6011d2..5797c03f993e 100644
--- a/drivers/md/bcache/super.c
+++ b/drivers/md/bcache/super.c
@@ -207,15 +207,15 @@ static void write_bdev_super_endio(struct bio *bio)
 	closure_put(&dc->sb_write);
 }
 
-static void __write_super(struct cache_sb *sb, struct bio *bio)
+static void __write_super(struct cache_sb *sb, struct cache_sb_disk *out,
+		struct bio *bio)
 {
-	struct cache_sb_disk *out = page_address(bio_first_page_all(bio));
 	unsigned int i;
 
+	bio->bi_opf = REQ_OP_WRITE | REQ_SYNC | REQ_META;
 	bio->bi_iter.bi_sector	= SB_SECTOR;
-	bio->bi_iter.bi_size	= SB_SIZE;
-	bio_set_op_attrs(bio, REQ_OP_WRITE, REQ_SYNC|REQ_META);
-	bch_bio_map(bio, NULL);
+	__bio_add_page(bio, virt_to_page(out), SB_SIZE,
+			offset_in_page(out));
 
 	out->offset		= cpu_to_le64(sb->offset);
 	out->version		= cpu_to_le64(sb->version);
@@ -257,14 +257,14 @@ void bch_write_bdev_super(struct cached_dev *dc, struct closure *parent)
 	down(&dc->sb_write_mutex);
 	closure_init(cl, parent);
 
-	bio_reset(bio);
+	bio_init(bio, dc->sb_bv, 1);
 	bio_set_dev(bio, dc->bdev);
 	bio->bi_end_io	= write_bdev_super_endio;
 	bio->bi_private = dc;
 
 	closure_get(cl);
 	/* I/O request sent to backing device */
-	__write_super(&dc->sb, bio);
+	__write_super(&dc->sb, dc->sb_disk, bio);
 
 	closure_return_with_destructor(cl, bch_write_bdev_super_unlock);
 }
@@ -306,13 +306,13 @@ void bcache_write_super(struct cache_set *c)
 
 		SET_CACHE_SYNC(&ca->sb, CACHE_SYNC(&c->sb));
 
-		bio_reset(bio);
+		bio_init(bio, ca->sb_bv, 1);
 		bio_set_dev(bio, ca->bdev);
 		bio->bi_end_io	= write_super_endio;
 		bio->bi_private = ca;
 
 		closure_get(cl);
-		__write_super(&ca->sb, bio);
+		__write_super(&ca->sb, ca->sb_disk, bio);
 	}
 
 	closure_return_with_destructor(cl, bcache_write_super_unlock);
@@ -1275,8 +1275,8 @@ static void cached_dev_free(struct closure *cl)
 
 	mutex_unlock(&bch_register_lock);
 
-	if (dc->sb_bio.bi_inline_vecs[0].bv_page)
-		put_page(bio_first_page_all(&dc->sb_bio));
+	if (dc->sb_disk)
+		put_page(virt_to_page(dc->sb_disk));
 
 	if (!IS_ERR_OR_NULL(dc->bdev))
 		blkdev_put(dc->bdev, FMODE_READ|FMODE_WRITE|FMODE_EXCL);
@@ -1365,9 +1365,7 @@ static int register_bdev(struct cache_sb *sb, struct cache_sb_disk *sb_disk,
 	memcpy(&dc->sb, sb, sizeof(struct cache_sb));
 	dc->bdev = bdev;
 	dc->bdev->bd_holder = dc;
-
-	bio_init(&dc->sb_bio, dc->sb_bio.bi_inline_vecs, 1);
-	bio_first_bvec_all(&dc->sb_bio)->bv_page = virt_to_page(sb_disk);
+	dc->sb_disk = sb_disk;
 
 	if (cached_dev_init(dc, sb->block_size << 9))
 		goto err;
@@ -2137,8 +2135,8 @@ void bch_cache_release(struct kobject *kobj)
 	for (i = 0; i < RESERVE_NR; i++)
 		free_fifo(&ca->free[i]);
 
-	if (ca->sb_bio.bi_inline_vecs[0].bv_page)
-		put_page(bio_first_page_all(&ca->sb_bio));
+	if (ca->sb_disk)
+		put_page(virt_to_page(ca->sb_disk));
 
 	if (!IS_ERR_OR_NULL(ca->bdev))
 		blkdev_put(ca->bdev, FMODE_READ|FMODE_WRITE|FMODE_EXCL);
@@ -2270,9 +2268,7 @@ static int register_cache(struct cache_sb *sb, struct cache_sb_disk *sb_disk,
 	memcpy(&ca->sb, sb, sizeof(struct cache_sb));
 	ca->bdev = bdev;
 	ca->bdev->bd_holder = ca;
-
-	bio_init(&ca->sb_bio, ca->sb_bio.bi_inline_vecs, 1);
-	bio_first_bvec_all(&ca->sb_bio)->bv_page = virt_to_page(sb_disk);
+	ca->sb_disk = sb_disk;
 
 	if (blk_queue_discard(bdev_get_queue(bdev)))
 		ca->discard = CACHE_DISCARD(&ca->sb);
-- 
2.16.4


^ permalink raw reply related	[flat|nested] 24+ messages in thread

* [PATCH 09/17] bcache: use read_cache_page_gfp to read the superblock
  2020-01-23 17:01 [PATCH 00/17] bcache patches for Linux v5.6 colyli
                   ` (7 preceding siblings ...)
  2020-01-23 17:01 ` [PATCH 08/17] bcache: store a pointer to the on-disk sb in the cache and cached_dev structures colyli
@ 2020-01-23 17:01 ` colyli
  2020-01-23 17:01 ` [PATCH 10/17] lib: crc64: include <linux/crc64.h> for 'crc64_be' colyli
                   ` (7 subsequent siblings)
  16 siblings, 0 replies; 24+ messages in thread
From: colyli @ 2020-01-23 17:01 UTC (permalink / raw)
  To: axboe; +Cc: linux-bcache, linux-block, Christoph Hellwig, Coly Li

From: Christoph Hellwig <hch@lst.de>

Avoid a pointless dependency on buffer heads in bcache by simply open
coding reading a single page.  Also add a SB_OFFSET define for the
byte offset of the superblock instead of using magic numbers.
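
For reference, the arithmetic behind SB_OFFSET (a worked example, not
part of the patch):

	SB_OFFSET = SB_SECTOR << SECTOR_SHIFT = 8 << 9 = 4096 bytes

	4KB pages: SB_OFFSET >> PAGE_SHIFT = 1, offset_in_page(SB_OFFSET) = 0
	8KB pages: SB_OFFSET >> PAGE_SHIFT = 0, offset_in_page(SB_OFFSET) = 4096

so the read_cache_page_gfp() index and the in-page offset below work
for kernel page sizes of 4KB and larger.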

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Coly Li <colyli@suse.de>
---
 drivers/md/bcache/super.c   | 16 +++++++---------
 include/uapi/linux/bcache.h |  1 +
 2 files changed, 8 insertions(+), 9 deletions(-)

diff --git a/drivers/md/bcache/super.c b/drivers/md/bcache/super.c
index 5797c03f993e..3dea1d5acd5c 100644
--- a/drivers/md/bcache/super.c
+++ b/drivers/md/bcache/super.c
@@ -15,7 +15,6 @@
 #include "writeback.h"
 
 #include <linux/blkdev.h>
-#include <linux/buffer_head.h>
 #include <linux/debugfs.h>
 #include <linux/genhd.h>
 #include <linux/idr.h>
@@ -64,13 +63,14 @@ static const char *read_super(struct cache_sb *sb, struct block_device *bdev,
 {
 	const char *err;
 	struct cache_sb_disk *s;
-	struct buffer_head *bh = __bread(bdev, 1, SB_SIZE);
+	struct page *page;
 	unsigned int i;
 
-	if (!bh)
+	page = read_cache_page_gfp(bdev->bd_inode->i_mapping,
+				   SB_OFFSET >> PAGE_SHIFT, GFP_KERNEL);
+	if (IS_ERR(page))
 		return "IO error";
-
-	s = (struct cache_sb_disk *)bh->b_data;
+	s = page_address(page) + offset_in_page(SB_OFFSET);
 
 	sb->offset		= le64_to_cpu(s->offset);
 	sb->version		= le64_to_cpu(s->version);
@@ -188,12 +188,10 @@ static const char *read_super(struct cache_sb *sb, struct block_device *bdev,
 	}
 
 	sb->last_mount = (u32)ktime_get_real_seconds();
-	err = NULL;
-
-	get_page(bh->b_page);
 	*res = s;
+	return NULL;
 err:
-	put_bh(bh);
+	put_page(page);
 	return err;
 }
 
diff --git a/include/uapi/linux/bcache.h b/include/uapi/linux/bcache.h
index 1d8b3a9fc080..9a1965c6c3d0 100644
--- a/include/uapi/linux/bcache.h
+++ b/include/uapi/linux/bcache.h
@@ -148,6 +148,7 @@ static inline struct bkey *bkey_idx(const struct bkey *k, unsigned int nr_keys)
 #define BCACHE_SB_MAX_VERSION		4
 
 #define SB_SECTOR			8
+#define SB_OFFSET			(SB_SECTOR << SECTOR_SHIFT)
 #define SB_SIZE				4096
 #define SB_LABEL_SIZE			32
 #define SB_JOURNAL_BUCKETS		256U
-- 
2.16.4


^ permalink raw reply related	[flat|nested] 24+ messages in thread

* [PATCH 10/17] lib: crc64: include <linux/crc64.h> for 'crc64_be'
  2020-01-23 17:01 [PATCH 00/17] bcache patches for Linux v5.6 colyli
                   ` (8 preceding siblings ...)
  2020-01-23 17:01 ` [PATCH 09/17] bcache: use read_cache_page_gfp to read the superblock colyli
@ 2020-01-23 17:01 ` colyli
  2020-01-23 17:01 ` [PATCH 11/17] bcache: add code comments for state->pool in __btree_sort() colyli
                   ` (6 subsequent siblings)
  16 siblings, 0 replies; 24+ messages in thread
From: colyli @ 2020-01-23 17:01 UTC (permalink / raw)
  To: axboe; +Cc: linux-bcache, linux-block, Ben Dooks (Codethink), Coly Li

From: "Ben Dooks (Codethink)" <ben.dooks@codethink.co.uk>

crc64_be() is declared in <linux/crc64.h>, so include that header
where the symbol is defined to avoid the following
warning:

lib/crc64.c:43:12: warning: symbol 'crc64_be' was not declared. Should it be static?

Signed-off-by: Ben Dooks (Codethink) <ben.dooks@codethink.co.uk>
Reviewed-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
Signed-off-by: Coly Li <colyli@suse.de>
---
 lib/crc64.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/lib/crc64.c b/lib/crc64.c
index 0ef8ae6ac047..f8928ce28280 100644
--- a/lib/crc64.c
+++ b/lib/crc64.c
@@ -28,6 +28,7 @@
 
 #include <linux/module.h>
 #include <linux/types.h>
+#include <linux/crc64.h>
 #include "crc64table.h"
 
 MODULE_DESCRIPTION("CRC64 calculations");
-- 
2.16.4


^ permalink raw reply related	[flat|nested] 24+ messages in thread

* [PATCH 11/17] bcache: add code comments for state->pool in __btree_sort()
  2020-01-23 17:01 [PATCH 00/17] bcache patches for Linux v5.6 colyli
                   ` (9 preceding siblings ...)
  2020-01-23 17:01 ` [PATCH 10/17] lib: crc64: include <linux/crc64.h> for 'crc64_be' colyli
@ 2020-01-23 17:01 ` colyli
  2020-01-23 17:01 ` [PATCH 12/17] bcache: avoid unnecessary btree nodes flushing in btree_flush_write() colyli
                   ` (5 subsequent siblings)
  16 siblings, 0 replies; 24+ messages in thread
From: colyli @ 2020-01-23 17:01 UTC (permalink / raw)
  To: axboe; +Cc: linux-bcache, linux-block, Coly Li

From: Coly Li <colyli@suse.de>

Explain why the pages allocated from the mempool state->pool can be
swapped in __btree_sort(): state->pool is a page pool, which actually
allocates pages with alloc_pages().
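
For reference, a rough sketch of the pattern (simplified names, not the
actual bset.c code), using the kernel mempool API:

	mempool_t pool;
	struct bset *out;

	/* elements of this mempool are real pages from alloc_pages() */
	mempool_init_page_pool(&pool, 1, page_order);

	out = page_address(mempool_alloc(&pool, GFP_NOIO));
	/*
	 * 'out' is backed by ordinary pages, so it can be swapped with a
	 * btree node's buffer and freed through the normal page path.
	 */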

Signed-off-by: Coly Li <colyli@suse.de>
---
 drivers/md/bcache/bset.c | 5 +++++
 1 file changed, 5 insertions(+)

diff --git a/drivers/md/bcache/bset.c b/drivers/md/bcache/bset.c
index cffcdc9feefb..4385303836d8 100644
--- a/drivers/md/bcache/bset.c
+++ b/drivers/md/bcache/bset.c
@@ -1257,6 +1257,11 @@ static void __btree_sort(struct btree_keys *b, struct btree_iter *iter,
 		 * Our temporary buffer is the same size as the btree node's
 		 * buffer, we can just swap buffers instead of doing a big
 		 * memcpy()
+		 *
+		 * Don't worry, even though 'out' is allocated from the
+		 * mempool it can still be swapped here, because state->pool
+		 * is a page mempool created by mempool_init_page_pool(),
+		 * which allocates pages with alloc_pages() indeed.
 		 */
 
 		out->magic	= b->set->data->magic;
-- 
2.16.4


^ permalink raw reply related	[flat|nested] 24+ messages in thread

* [PATCH 12/17] bcache: avoid unnecessary btree nodes flushing in btree_flush_write()
  2020-01-23 17:01 [PATCH 00/17] bcache patches for Linux v5.6 colyli
                   ` (10 preceding siblings ...)
  2020-01-23 17:01 ` [PATCH 11/17] bcache: add code comments for state->pool in __btree_sort() colyli
@ 2020-01-23 17:01 ` colyli
  2020-01-23 17:01 ` [PATCH 13/17] bcache: print written and keys in trace_bcache_btree_write colyli
                   ` (4 subsequent siblings)
  16 siblings, 0 replies; 24+ messages in thread
From: colyli @ 2020-01-23 17:01 UTC (permalink / raw)
  To: axboe; +Cc: linux-bcache, linux-block, Coly Li

From: Coly Li <colyli@suse.de>

Commit 91be66e1318f ("bcache: performance improvement for
btree_flush_write()") was an effort to flush the btree node with the
oldest journal entry faster, using the following methods,
- Only iterate dirty btree nodes in c->btree_cache, avoiding scanning a
  lot of clean btree nodes.
- Treat c->btree_cache as an LRU-like list, aggressively flushing all
  dirty nodes from the tail of c->btree_cache until the btree node with
  the oldest journal entry is flushed. This is to reduce the time
  c->bucket_lock is held.

Guoju Fang and Shuang Li reported that they observed unexpected extra
write I/Os on the cache device after applying the above patch. Guoju
Fang provided more detailed diagnosis showing that the aggressive btree
node flushing may cause 10x more btree nodes to be flushed in his
workload. He points out that when system memory is large enough to hold
all btree nodes, c->btree_cache is not an LRU-like list any more. Then
the btree node with the oldest journal entry is very probably not close
to the tail of the c->btree_cache list. In such a situation many more
dirty btree nodes are aggressively flushed before the target node is
flushed. When a slow SATA SSD is used as the cache device, such
over-aggressive flushing causes a performance regression.

After spending a lot of time on debugging and diagnosis, I find the
real situation is more complicated, and aggressively flushing dirty
btree nodes from the tail of the c->btree_cache list is not a good
solution.
- When all btree nodes are cached in memory, c->btree_cache is not an
  LRU-like list, and the btree nodes with the oldest journal entry
  won't be close to the tail of the list.
- There can be hundreds of dirty btree nodes referencing the oldest
  journal entry, and the oldest journal entry cannot be reclaimed
  before all of them are flushed.
When the above two conditions are combined, simply flushing from the
tail of the c->btree_cache list is really NOT a good idea.

Fortunately there is still a chance to make btree_flush_write() work
better. Here is how this patch avoids unnecessary btree node flushing,
- Only acquire c->journal.lock when getting the oldest journal entry of
  fifo c->journal.pin. In the remaining locations the journal entries
  are checked locklessly, so their values may be changed by other cores
  in parallel.
- In the list_for_each_entry_safe_reverse() loop, check the latest
  front point of fifo c->journal.pin. If it is different from the
  original point which we got while holding c->journal.lock, the oldest
  journal entry has been reclaimed on other cores. At this moment, all
  selected dirty nodes recorded in array btree_nodes[] have already
  been flushed and are clean on other CPU cores, so it is unnecessary
  to iterate c->btree_cache any longer. Just quit the
  list_for_each_entry_safe_reverse() loop, and the following for-loop
  will skip all the selected clean nodes.
- Find a proper time to quit the list_for_each_entry_safe_reverse()
  loop. Check the refcount value of the original fifo front point; if
  it is larger than the number of nodes selected into btree_nodes[],
  more matching btree nodes should be scanned. Otherwise there are no
  more matching btree nodes in the rest of the c->btree_cache list, and
  the loop can quit. If the original oldest journal entry is reclaimed
  and the fifo front point is updated, the refcount of the original
  fifo front point will be 0, and the loop quits too.
- Do not hold c->bucket_lock for too long. c->bucket_lock is also
  required for space allocation for cached data, and holding it for too
  long blocks regular I/O requests. When iterating the list
  c->btree_cache, even if there are a lot of matching btree nodes, only
  BTREE_FLUSH_NR nodes are selected and flushed in the following
  for-loop, in order not to hold c->bucket_lock for too long.

With this patch, only btree nodes referencing the oldest journal entry
are flushed to the cache device; there is no more aggressive flushing
of unnecessary btree nodes. And in order to avoid blocking regular I/O
requests, each time btree_flush_write() is called, at most
BTREE_FLUSH_NR btree nodes are selected to flush, even if there are
more matching btree nodes in list c->btree_cache.
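
To illustrate the selection check (a worked example, not part of the
patch): c->journal.pin is a circular fifo, and the distance macro tells
how far a journal entry is from the current front, modulo the fifo
size,

	nr_to_fifo_front(p, front_p, mask) == ((p) - (front_p)) & (mask)

	e.g. an 8-slot fifo (mask = 7), front at slot 6, entry at slot 1:
	     (1 - 6) & 7 == 3, the entry is 3 slots newer than the front;
	     a result of 0 means the node references the oldest journal entry.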

Finally, one more thing to explain: why is it safe to read the front
point of c->journal.pin without holding c->journal.lock inside the
list_for_each_entry_safe_reverse() loop?

Here is my answer: when reading the front point of fifo c->journal.pin,
we don't need to know its exact value, we just want to check whether it
differs from the original front point (which is an accurate value
because we got it while c->journal.lock was held). For that purpose it
works as expected without holding c->journal.lock. Even if the front
point is changed on another CPU core and not yet visible on the local
core, and the btree node currently being iterated references the same
journal entry as the originally fetched fifo front point, it is still
safe: after taking mutex b->write_lock (with its memory barrier) this
btree node will be found clean and skipped, and the loop will quit
later when iterating the next node of list c->btree_cache.

Fixes: 91be66e1318f ("bcache: performance improvement for btree_flush_write()")
Reported-by: Guoju Fang <fangguoju@gmail.com>
Reported-by: Shuang Li <psymon@bonuscloud.io>
Signed-off-by: Coly Li <colyli@suse.de>
---
 drivers/md/bcache/journal.c | 80 ++++++++++++++++++++++++++++++++++++++++++---
 1 file changed, 75 insertions(+), 5 deletions(-)

diff --git a/drivers/md/bcache/journal.c b/drivers/md/bcache/journal.c
index be2a2a201603..33ddc5269e8d 100644
--- a/drivers/md/bcache/journal.c
+++ b/drivers/md/bcache/journal.c
@@ -417,10 +417,14 @@ int bch_journal_replay(struct cache_set *s, struct list_head *list)
 
 /* Journalling */
 
+#define nr_to_fifo_front(p, front_p, mask)	(((p) - (front_p)) & (mask))
+
 static void btree_flush_write(struct cache_set *c)
 {
 	struct btree *b, *t, *btree_nodes[BTREE_FLUSH_NR];
-	unsigned int i, n;
+	unsigned int i, nr, ref_nr;
+	atomic_t *fifo_front_p, *now_fifo_front_p;
+	size_t mask;
 
 	if (c->journal.btree_flushing)
 		return;
@@ -433,12 +437,50 @@ static void btree_flush_write(struct cache_set *c)
 	c->journal.btree_flushing = true;
 	spin_unlock(&c->journal.flush_write_lock);
 
+	/* get the oldest journal entry and check its refcount */
+	spin_lock(&c->journal.lock);
+	fifo_front_p = &fifo_front(&c->journal.pin);
+	ref_nr = atomic_read(fifo_front_p);
+	if (ref_nr <= 0) {
+		/*
+		 * do nothing if no btree node references
+		 * the oldest journal entry
+		 */
+		spin_unlock(&c->journal.lock);
+		goto out;
+	}
+	spin_unlock(&c->journal.lock);
+
+	mask = c->journal.pin.mask;
+	nr = 0;
 	atomic_long_inc(&c->flush_write);
 	memset(btree_nodes, 0, sizeof(btree_nodes));
-	n = 0;
 
 	mutex_lock(&c->bucket_lock);
 	list_for_each_entry_safe_reverse(b, t, &c->btree_cache, list) {
+		/*
+		 * It is safe to get now_fifo_front_p without holding
+		 * c->journal.lock here, because we don't need to know
+		 * the exact value, just check whether the
+		 * front pointer of c->journal.pin is changed.
+		 */
+		now_fifo_front_p = &fifo_front(&c->journal.pin);
+		/*
+		 * If the oldest journal entry is reclaimed and front
+		 * pointer of c->journal.pin changes, it is unnecessary
+		 * to scan c->btree_cache anymore, just quit the loop and
+		 * flush out what we have already.
+		 */
+		if (now_fifo_front_p != fifo_front_p)
+			break;
+		/*
+		 * quit this loop if all matching btree nodes are
+		 * scanned and recorded in btree_nodes[] already.
+		 */
+		ref_nr = atomic_read(fifo_front_p);
+		if (nr >= ref_nr)
+			break;
+
 		if (btree_node_journal_flush(b))
 			pr_err("BUG: flush_write bit should not be set here!");
 
@@ -454,17 +496,44 @@ static void btree_flush_write(struct cache_set *c)
 			continue;
 		}
 
+		/*
+		 * Only select the btree node which exactly references
+		 * the oldest journal entry.
+		 *
+		 * If the journal entry pointed by fifo_front_p is
+		 * reclaimed in parallel, don't worry:
+		 * - the list_for_each_xxx loop will quit when checking
+		 *   next now_fifo_front_p.
+		 * - If there are matched nodes recorded in btree_nodes[],
+		 *   they are clean now (this is why and how the oldest
+		 *   journal entry can be reclaimed). These selected nodes
+		 *   will be ignored and skipped in the following for-loop.
+		 */
+		if (nr_to_fifo_front(btree_current_write(b)->journal,
+				     fifo_front_p,
+				     mask) != 0) {
+			mutex_unlock(&b->write_lock);
+			continue;
+		}
+
 		set_btree_node_journal_flush(b);
 
 		mutex_unlock(&b->write_lock);
 
-		btree_nodes[n++] = b;
-		if (n == BTREE_FLUSH_NR)
+		btree_nodes[nr++] = b;
+		/*
+		 * To avoid holding c->bucket_lock for too long, only
+		 * scan for at most BTREE_FLUSH_NR matched btree nodes.
+		 * If more btree nodes reference the oldest journal
+		 * entry, try to flush them the next time
+		 * btree_flush_write() is called.
+		 */
+		if (nr == BTREE_FLUSH_NR)
 			break;
 	}
 	mutex_unlock(&c->bucket_lock);
 
-	for (i = 0; i < n; i++) {
+	for (i = 0; i < nr; i++) {
 		b = btree_nodes[i];
 		if (!b) {
 			pr_err("BUG: btree_nodes[%d] is NULL", i);
@@ -497,6 +566,7 @@ static void btree_flush_write(struct cache_set *c)
 		mutex_unlock(&b->write_lock);
 	}
 
+out:
 	spin_lock(&c->journal.flush_write_lock);
 	c->journal.btree_flushing = false;
 	spin_unlock(&c->journal.flush_write_lock);
-- 
2.16.4


^ permalink raw reply related	[flat|nested] 24+ messages in thread

* [PATCH 13/17] bcache: print written and keys in trace_bcache_btree_write
  2020-01-23 17:01 [PATCH 00/17] bcache patches for Linux v5.6 colyli
                   ` (11 preceding siblings ...)
  2020-01-23 17:01 ` [PATCH 12/17] bcache: avoid unnecessary btree nodes flushing in btree_flush_write() colyli
@ 2020-01-23 17:01 ` colyli
  2020-01-23 17:01 ` [PATCH 14/17] bcache: back to cache all readahead I/Os colyli
                   ` (3 subsequent siblings)
  16 siblings, 0 replies; 24+ messages in thread
From: colyli @ 2020-01-23 17:01 UTC (permalink / raw)
  To: axboe; +Cc: linux-bcache, linux-block, Guoju Fang, Coly Li

From: Guoju Fang <fangguoju@gmail.com>

It's useful to dump the written block and keys on btree write; this
patch adds them to trace_bcache_btree_write.

Signed-off-by: Guoju Fang <fangguoju@gmail.com>
Signed-off-by: Coly Li <colyli@suse.de>
---
 include/trace/events/bcache.h | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/include/trace/events/bcache.h b/include/trace/events/bcache.h
index e4526f85c19d..0bddea663b3b 100644
--- a/include/trace/events/bcache.h
+++ b/include/trace/events/bcache.h
@@ -275,7 +275,8 @@ TRACE_EVENT(bcache_btree_write,
 		__entry->keys	= b->keys.set[b->keys.nsets].data->keys;
 	),
 
-	TP_printk("bucket %zu", __entry->bucket)
+	TP_printk("bucket %zu written block %u + %u",
+		__entry->bucket, __entry->block, __entry->keys)
 );
 
 DEFINE_EVENT(btree_node, bcache_btree_node_alloc,
-- 
2.16.4


^ permalink raw reply related	[flat|nested] 24+ messages in thread

* [PATCH 14/17] bcache: back to cache all readahead I/Os
  2020-01-23 17:01 [PATCH 00/17] bcache patches for Linux v5.6 colyli
                   ` (12 preceding siblings ...)
  2020-01-23 17:01 ` [PATCH 13/17] bcache: print written and keys in trace_bcache_btree_write colyli
@ 2020-01-23 17:01 ` colyli
  2020-01-23 17:19   ` Michael Lyle
  2020-01-23 17:01 ` [PATCH 15/17] bcache: remove member accessed from struct btree colyli
                   ` (2 subsequent siblings)
  16 siblings, 1 reply; 24+ messages in thread
From: colyli @ 2020-01-23 17:01 UTC (permalink / raw)
  To: axboe; +Cc: linux-bcache, linux-block, Coly Li, stable

From: Coly Li <colyli@suse.de>

In 2007 high performance SSDs were still expensive, so in order to save
more space for real workloads or metadata, readahead I/Os for
non-metadata were bypassed and not cached on the SSD.

Nowadays SSD prices have dropped a lot and people can find larger SSDs
at more comfortable prices. It is no longer necessary to bypass normal
readahead I/Os to save SSD space.

This patch removes the code which checks REQ_RAHEAD tag of bio in
check_should_bypass(), then all readahead I/Os will be cached on SSD.

NOTE: this patch still keeps the check of "REQ_META|REQ_PRIO" in
should_writeback(), because we still want to cache metadata I/Os even
when they are issued asynchronously.

Cc: stable@vger.kernel.org
Signed-off-by: Coly Li <colyli@suse.de>
Acked-by: Eric Wheeler <bcache@linux.ewheeler.net>
---
 drivers/md/bcache/request.c | 9 ---------
 1 file changed, 9 deletions(-)

diff --git a/drivers/md/bcache/request.c b/drivers/md/bcache/request.c
index 73478a91a342..acc07c4f27ae 100644
--- a/drivers/md/bcache/request.c
+++ b/drivers/md/bcache/request.c
@@ -378,15 +378,6 @@ static bool check_should_bypass(struct cached_dev *dc, struct bio *bio)
 	     op_is_write(bio_op(bio))))
 		goto skip;
 
-	/*
-	 * Flag for bypass if the IO is for read-ahead or background,
-	 * unless the read-ahead request is for metadata
-	 * (eg, for gfs2 or xfs).
-	 */
-	if (bio->bi_opf & (REQ_RAHEAD|REQ_BACKGROUND) &&
-	    !(bio->bi_opf & (REQ_META|REQ_PRIO)))
-		goto skip;
-
 	if (bio->bi_iter.bi_sector & (c->sb.block_size - 1) ||
 	    bio_sectors(bio) & (c->sb.block_size - 1)) {
 		pr_debug("skipping unaligned io");
-- 
2.16.4


^ permalink raw reply related	[flat|nested] 24+ messages in thread

* [PATCH 15/17] bcache: remove member accessed from struct btree
  2020-01-23 17:01 [PATCH 00/17] bcache patches for Linux v5.6 colyli
                   ` (13 preceding siblings ...)
  2020-01-23 17:01 ` [PATCH 14/17] bcache: back to cache all readahead I/Os colyli
@ 2020-01-23 17:01 ` colyli
  2020-01-23 17:01 ` [PATCH 16/17] bcache: reap c->btree_cache_freeable from the tail in bch_mca_scan() colyli
  2020-01-23 17:01 ` [PATCH 17/17] bcache: reap from tail of c->btree_cache " colyli
  16 siblings, 0 replies; 24+ messages in thread
From: colyli @ 2020-01-23 17:01 UTC (permalink / raw)
  To: axboe; +Cc: linux-bcache, linux-block, Coly Li

From: Coly Li <colyli@suse.de>

The member 'accessed' of struct btree is used in bch_mca_scan() when
shrinking btree node caches. The original idea is, if b->accessed is
set, clear it and look at the next btree node cache in the
c->btree_cache list, and only shrink the caches whose b->accessed is
cleared. Then only cold btree node caches will be shrunk.

But when I/O pressure is high, it is very probable that b->accessed of
a btree node cache will be set again in bch_btree_node_get() before
bch_mca_scan() selects it again. Then there is no chance for
bch_mca_scan() to shrink enough memory back to the slub or slab system.

This patch removes the member 'accessed' from struct btree; once a
btree node cache is selected, it is immediately shrunk. With this
change, bch_mca_scan() may release btree node caches more efficiently.

Signed-off-by: Coly Li <colyli@suse.de>
---
 drivers/md/bcache/btree.c | 8 ++------
 drivers/md/bcache/btree.h | 2 --
 2 files changed, 2 insertions(+), 8 deletions(-)

diff --git a/drivers/md/bcache/btree.c b/drivers/md/bcache/btree.c
index 14d6c33b0957..357535a5c89c 100644
--- a/drivers/md/bcache/btree.c
+++ b/drivers/md/bcache/btree.c
@@ -754,14 +754,12 @@ static unsigned long bch_mca_scan(struct shrinker *shrink,
 		b = list_first_entry(&c->btree_cache, struct btree, list);
 		list_rotate_left(&c->btree_cache);
 
-		if (!b->accessed &&
-		    !mca_reap(b, 0, false)) {
+		if (!mca_reap(b, 0, false)) {
 			mca_bucket_free(b);
 			mca_data_free(b);
 			rw_unlock(true, b);
 			freed++;
-		} else
-			b->accessed = 0;
+		}
 	}
 out:
 	mutex_unlock(&c->bucket_lock);
@@ -1069,7 +1067,6 @@ struct btree *bch_btree_node_get(struct cache_set *c, struct btree_op *op,
 	BUG_ON(!b->written);
 
 	b->parent = parent;
-	b->accessed = 1;
 
 	for (; i <= b->keys.nsets && b->keys.set[i].size; i++) {
 		prefetch(b->keys.set[i].tree);
@@ -1160,7 +1157,6 @@ struct btree *__bch_btree_node_alloc(struct cache_set *c, struct btree_op *op,
 		goto retry;
 	}
 
-	b->accessed = 1;
 	b->parent = parent;
 	bch_bset_init_next(&b->keys, b->keys.set->data, bset_magic(&b->c->sb));
 
diff --git a/drivers/md/bcache/btree.h b/drivers/md/bcache/btree.h
index 76cfd121a486..f4dcca449391 100644
--- a/drivers/md/bcache/btree.h
+++ b/drivers/md/bcache/btree.h
@@ -121,8 +121,6 @@ struct btree {
 	/* Key/pointer for this btree node */
 	BKEY_PADDED(key);
 
-	/* Single bit - set when accessed, cleared by shrinker */
-	unsigned long		accessed;
 	unsigned long		seq;
 	struct rw_semaphore	lock;
 	struct cache_set	*c;
-- 
2.16.4


^ permalink raw reply related	[flat|nested] 24+ messages in thread

* [PATCH 16/17] bcache: reap c->btree_cache_freeable from the tail in bch_mca_scan()
  2020-01-23 17:01 [PATCH 00/17] bcache patches for Linux v5.6 colyli
                   ` (14 preceding siblings ...)
  2020-01-23 17:01 ` [PATCH 15/17] bcache: remove member accessed from struct btree colyli
@ 2020-01-23 17:01 ` colyli
  2020-01-23 17:01 ` [PATCH 17/17] bcache: reap from tail of c->btree_cache " colyli
  16 siblings, 0 replies; 24+ messages in thread
From: colyli @ 2020-01-23 17:01 UTC (permalink / raw)
  To: axboe; +Cc: linux-bcache, linux-block, Coly Li

From: Coly Li <colyli@suse.de>

In order to skip the most recently freed btree node caches, currently
the first 3 caches in the c->btree_cache_freeable list are skipped when
shrinking bcache node caches in bch_mca_scan(). The related code in
bch_mca_scan() is,

 737 list_for_each_entry_safe(b, t, &c->btree_cache_freeable, list) {
 738         if (nr <= 0)
 739                 goto out;
 740
 741         if (++i > 3 &&
 742             !mca_reap(b, 0, false)) {
             		lines free cache memory
 746         }
 747         nr--;
 748 }

The problem is, if the virtual memory code calls bch_mca_scan() and
the calculated 'nr' is 1 or 2, then in the above loop nothing will be
shrunk. In such a case, if the slub/slab manager calls bch_mca_scan()
many times with a small scan number, it does not help to shrink cache
memory and just wastes CPU cycles.

This patch just selects btree node caches from the tail of the
c->btree_cache_freeable list, so the most recently freed caches can
still be allocated by mca_alloc(), and at least 1 node can be shrunk.

Signed-off-by: Coly Li <colyli@suse.de>
---
 drivers/md/bcache/btree.c | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/drivers/md/bcache/btree.c b/drivers/md/bcache/btree.c
index 357535a5c89c..c3a314deb09d 100644
--- a/drivers/md/bcache/btree.c
+++ b/drivers/md/bcache/btree.c
@@ -734,17 +734,17 @@ static unsigned long bch_mca_scan(struct shrinker *shrink,
 
 	i = 0;
 	btree_cache_used = c->btree_cache_used;
-	list_for_each_entry_safe(b, t, &c->btree_cache_freeable, list) {
+	list_for_each_entry_safe_reverse(b, t, &c->btree_cache_freeable, list) {
 		if (nr <= 0)
 			goto out;
 
-		if (++i > 3 &&
-		    !mca_reap(b, 0, false)) {
+		if (!mca_reap(b, 0, false)) {
 			mca_data_free(b);
 			rw_unlock(true, b);
 			freed++;
 		}
 		nr--;
+		i++;
 	}
 
 	for (;  (nr--) && i < btree_cache_used; i++) {
-- 
2.16.4


^ permalink raw reply related	[flat|nested] 24+ messages in thread

* [PATCH 17/17] bcache: reap from tail of c->btree_cache in bch_mca_scan()
  2020-01-23 17:01 [PATCH 00/17] bcache patches for Linux v5.6 colyli
                   ` (15 preceding siblings ...)
  2020-01-23 17:01 ` [PATCH 16/17] bcache: reap c->btree_cache_freeable from the tail in bch_mca_scan() colyli
@ 2020-01-23 17:01 ` colyli
  16 siblings, 0 replies; 24+ messages in thread
From: colyli @ 2020-01-23 17:01 UTC (permalink / raw)
  To: axboe; +Cc: linux-bcache, linux-block, Coly Li

From: Coly Li <colyli@suse.de>

When shrinking the btree node cache from c->btree_cache in bch_mca_scan(),
no matter whether the selected node is reaped or not, it will be rotated
from the head to the tail of the c->btree_cache list. But in the bcache
journal code, when flushing the btree nodes with the oldest journal entry,
btree nodes are iterated and selected from the tail of the c->btree_cache
list in btree_flush_write(). The list_rotate_left() in bch_mca_scan() will
make btree_flush_write() iterate more nodes in c->btree_cache in reverse
order.
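
A small hypothetical example (node names b1..b4 are invented for
illustration) of how the rotation disturbs that ordering,

	/* c->btree_cache before bch_mca_scan():          b1 -> b2 -> b3 -> b4 */
	/* after examining b1 and list_rotate_left():     b2 -> b3 -> b4 -> b1 */
	/* btree_flush_write() scans from the tail and now meets b1 before the */
	/* nodes it originally expected to find near the tail.                 */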

This patch just reaps the selected btree node cache, and does not move it
from the head to the tail of the c->btree_cache list. Then bch_mca_scan()
will not mess up the c->btree_cache list for btree_flush_write().

Signed-off-by: Coly Li <colyli@suse.de>
---
 drivers/md/bcache/btree.c | 10 +++++-----
 1 file changed, 5 insertions(+), 5 deletions(-)

diff --git a/drivers/md/bcache/btree.c b/drivers/md/bcache/btree.c
index c3a314deb09d..fa872df4e770 100644
--- a/drivers/md/bcache/btree.c
+++ b/drivers/md/bcache/btree.c
@@ -747,19 +747,19 @@ static unsigned long bch_mca_scan(struct shrinker *shrink,
 		i++;
 	}
 
-	for (;  (nr--) && i < btree_cache_used; i++) {
-		if (list_empty(&c->btree_cache))
+	list_for_each_entry_safe_reverse(b, t, &c->btree_cache, list) {
+		if (nr <= 0 || i >= btree_cache_used)
 			goto out;
 
-		b = list_first_entry(&c->btree_cache, struct btree, list);
-		list_rotate_left(&c->btree_cache);
-
 		if (!mca_reap(b, 0, false)) {
 			mca_bucket_free(b);
 			mca_data_free(b);
 			rw_unlock(true, b);
 			freed++;
 		}
+
+		nr--;
+		i++;
 	}
 out:
 	mutex_unlock(&c->bucket_lock);
-- 
2.16.4


^ permalink raw reply related	[flat|nested] 24+ messages in thread

* Re: [PATCH 14/17] bcache: back to cache all readahead I/Os
  2020-01-23 17:01 ` [PATCH 14/17] bcache: back to cache all readahead I/Os colyli
@ 2020-01-23 17:19   ` Michael Lyle
  2020-01-23 17:27     ` Coly Li
  0 siblings, 1 reply; 24+ messages in thread
From: Michael Lyle @ 2020-01-23 17:19 UTC (permalink / raw)
  To: Coly Li; +Cc: Jens Axboe, linux-bcache, linux-block, stable

Hi Coly and Jens--

One concern I have with this is that it's going to wear out
limited-lifetime SSDs a -lot- faster.  Was any thought given to making
this a tunable instead of just changing the behavior?  Even if we have
an anecdote or two that it seems to have increased performance for
some workloads, I don't expect it will have increased performance in
general and it may even be costly for some workloads (it all comes
down to what is more useful in the cache-- somewhat-recently readahead
data, or the data that it is displacing).
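
For illustration only, a minimal sketch of what such a tunable could look
like inside check_should_bypass() (the dc->cache_readahead field and its
name are assumptions made up here; the sysfs knob to set it is not shown):

	/* dc->cache_readahead: hypothetical per-device switch, default on */
	if (!dc->cache_readahead &&
	    (bio->bi_opf & (REQ_RAHEAD|REQ_BACKGROUND)) &&
	    !(bio->bi_opf & (REQ_META|REQ_PRIO)))
		goto skip;

Keeping such a flag default-on would preserve the new caching behaviour
while letting users with wear-sensitive SSDs opt back into the old bypass.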

Regards,

Mike


On Thu, Jan 23, 2020 at 9:03 AM <colyli@suse.de> wrote:
>
> From: Coly Li <colyli@suse.de>
>
> In year 2007 high performance SSDs were still expensive. In order to
> save more space for real workload or meta data, the readahead I/Os
> for non-meta data were bypassed and not cached on SSD.
>
> Nowadays, SSD prices have dropped a lot and people can find larger
> SSDs at a more comfortable price. It is unnecessary to bypass normal
> readahead I/Os to save SSD space any more.
>
> This patch removes the code which checks the REQ_RAHEAD flag of the bio
> in check_should_bypass(), then all readahead I/Os will be cached on SSD.
>
> NOTE: this patch still keeps the checking of "REQ_META|REQ_PRIO" in
> should_writeback(), because we still want to cache meta data I/Os
> even if they are asynchronous.
>
> Cc: stable@vger.kernel.org
> Signed-off-by: Coly Li <colyli@suse.de>
> Acked-by: Eric Wheeler <bcache@linux.ewheeler.net>
> ---
>  drivers/md/bcache/request.c | 9 ---------
>  1 file changed, 9 deletions(-)
>
> diff --git a/drivers/md/bcache/request.c b/drivers/md/bcache/request.c
> index 73478a91a342..acc07c4f27ae 100644
> --- a/drivers/md/bcache/request.c
> +++ b/drivers/md/bcache/request.c
> @@ -378,15 +378,6 @@ static bool check_should_bypass(struct cached_dev *dc, struct bio *bio)
>              op_is_write(bio_op(bio))))
>                 goto skip;
>
> -       /*
> -        * Flag for bypass if the IO is for read-ahead or background,
> -        * unless the read-ahead request is for metadata
> -        * (eg, for gfs2 or xfs).
> -        */
> -       if (bio->bi_opf & (REQ_RAHEAD|REQ_BACKGROUND) &&
> -           !(bio->bi_opf & (REQ_META|REQ_PRIO)))
> -               goto skip;
> -
>         if (bio->bi_iter.bi_sector & (c->sb.block_size - 1) ||
>             bio_sectors(bio) & (c->sb.block_size - 1)) {
>                 pr_debug("skipping unaligned io");
> --
> 2.16.4
>

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH 14/17] bcache: back to cache all readahead I/Os
  2020-01-23 17:19   ` Michael Lyle
@ 2020-01-23 17:27     ` Coly Li
  2020-01-23 18:31       ` Jens Axboe
  2020-01-24 16:48       ` Michael Lyle
  0 siblings, 2 replies; 24+ messages in thread
From: Coly Li @ 2020-01-23 17:27 UTC (permalink / raw)
  To: Michael Lyle; +Cc: Jens Axboe, linux-bcache, linux-block, stable

On 2020/1/24 1:19 AM, Michael Lyle wrote:
> Hi Coly and Jens--
> 
> One concern I have with this is that it's going to wear out
> limited-lifetime SSDs a -lot- faster.  Was any thought given to making
> this a tunable instead of just changing the behavior?  Even if we have
> an anecdote or two that it seems to have increased performance for
> some workloads, I don't expect it will have increased performance in
> general and it may even be costly for some workloads (it all comes
> down to what is more useful in the cache-- somewhat-recently readahead
> data, or the data that it is displacing).

Hi Mike,

Copied. This is a good suggestion. I will do it after I'm back from the
Lunar New Year vacation, and submit it with other tested patches in the
following v5.6-rc versions.

Thanks.

Coly Li

[snipped]

-- 

Coly Li

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH 14/17] bcache: back to cache all readahead I/Os
  2020-01-23 17:27     ` Coly Li
@ 2020-01-23 18:31       ` Jens Axboe
  2020-01-24  0:49         ` Coly Li
  2020-01-24 16:48       ` Michael Lyle
  1 sibling, 1 reply; 24+ messages in thread
From: Jens Axboe @ 2020-01-23 18:31 UTC (permalink / raw)
  To: Coly Li, Michael Lyle; +Cc: linux-bcache, linux-block, stable

On 1/23/20 10:27 AM, Coly Li wrote:
> On 2020/1/24 1:19 AM, Michael Lyle wrote:
>> Hi Coly and Jens--
>>
>> One concern I have with this is that it's going to wear out
>> limited-lifetime SSDs a -lot- faster.  Was any thought given to making
>> this a tunable instead of just changing the behavior?  Even if we have
>> an anecdote or two that it seems to have increased performance for
>> some workloads, I don't expect it will have increased performance in
>> general and it may even be costly for some workloads (it all comes
>> down to what is more useful in the cache-- somewhat-recently readahead
>> data, or the data that it is displacing).
> 
> Hi Mike,
> 
> Copied. This is a good suggestion. I will do it after I'm back from the
> Lunar New Year vacation, and submit it with other tested patches in the
> following v5.6-rc versions.

Do you want me to just drop this patch for now from the series?

-- 
Jens Axboe


^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH 14/17] bcache: back to cache all readahead I/Os
  2020-01-23 18:31       ` Jens Axboe
@ 2020-01-24  0:49         ` Coly Li
  2020-01-24  1:14           ` Jens Axboe
  0 siblings, 1 reply; 24+ messages in thread
From: Coly Li @ 2020-01-24  0:49 UTC (permalink / raw)
  To: Jens Axboe; +Cc: Michael Lyle, linux-bcache, linux-block, stable

On 2020/1/24 2:31 AM, Jens Axboe wrote:
> On 1/23/20 10:27 AM, Coly Li wrote:
>> On 2020/1/24 1:19 AM, Michael Lyle wrote:
>>> Hi Coly and Jens--
>>>
>>> One concern I have with this is that it's going to wear out
>>> limited-lifetime SSDs a -lot- faster.  Was any thought given to making
>>> this a tunable instead of just changing the behavior?  Even if we have
>>> an anecdote or two that it seems to have increased performance for
>>> some workloads, I don't expect it will have increased performance in
>>> general and it may even be costly for some workloads (it all comes
>>> down to what is more useful in the cache-- somewhat-recently readahead
>>> data, or the data that it is displacing).
>>
>> Hi Mike,
>>
>> Copied. This is a good suggestion. I will do it after I'm back from the
>> Lunar New Year vacation, and submit it with other tested patches in the
>> following v5.6-rc versions.
> 
> Do you want me to just drop this patch for now from the series?
> 
Hi Jens,

OK, please drop this patch from this series. I will re-submit the patch
with sysfs interface later with other patches.

Thanks.

-- 

Coly Li

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH 14/17] bcache: back to cache all readahead I/Os
  2020-01-24  0:49         ` Coly Li
@ 2020-01-24  1:14           ` Jens Axboe
  0 siblings, 0 replies; 24+ messages in thread
From: Jens Axboe @ 2020-01-24  1:14 UTC (permalink / raw)
  To: Coly Li; +Cc: Michael Lyle, linux-bcache, linux-block, stable

On 1/23/20 5:49 PM, Coly Li wrote:
> On 2020/1/24 2:31 AM, Jens Axboe wrote:
>> On 1/23/20 10:27 AM, Coly Li wrote:
>>> On 2020/1/24 1:19 AM, Michael Lyle wrote:
>>>> Hi Coly and Jens--
>>>>
>>>> One concern I have with this is that it's going to wear out
>>>> limited-lifetime SSDs a -lot- faster.  Was any thought given to making
>>>> this a tunable instead of just changing the behavior?  Even if we have
>>>> an anecdote or two that it seems to have increased performance for
>>>> some workloads, I don't expect it will have increased performance in
>>>> general and it may even be costly for some workloads (it all comes
>>>> down to what is more useful in the cache-- somewhat-recently readahead
>>>> data, or the data that it is displacing).
>>>
>>> Hi Mike,
>>>
>>> Copied. This is a good suggestion. I will do it after I'm back from the
>>> Lunar New Year vacation, and submit it with other tested patches in the
>>> following v5.6-rc versions.
>>
>> Do you want me to just drop this patch for now from the series?
>>
> Hi Jens,
> 
> OK, please drop this patch from this series. I will re-submit the patch
> with sysfs interface later with other patches.

Sounds good, I queued up the rest for 5.6.

-- 
Jens Axboe


^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH 14/17] bcache: back to cache all readahead I/Os
  2020-01-23 17:27     ` Coly Li
  2020-01-23 18:31       ` Jens Axboe
@ 2020-01-24 16:48       ` Michael Lyle
  1 sibling, 0 replies; 24+ messages in thread
From: Michael Lyle @ 2020-01-24 16:48 UTC (permalink / raw)
  To: Coly Li; +Cc: Jens Axboe, linux-bcache, linux-block, stable

Hi Coly---

Thank you for holding the patch.  I'm sorry for the late review (I was
travelling).

(We sure have a lot of settings and a lot of code dealing with them
all, which is unfortunate... but workloads / hardware used with bcache
are so varied).

Mike

On Thu, Jan 23, 2020 at 9:28 AM Coly Li <colyli@suse.de> wrote:
>
> On 2020/1/24 1:19 AM, Michael Lyle wrote:
> > Hi Coly and Jens--
> >
> > One concern I have with this is that it's going to wear out
> > limited-lifetime SSDs a -lot- faster.  Was any thought given to making
> > this a tunable instead of just changing the behavior?  Even if we have
> > an anecdote or two that it seems to have increased performance for
> > some workloads, I don't expect it will have increased performance in
> > general and it may even be costly for some workloads (it all comes
> > down to what is more useful in the cache-- somewhat-recently readahead
> > data, or the data that it is displacing).
>
> Hi Mike,
>
> Copied. This is a good suggestion. I will do it after I'm back from the
> Lunar New Year vacation, and submit it with other tested patches in the
> following v5.6-rc versions.
>
> Thanks.
>
> Coly Li
>
> [snipped]
>
> --
>
> Coly Li

^ permalink raw reply	[flat|nested] 24+ messages in thread

end of thread, other threads:[~2020-01-24 16:49 UTC | newest]

Thread overview: 24+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2020-01-23 17:01 [PATCH 00/17] bcache patches for Linux v5.6 colyli
2020-01-23 17:01 ` [PATCH 01/17] bcache: cached_dev_free needs to put the sb page colyli
2020-01-23 17:01 ` [PATCH 02/17] bcache: use a separate data structure for the on-disk super block colyli
2020-01-23 17:01 ` [PATCH 03/17] bcache: rework error unwinding in register_bcache colyli
2020-01-23 17:01 ` [PATCH 04/17] bcache: properly initialize 'path' and 'err' in register_bcache() colyli
2020-01-23 17:01 ` [PATCH 05/17] bcache: fix use-after-free " colyli
2020-01-23 17:01 ` [PATCH 06/17] bcache: transfer the sb_page reference to register_{bdev,cache} colyli
2020-01-23 17:01 ` [PATCH 07/17] bcache: return a pointer to the on-disk sb from read_super colyli
2020-01-23 17:01 ` [PATCH 08/17] bcache: store a pointer to the on-disk sb in the cache and cached_dev structures colyli
2020-01-23 17:01 ` [PATCH 09/17] bcache: use read_cache_page_gfp to read the superblock colyli
2020-01-23 17:01 ` [PATCH 10/17] lib: crc64: include <linux/crc64.h> for 'crc64_be' colyli
2020-01-23 17:01 ` [PATCH 11/17] bcache: add code comments for state->pool in __btree_sort() colyli
2020-01-23 17:01 ` [PATCH 12/17] bcache: avoid unnecessary btree nodes flushing in btree_flush_write() colyli
2020-01-23 17:01 ` [PATCH 13/17] bcache: print written and keys in trace_bcache_btree_write colyli
2020-01-23 17:01 ` [PATCH 14/17] bcache: back to cache all readahead I/Os colyli
2020-01-23 17:19   ` Michael Lyle
2020-01-23 17:27     ` Coly Li
2020-01-23 18:31       ` Jens Axboe
2020-01-24  0:49         ` Coly Li
2020-01-24  1:14           ` Jens Axboe
2020-01-24 16:48       ` Michael Lyle
2020-01-23 17:01 ` [PATCH 15/17] bcache: remove member accessed from struct btree colyli
2020-01-23 17:01 ` [PATCH 16/17] bcache: reap c->btree_cache_freeable from the tail in bch_mca_scan() colyli
2020-01-23 17:01 ` [PATCH 17/17] bcache: reap from tail of c->btree_cache " colyli
