Linux-bcache Archive on lore.kernel.org
 help / color / Atom feed
* [PATCH v3 00/16] bcache: extend bucket size to 32bit width
@ 2020-07-15 14:29 colyli
  2020-07-15 14:30 ` [PATCH v3 01/16] bcache: add read_super_common() to read major part of super block colyli
                   ` (15 more replies)
  0 siblings, 16 replies; 33+ messages in thread
From: colyli @ 2020-07-15 14:29 UTC (permalink / raw)
  To: linux-bcache; +Cc: linux-block, Coly Li

From: Coly Li <colyli@suse.de>

This patch series is an effort to extent bucket_size from 16bit to
32bit width. This is the code base for next step to use zoned device as
cache device.

In this series, these changes should be noticed,
- Increase the super block version to
        BCACHE_SB_VERSION_CDEV_WITH_FEATURES 5
        BCACHE_SB_VERSION_BDEV_WITH_FEATURES 6
- The new version adds 3 new members,
        __le64          feature_compat;
        __le64          feature_incompat;
        __le64          feature_ro_compat;
  If new feature added will change the on-disk format, just setting the
  feature set bits, no need to upstream super block version.
- The in-memory struct cache_sb does not exactly map the on-disk struct
  cache_sb_disk.
- Add 32bit bucket_size_hi in struct cache_sb_disk,
        __le32          bucket_size_hi;
  In theory bucket_size can be extended to 48bit, but now 32bit width is
  big enough.
- Other changes to follow the large bucket size.

The changes are mostly on cache device, except the super block update
there is no change about the backing device.

Any code review comments or suggestion is welcome. Thanks in advance. 

Coly Li
---
Changelog:
v3: Update with Hannes' comments of v2 series code review.
v2: several bug fixes, and add sysfs file to display cache set feature
    sets information.
v1: initial RFC version.

Coly Li (16):
  bcache: add read_super_common() to read major part of super block
  bcache: add more accurate error information in read_super_common()
  bcache: disassemble the big if() checks in bch_cache_set_alloc()
  bcache: fix super block seq numbers comparision in
    register_cache_set()
  bcache: increase super block version for cache device and backing
    device
  bcache: move bucket related code into read_super_common()
  bcache: struct cache_sb is only for in-memory super block now
  bcache: introduce meta_bucket_pages() related helper routines
  bcache: handle c->uuids properly for bucket size > 8MB
  bcache: handle cache prio_buckets and disk_buckets properly for bucket
    size > 8MB
  bcache: handle cache set verify_ondisk properly for bucket size > 8MB
  bcache: handle btree node memory allocation properly for bucket size >
    8MB
  bcache: add bucket_size_hi into struct cache_sb_disk for large bucket
  bcache: add sysfs file to display feature sets information of cache
    set
  bcache: avoid extra memory allocation from mempool c->fill_iter
  bcache: avoid extra memory consumption in struct bbio for large bucket
    size

 drivers/md/bcache/Makefile   |   2 +-
 drivers/md/bcache/alloc.c    |   2 +-
 drivers/md/bcache/bcache.h   |  29 +++-
 drivers/md/bcache/btree.c    |  12 +-
 drivers/md/bcache/features.c |  70 ++++++++++
 drivers/md/bcache/features.h |  84 +++++++++++
 drivers/md/bcache/io.c       |   2 +-
 drivers/md/bcache/movinggc.c |   4 +-
 drivers/md/bcache/super.c    | 261 ++++++++++++++++++++++++-----------
 drivers/md/bcache/sysfs.c    |   6 +
 include/uapi/linux/bcache.h  |  38 +++--
 11 files changed, 407 insertions(+), 103 deletions(-)
 create mode 100644 drivers/md/bcache/features.c
 create mode 100644 drivers/md/bcache/features.h

-- 
2.26.2


^ permalink raw reply	[flat|nested] 33+ messages in thread

* [PATCH v3 01/16] bcache: add read_super_common() to read major part of super block
  2020-07-15 14:29 [PATCH v3 00/16] bcache: extend bucket size to 32bit width colyli
@ 2020-07-15 14:30 ` colyli
  2020-07-15 14:30 ` [PATCH v3 02/16] bcache: add more accurate error information in read_super_common() colyli
                   ` (14 subsequent siblings)
  15 siblings, 0 replies; 33+ messages in thread
From: colyli @ 2020-07-15 14:30 UTC (permalink / raw)
  To: linux-bcache; +Cc: linux-block, Coly Li

From: Coly Li <colyli@suse.de>

Later patches will introduce feature set bits to on-disk super block and
increase super block version. Current code in read_super() which reads
common part of super block for version BCACHE_SB_VERSION_CDEV and version
BCACHE_SB_VERSION_CDEV_WITH_UUID will be shared with the new version.

Therefore this patch moves the reusable part into read_super_common(),
this preparation patch will make later patches more simplier and only
focus on new feature set bits.

Signed-off-by: Coly Li <colyli@suse.de>
---
 drivers/md/bcache/super.c | 111 +++++++++++++++++++++-----------------
 1 file changed, 63 insertions(+), 48 deletions(-)

diff --git a/drivers/md/bcache/super.c b/drivers/md/bcache/super.c
index 7aa1d6737f66..9fe0e4dcfd1b 100644
--- a/drivers/md/bcache/super.c
+++ b/drivers/md/bcache/super.c
@@ -59,6 +59,67 @@ struct workqueue_struct *bch_journal_wq;
 
 /* Superblock */
 
+static const char *read_super_common(struct cache_sb *sb,  struct block_device *bdev,
+				     struct cache_sb_disk *s)
+{
+	const char *err;
+	unsigned int i;
+
+	sb->nbuckets	= le64_to_cpu(s->nbuckets);
+	sb->bucket_size	= le16_to_cpu(s->bucket_size);
+
+	sb->nr_in_set	= le16_to_cpu(s->nr_in_set);
+	sb->nr_this_dev	= le16_to_cpu(s->nr_this_dev);
+
+	err = "Too many buckets";
+	if (sb->nbuckets > LONG_MAX)
+		goto err;
+
+	err = "Not enough buckets";
+	if (sb->nbuckets < 1 << 7)
+		goto err;
+
+	err = "Bad block/bucket size";
+	if (!is_power_of_2(sb->block_size) ||
+	    sb->block_size > PAGE_SECTORS ||
+	    !is_power_of_2(sb->bucket_size) ||
+	    sb->bucket_size < PAGE_SECTORS)
+		goto err;
+
+	err = "Invalid superblock: device too small";
+	if (get_capacity(bdev->bd_disk) <
+	    sb->bucket_size * sb->nbuckets)
+		goto err;
+
+	err = "Bad UUID";
+	if (bch_is_zero(sb->set_uuid, 16))
+		goto err;
+
+	err = "Bad cache device number in set";
+	if (!sb->nr_in_set ||
+	    sb->nr_in_set <= sb->nr_this_dev ||
+	    sb->nr_in_set > MAX_CACHES_PER_SET)
+		goto err;
+
+	err = "Journal buckets not sequential";
+	for (i = 0; i < sb->keys; i++)
+		if (sb->d[i] != sb->first_bucket + i)
+			goto err;
+
+	err = "Too many journal buckets";
+	if (sb->first_bucket + sb->keys > sb->nbuckets)
+		goto err;
+
+	err = "Invalid superblock: first bucket comes before end of super";
+	if (sb->first_bucket * sb->bucket_size < 16)
+		goto err;
+
+	err = NULL;
+err:
+	return err;
+}
+
+
 static const char *read_super(struct cache_sb *sb, struct block_device *bdev,
 			      struct cache_sb_disk **res)
 {
@@ -133,55 +194,9 @@ static const char *read_super(struct cache_sb *sb, struct block_device *bdev,
 		break;
 	case BCACHE_SB_VERSION_CDEV:
 	case BCACHE_SB_VERSION_CDEV_WITH_UUID:
-		sb->nbuckets	= le64_to_cpu(s->nbuckets);
-		sb->bucket_size	= le16_to_cpu(s->bucket_size);
-
-		sb->nr_in_set	= le16_to_cpu(s->nr_in_set);
-		sb->nr_this_dev	= le16_to_cpu(s->nr_this_dev);
-
-		err = "Too many buckets";
-		if (sb->nbuckets > LONG_MAX)
-			goto err;
-
-		err = "Not enough buckets";
-		if (sb->nbuckets < 1 << 7)
-			goto err;
-
-		err = "Bad block/bucket size";
-		if (!is_power_of_2(sb->block_size) ||
-		    sb->block_size > PAGE_SECTORS ||
-		    !is_power_of_2(sb->bucket_size) ||
-		    sb->bucket_size < PAGE_SECTORS)
-			goto err;
-
-		err = "Invalid superblock: device too small";
-		if (get_capacity(bdev->bd_disk) <
-		    sb->bucket_size * sb->nbuckets)
-			goto err;
-
-		err = "Bad UUID";
-		if (bch_is_zero(sb->set_uuid, 16))
-			goto err;
-
-		err = "Bad cache device number in set";
-		if (!sb->nr_in_set ||
-		    sb->nr_in_set <= sb->nr_this_dev ||
-		    sb->nr_in_set > MAX_CACHES_PER_SET)
-			goto err;
-
-		err = "Journal buckets not sequential";
-		for (i = 0; i < sb->keys; i++)
-			if (sb->d[i] != sb->first_bucket + i)
-				goto err;
-
-		err = "Too many journal buckets";
-		if (sb->first_bucket + sb->keys > sb->nbuckets)
-			goto err;
-
-		err = "Invalid superblock: first bucket comes before end of super";
-		if (sb->first_bucket * sb->bucket_size < 16)
+		err = read_super_common(sb, bdev, s);
+		if (err)
 			goto err;
-
 		break;
 	default:
 		err = "Unsupported superblock version";
-- 
2.26.2


^ permalink raw reply	[flat|nested] 33+ messages in thread

* [PATCH v3 02/16] bcache: add more accurate error information in read_super_common()
  2020-07-15 14:29 [PATCH v3 00/16] bcache: extend bucket size to 32bit width colyli
  2020-07-15 14:30 ` [PATCH v3 01/16] bcache: add read_super_common() to read major part of super block colyli
@ 2020-07-15 14:30 ` colyli
  2020-07-15 14:30 ` [PATCH v3 03/16] bcache: disassemble the big if() checks in bch_cache_set_alloc() colyli
                   ` (13 subsequent siblings)
  15 siblings, 0 replies; 33+ messages in thread
From: colyli @ 2020-07-15 14:30 UTC (permalink / raw)
  To: linux-bcache; +Cc: linux-block, Coly Li, Hannes Reinecke

From: Coly Li <colyli@suse.de>

The improperly set bucket or block size will trigger error in
read_super_common(). For large bucket size, a more accurate error message
for invalid bucket or block size is necessary.

This patch disassembles the combined if() checks into multiple single
if() check, and provide more accurate error message for each check
failure condition.

Signed-off-by: Coly Li <colyli@suse.de>
Reviewed-by: Hannes Reinecke <hare@suse.de>
---
 drivers/md/bcache/super.c | 19 ++++++++++++++-----
 1 file changed, 14 insertions(+), 5 deletions(-)

diff --git a/drivers/md/bcache/super.c b/drivers/md/bcache/super.c
index 9fe0e4dcfd1b..e162941122d4 100644
--- a/drivers/md/bcache/super.c
+++ b/drivers/md/bcache/super.c
@@ -79,11 +79,20 @@ static const char *read_super_common(struct cache_sb *sb,  struct block_device *
 	if (sb->nbuckets < 1 << 7)
 		goto err;
 
-	err = "Bad block/bucket size";
-	if (!is_power_of_2(sb->block_size) ||
-	    sb->block_size > PAGE_SECTORS ||
-	    !is_power_of_2(sb->bucket_size) ||
-	    sb->bucket_size < PAGE_SECTORS)
+	err = "Bad block size (not power of 2)";
+	if (!is_power_of_2(sb->block_size))
+		goto err;
+
+	err = "Bad block size (larger than page size)";
+	if (sb->block_size > PAGE_SECTORS)
+		goto err;
+
+	err = "Bad bucket size (not power of 2)";
+	if (!is_power_of_2(sb->bucket_size))
+		goto err;
+
+	err = "Bad bucket size (smaller than page size)";
+	if (sb->bucket_size < PAGE_SECTORS)
 		goto err;
 
 	err = "Invalid superblock: device too small";
-- 
2.26.2


^ permalink raw reply	[flat|nested] 33+ messages in thread

* [PATCH v3 03/16] bcache: disassemble the big if() checks in bch_cache_set_alloc()
  2020-07-15 14:29 [PATCH v3 00/16] bcache: extend bucket size to 32bit width colyli
  2020-07-15 14:30 ` [PATCH v3 01/16] bcache: add read_super_common() to read major part of super block colyli
  2020-07-15 14:30 ` [PATCH v3 02/16] bcache: add more accurate error information in read_super_common() colyli
@ 2020-07-15 14:30 ` colyli
  2020-07-15 14:30 ` [PATCH v3 04/16] bcache: fix super block seq numbers comparision in register_cache_set() colyli
                   ` (12 subsequent siblings)
  15 siblings, 0 replies; 33+ messages in thread
From: colyli @ 2020-07-15 14:30 UTC (permalink / raw)
  To: linux-bcache; +Cc: linux-block, Coly Li, Hannes Reinecke

From: Coly Li <colyli@suse.de>

In bch_cache_set_alloc() there is a big if() checks combined by 11 items
together. When this big if() statement fails, it is difficult to tell
exactly which item fails indeed.

This patch disassembles this big if() checks into 11 single if() checks,
which makes code debug more easier.

Signed-off-by: Coly Li <colyli@suse.de>
Reviewed-by: Hannes Reinecke <hare@suse.de>
---
 drivers/md/bcache/super.c | 52 ++++++++++++++++++++++++++++-----------
 1 file changed, 37 insertions(+), 15 deletions(-)

diff --git a/drivers/md/bcache/super.c b/drivers/md/bcache/super.c
index e162941122d4..acee8163f38a 100644
--- a/drivers/md/bcache/super.c
+++ b/drivers/md/bcache/super.c
@@ -1865,21 +1865,43 @@ struct cache_set *bch_cache_set_alloc(struct cache_sb *sb)
 	iter_size = (sb->bucket_size / sb->block_size + 1) *
 		sizeof(struct btree_iter_set);
 
-	if (!(c->devices = kcalloc(c->nr_uuids, sizeof(void *), GFP_KERNEL)) ||
-	    mempool_init_slab_pool(&c->search, 32, bch_search_cache) ||
-	    mempool_init_kmalloc_pool(&c->bio_meta, 2,
-				sizeof(struct bbio) + sizeof(struct bio_vec) *
-				bucket_pages(c)) ||
-	    mempool_init_kmalloc_pool(&c->fill_iter, 1, iter_size) ||
-	    bioset_init(&c->bio_split, 4, offsetof(struct bbio, bio),
-			BIOSET_NEED_BVECS|BIOSET_NEED_RESCUER) ||
-	    !(c->uuids = alloc_bucket_pages(GFP_KERNEL, c)) ||
-	    !(c->moving_gc_wq = alloc_workqueue("bcache_gc",
-						WQ_MEM_RECLAIM, 0)) ||
-	    bch_journal_alloc(c) ||
-	    bch_btree_cache_alloc(c) ||
-	    bch_open_buckets_alloc(c) ||
-	    bch_bset_sort_state_init(&c->sort, ilog2(c->btree_pages)))
+	c->devices = kcalloc(c->nr_uuids, sizeof(void *), GFP_KERNEL);
+	if (!c->devices)
+		goto err;
+
+	if (mempool_init_slab_pool(&c->search, 32, bch_search_cache))
+		goto err;
+
+	if (mempool_init_kmalloc_pool(&c->bio_meta, 2,
+			sizeof(struct bbio) +
+			sizeof(struct bio_vec) * bucket_pages(c)))
+		goto err;
+
+	if (mempool_init_kmalloc_pool(&c->fill_iter, 1, iter_size))
+		goto err;
+
+	if (bioset_init(&c->bio_split, 4, offsetof(struct bbio, bio),
+			BIOSET_NEED_BVECS|BIOSET_NEED_RESCUER))
+		goto err;
+
+	c->uuids = alloc_bucket_pages(GFP_KERNEL, c);
+	if (!c->uuids)
+		goto err;
+
+	c->moving_gc_wq = alloc_workqueue("bcache_gc", WQ_MEM_RECLAIM, 0);
+	if (!c->moving_gc_wq)
+		goto err;
+
+	if (bch_journal_alloc(c))
+		goto err;
+
+	if (bch_btree_cache_alloc(c))
+		goto err;
+
+	if (bch_open_buckets_alloc(c))
+		goto err;
+
+	if (bch_bset_sort_state_init(&c->sort, ilog2(c->btree_pages)))
 		goto err;
 
 	c->congested_read_threshold_us	= 2000;
-- 
2.26.2


^ permalink raw reply	[flat|nested] 33+ messages in thread

* [PATCH v3 04/16] bcache: fix super block seq numbers comparision in register_cache_set()
  2020-07-15 14:29 [PATCH v3 00/16] bcache: extend bucket size to 32bit width colyli
                   ` (2 preceding siblings ...)
  2020-07-15 14:30 ` [PATCH v3 03/16] bcache: disassemble the big if() checks in bch_cache_set_alloc() colyli
@ 2020-07-15 14:30 ` colyli
  2020-07-15 14:30 ` [PATCH v3 05/16] bcache: increase super block version for cache device and backing device colyli
                   ` (11 subsequent siblings)
  15 siblings, 0 replies; 33+ messages in thread
From: colyli @ 2020-07-15 14:30 UTC (permalink / raw)
  To: linux-bcache; +Cc: linux-block, Coly Li, Hannes Reinecke

From: Coly Li <colyli@suse.de>

In register_cache_set(), c is pointer to struct cache_set, and ca is
pointer to struct cache, if ca->sb.seq > c->sb.seq, it means this
registering cache has up to date version and other members, the in-
memory version and other members should be updated to the newer value.

But current implementation makes a cache set only has a single cache
device, so the above assumption works well except for a special case.
The execption is when a cache device new created and both ca->sb.seq and
c->sb.seq are 0, because the super block is never flushed out yet. In
the location for the following if() check,
2156         if (ca->sb.seq > c->sb.seq) {
2157                 c->sb.version           = ca->sb.version;
2158                 memcpy(c->sb.set_uuid, ca->sb.set_uuid, 16);
2159                 c->sb.flags             = ca->sb.flags;
2160                 c->sb.seq               = ca->sb.seq;
2161                 pr_debug("set version = %llu\n", c->sb.version);
2162         }
c->sb.version is not initialized yet and valued 0. When ca->sb.seq is 0,
the if() check will fail (because both values are 0), and the cache set
version, set_uuid, flags and seq won't be updated.

The above problem is hiden for current code, because the bucket size is
compatible among different super block version. And the next time when
running cache set again, ca->sb.seq will be larger than 0 and cache set
super block version will be updated properly.

But if the large bucket feature is enabled,  sb->bucket_size is the low
16bits of the bucket size. For a power of 2 value, when the actual
bucket size exceeds 16bit width, sb->bucket_size will always be 0. Then
read_super_common() will fail because the if() check to
is_power_of_2(sb->bucket_size) is false. This is how the long time
hidden bug is triggered.

This patch modifies the if() check to the following way,
2156         if (ca->sb.seq > c->sb.seq || c->sb.seq == 0) {
Then cache set's version, set_uuid, flags and seq will always be updated
corectly including for a new created cache device.

Signed-off-by: Coly Li <colyli@suse.de>
Reviewed-by: Hannes Reinecke <hare@suse.de>
---
 drivers/md/bcache/super.c | 9 ++++++++-
 1 file changed, 8 insertions(+), 1 deletion(-)

diff --git a/drivers/md/bcache/super.c b/drivers/md/bcache/super.c
index acee8163f38a..332e1a640f84 100644
--- a/drivers/md/bcache/super.c
+++ b/drivers/md/bcache/super.c
@@ -2146,7 +2146,14 @@ static const char *register_cache_set(struct cache *ca)
 	    sysfs_create_link(&c->kobj, &ca->kobj, buf))
 		goto err;
 
-	if (ca->sb.seq > c->sb.seq) {
+	/*
+	 * A special case is both ca->sb.seq and c->sb.seq are 0,
+	 * such condition happens on a new created cache device whose
+	 * super block is never flushed yet. In this case c->sb.version
+	 * and other members should be updated too, otherwise we will
+	 * have a mistaken super block version in cache set.
+	 */
+	if (ca->sb.seq > c->sb.seq || c->sb.seq == 0) {
 		c->sb.version		= ca->sb.version;
 		memcpy(c->sb.set_uuid, ca->sb.set_uuid, 16);
 		c->sb.flags             = ca->sb.flags;
-- 
2.26.2


^ permalink raw reply	[flat|nested] 33+ messages in thread

* [PATCH v3 05/16] bcache: increase super block version for cache device and backing device
  2020-07-15 14:29 [PATCH v3 00/16] bcache: extend bucket size to 32bit width colyli
                   ` (3 preceding siblings ...)
  2020-07-15 14:30 ` [PATCH v3 04/16] bcache: fix super block seq numbers comparision in register_cache_set() colyli
@ 2020-07-15 14:30 ` colyli
  2020-07-15 14:30 ` [PATCH v3 06/16] bcache: move bucket related code into read_super_common() colyli
                   ` (10 subsequent siblings)
  15 siblings, 0 replies; 33+ messages in thread
From: colyli @ 2020-07-15 14:30 UTC (permalink / raw)
  To: linux-bcache; +Cc: linux-block, Coly Li, Hannes Reinecke

From: Coly Li <colyli@suse.de>

The new added super block version BCACHE_SB_VERSION_BDEV_WITH_FEATURES
(5) BCACHE_SB_VERSION_CDEV_WITH_FEATURES value (6), is for the feature
set bits.

Devices have super block version equal to the new version will have
three new members for feature set bits in the on-disk super block,
        __le64                  feature_compat;
        __le64                  feature_incompat;
        __le64                  feature_ro_compat;

They are used for further new features which may introduce on-disk
format change, and avoid unncessary super block version increase.

The very basic features handling code skeleton is also initialized in
this patch.

Signed-off-by: Coly Li <colyli@suse.de>
Reviewed-by: Hannes Reinecke <hare@suse.de>
---
 drivers/md/bcache/features.h | 78 ++++++++++++++++++++++++++++++++++++
 drivers/md/bcache/super.c    | 32 +++++++++++++--
 include/uapi/linux/bcache.h  | 29 ++++++++++----
 3 files changed, 128 insertions(+), 11 deletions(-)
 create mode 100644 drivers/md/bcache/features.h

diff --git a/drivers/md/bcache/features.h b/drivers/md/bcache/features.h
new file mode 100644
index 000000000000..ae7df37b9862
--- /dev/null
+++ b/drivers/md/bcache/features.h
@@ -0,0 +1,78 @@
+/* SPDX-License-Identifier: GPL-2.0 WITH Linux-syscall-note */
+#ifndef _BCACHE_FEATURES_H
+#define _BCACHE_FEATURES_H
+
+#include <linux/bcache.h>
+#include <linux/kernel.h>
+#include <linux/types.h>
+
+#define BCH_FEATURE_COMPAT		0
+#define BCH_FEATURE_RO_COMPAT		1
+#define BCH_FEATURE_INCOMPAT		2
+#define BCH_FEATURE_TYPE_MASK		0x03
+
+#define BCH_FEATURE_COMPAT_SUUP		0
+#define BCH_FEATURE_RO_COMPAT_SUUP	0
+#define BCH_FEATURE_INCOMPAT_SUUP	0
+
+#define BCH_HAS_COMPAT_FEATURE(sb, mask) \
+		((sb)->feature_compat & (mask))
+#define BCH_HAS_RO_COMPAT_FEATURE(sb, mask) \
+		((sb)->feature_ro_compat & (mask))
+#define BCH_HAS_INCOMPAT_FEATURE(sb, mask) \
+		((sb)->feature_incompat & (mask))
+
+/* Feature set definition */
+
+#define BCH_FEATURE_COMPAT_FUNCS(name, flagname) \
+static inline int bch_has_feature_##name(struct cache_sb *sb) \
+{ \
+	return (((sb)->feature_compat & \
+		BCH##_FEATURE_COMPAT_##flagname) != 0); \
+} \
+static inline void bch_set_feature_##name(struct cache_sb *sb) \
+{ \
+	(sb)->feature_compat |= \
+		BCH##_FEATURE_COMPAT_##flagname; \
+} \
+static inline void bch_clear_feature_##name(struct cache_sb *sb) \
+{ \
+	(sb)->feature_compat &= \
+		~BCH##_FEATURE_COMPAT_##flagname; \
+}
+
+#define BCH_FEATURE_RO_COMPAT_FUNCS(name, flagname) \
+static inline int bch_has_feature_##name(struct cache_sb *sb) \
+{ \
+	return (((sb)->feature_ro_compat & \
+		BCH##_FEATURE_RO_COMPAT_##flagname) != 0); \
+} \
+static inline void bch_set_feature_##name(struct cache_sb *sb) \
+{ \
+	(sb)->feature_ro_compat |= \
+		BCH##_FEATURE_RO_COMPAT_##flagname; \
+} \
+static inline void bch_clear_feature_##name(struct cache_sb *sb) \
+{ \
+	(sb)->feature_ro_compat &= \
+		~BCH##_FEATURE_RO_COMPAT_##flagname; \
+}
+
+#define BCH_FEATURE_INCOMPAT_FUNCS(name, flagname) \
+static inline int bch_has_feature_##name(struct cache_sb *sb) \
+{ \
+	return (((sb)->feature_incompat & \
+		BCH##_FEATURE_INCOMPAT_##flagname) != 0); \
+} \
+static inline void bch_set_feature_##name(struct cache_sb *sb) \
+{ \
+	(sb)->feature_incompat |= \
+		BCH##_FEATURE_INCOMPAT_##flagname; \
+} \
+static inline void bch_clear_feature_##name(struct cache_sb *sb) \
+{ \
+	(sb)->feature_incompat &= \
+		~BCH##_FEATURE_INCOMPAT_##flagname; \
+}
+
+#endif
diff --git a/drivers/md/bcache/super.c b/drivers/md/bcache/super.c
index 332e1a640f84..9f273bd58507 100644
--- a/drivers/md/bcache/super.c
+++ b/drivers/md/bcache/super.c
@@ -13,6 +13,7 @@
 #include "extents.h"
 #include "request.h"
 #include "writeback.h"
+#include "features.h"
 
 #include <linux/blkdev.h>
 #include <linux/debugfs.h>
@@ -194,6 +195,7 @@ static const char *read_super(struct cache_sb *sb, struct block_device *bdev,
 		sb->data_offset	= BDEV_DATA_START_DEFAULT;
 		break;
 	case BCACHE_SB_VERSION_BDEV_WITH_OFFSET:
+	case BCACHE_SB_VERSION_BDEV_WITH_FEATURES:
 		sb->data_offset	= le64_to_cpu(s->data_offset);
 
 		err = "Bad data offset";
@@ -207,6 +209,14 @@ static const char *read_super(struct cache_sb *sb, struct block_device *bdev,
 		if (err)
 			goto err;
 		break;
+	case BCACHE_SB_VERSION_CDEV_WITH_FEATURES:
+		err = read_super_common(sb, bdev, s);
+		if (err)
+			goto err;
+		sb->feature_compat = le64_to_cpu(s->feature_compat);
+		sb->feature_incompat = le64_to_cpu(s->feature_incompat);
+		sb->feature_ro_compat = le64_to_cpu(s->feature_ro_compat);
+		break;
 	default:
 		err = "Unsupported superblock version";
 		goto err;
@@ -241,7 +251,6 @@ static void __write_super(struct cache_sb *sb, struct cache_sb_disk *out,
 			offset_in_page(out));
 
 	out->offset		= cpu_to_le64(sb->offset);
-	out->version		= cpu_to_le64(sb->version);
 
 	memcpy(out->uuid,	sb->uuid, 16);
 	memcpy(out->set_uuid,	sb->set_uuid, 16);
@@ -257,6 +266,13 @@ static void __write_super(struct cache_sb *sb, struct cache_sb_disk *out,
 	for (i = 0; i < sb->keys; i++)
 		out->d[i] = cpu_to_le64(sb->d[i]);
 
+	if (sb->version >= BCACHE_SB_VERSION_CDEV_WITH_FEATURES) {
+		out->feature_compat    = cpu_to_le64(sb->feature_compat);
+		out->feature_incompat  = cpu_to_le64(sb->feature_incompat);
+		out->feature_ro_compat = cpu_to_le64(sb->feature_ro_compat);
+	}
+
+	out->version		= cpu_to_le64(sb->version);
 	out->csum = csum_set(out);
 
 	pr_debug("ver %llu, flags %llu, seq %llu\n",
@@ -313,17 +329,20 @@ void bcache_write_super(struct cache_set *c)
 {
 	struct closure *cl = &c->sb_write;
 	struct cache *ca;
-	unsigned int i;
+	unsigned int i, version = BCACHE_SB_VERSION_CDEV_WITH_UUID;
 
 	down(&c->sb_write_mutex);
 	closure_init(cl, &c->cl);
 
 	c->sb.seq++;
 
+	if (c->sb.version > version)
+		version = c->sb.version;
+
 	for_each_cache(ca, c, i) {
 		struct bio *bio = &ca->sb_bio;
 
-		ca->sb.version		= BCACHE_SB_VERSION_CDEV_WITH_UUID;
+		ca->sb.version		= version;
 		ca->sb.seq		= c->sb.seq;
 		ca->sb.last_mount	= c->sb.last_mount;
 
@@ -1831,6 +1850,13 @@ struct cache_set *bch_cache_set_alloc(struct cache_sb *sb)
 	c->sb.bucket_size	= sb->bucket_size;
 	c->sb.nr_in_set		= sb->nr_in_set;
 	c->sb.last_mount	= sb->last_mount;
+	c->sb.version		= sb->version;
+	if (c->sb.version >= BCACHE_SB_VERSION_CDEV_WITH_FEATURES) {
+		c->sb.feature_compat = sb->feature_compat;
+		c->sb.feature_ro_compat = sb->feature_ro_compat;
+		c->sb.feature_incompat = sb->feature_incompat;
+	}
+
 	c->bucket_bits		= ilog2(sb->bucket_size);
 	c->block_bits		= ilog2(sb->block_size);
 	c->nr_uuids		= bucket_bytes(c) / sizeof(struct uuid_entry);
diff --git a/include/uapi/linux/bcache.h b/include/uapi/linux/bcache.h
index 9a1965c6c3d0..47df2db2e727 100644
--- a/include/uapi/linux/bcache.h
+++ b/include/uapi/linux/bcache.h
@@ -141,11 +141,13 @@ static inline struct bkey *bkey_idx(const struct bkey *k, unsigned int nr_keys)
  * Version 3: Cache device with new UUID format
  * Version 4: Backing device with data offset
  */
-#define BCACHE_SB_VERSION_CDEV		0
-#define BCACHE_SB_VERSION_BDEV		1
-#define BCACHE_SB_VERSION_CDEV_WITH_UUID 3
-#define BCACHE_SB_VERSION_BDEV_WITH_OFFSET 4
-#define BCACHE_SB_MAX_VERSION		4
+#define BCACHE_SB_VERSION_CDEV			0
+#define BCACHE_SB_VERSION_BDEV			1
+#define BCACHE_SB_VERSION_CDEV_WITH_UUID	3
+#define BCACHE_SB_VERSION_BDEV_WITH_OFFSET	4
+#define BCACHE_SB_VERSION_CDEV_WITH_FEATURES	5
+#define BCACHE_SB_VERSION_BDEV_WITH_FEATURES	6
+#define BCACHE_SB_MAX_VERSION			6
 
 #define SB_SECTOR			8
 #define SB_OFFSET			(SB_SECTOR << SECTOR_SHIFT)
@@ -173,7 +175,12 @@ struct cache_sb_disk {
 
 	__le64			flags;
 	__le64			seq;
-	__le64			pad[8];
+
+	__le64			feature_compat;
+	__le64			feature_incompat;
+	__le64			feature_ro_compat;
+
+	__le64			pad[5];
 
 	union {
 	struct {
@@ -224,7 +231,12 @@ struct cache_sb {
 
 	__u64			flags;
 	__u64			seq;
-	__u64			pad[8];
+
+	__u64			feature_compat;
+	__u64			feature_incompat;
+	__u64			feature_ro_compat;
+
+	__u64			pad[5];
 
 	union {
 	struct {
@@ -262,7 +274,8 @@ struct cache_sb {
 static inline _Bool SB_IS_BDEV(const struct cache_sb *sb)
 {
 	return sb->version == BCACHE_SB_VERSION_BDEV
-		|| sb->version == BCACHE_SB_VERSION_BDEV_WITH_OFFSET;
+		|| sb->version == BCACHE_SB_VERSION_BDEV_WITH_OFFSET
+		|| sb->version == BCACHE_SB_VERSION_BDEV_WITH_FEATURES;
 }
 
 BITMASK(CACHE_SYNC,			struct cache_sb, flags, 0, 1);
-- 
2.26.2


^ permalink raw reply	[flat|nested] 33+ messages in thread

* [PATCH v3 06/16] bcache: move bucket related code into read_super_common()
  2020-07-15 14:29 [PATCH v3 00/16] bcache: extend bucket size to 32bit width colyli
                   ` (4 preceding siblings ...)
  2020-07-15 14:30 ` [PATCH v3 05/16] bcache: increase super block version for cache device and backing device colyli
@ 2020-07-15 14:30 ` colyli
  2020-07-15 14:30 ` [PATCH v3 07/16] bcache: struct cache_sb is only for in-memory super block now colyli
                   ` (9 subsequent siblings)
  15 siblings, 0 replies; 33+ messages in thread
From: colyli @ 2020-07-15 14:30 UTC (permalink / raw)
  To: linux-bcache; +Cc: linux-block, Coly Li, Hannes Reinecke

From: Coly Li <colyli@suse.de>

Setting sb->first_bucket and checking sb->keys indeed are only for cache
device, it does not make sense to do them in read_super() for backing
device too.

This patch moves the related code piece into read_super_common()
explicitly for cache device and avoid the confusion.

Signed-off-by: Coly Li <colyli@suse.de>
Reviewed-by: Hannes Reinecke <hare@suse.de>
---
 drivers/md/bcache/super.c | 10 +++++-----
 1 file changed, 5 insertions(+), 5 deletions(-)

diff --git a/drivers/md/bcache/super.c b/drivers/md/bcache/super.c
index 9f273bd58507..e8a505ef766d 100644
--- a/drivers/md/bcache/super.c
+++ b/drivers/md/bcache/super.c
@@ -66,12 +66,17 @@ static const char *read_super_common(struct cache_sb *sb,  struct block_device *
 	const char *err;
 	unsigned int i;
 
+	sb->first_bucket= le16_to_cpu(s->first_bucket);
 	sb->nbuckets	= le64_to_cpu(s->nbuckets);
 	sb->bucket_size	= le16_to_cpu(s->bucket_size);
 
 	sb->nr_in_set	= le16_to_cpu(s->nr_in_set);
 	sb->nr_this_dev	= le16_to_cpu(s->nr_this_dev);
 
+	err = "Too many journal buckets";
+	if (sb->keys > SB_JOURNAL_BUCKETS)
+		goto err;
+
 	err = "Too many buckets";
 	if (sb->nbuckets > LONG_MAX)
 		goto err;
@@ -155,7 +160,6 @@ static const char *read_super(struct cache_sb *sb, struct block_device *bdev,
 	sb->flags		= le64_to_cpu(s->flags);
 	sb->seq			= le64_to_cpu(s->seq);
 	sb->last_mount		= le32_to_cpu(s->last_mount);
-	sb->first_bucket	= le16_to_cpu(s->first_bucket);
 	sb->keys		= le16_to_cpu(s->keys);
 
 	for (i = 0; i < SB_JOURNAL_BUCKETS; i++)
@@ -172,10 +176,6 @@ static const char *read_super(struct cache_sb *sb, struct block_device *bdev,
 	if (memcmp(sb->magic, bcache_magic, 16))
 		goto err;
 
-	err = "Too many journal buckets";
-	if (sb->keys > SB_JOURNAL_BUCKETS)
-		goto err;
-
 	err = "Bad checksum";
 	if (s->csum != csum_set(s))
 		goto err;
-- 
2.26.2


^ permalink raw reply	[flat|nested] 33+ messages in thread

* [PATCH v3 07/16] bcache: struct cache_sb is only for in-memory super block now
  2020-07-15 14:29 [PATCH v3 00/16] bcache: extend bucket size to 32bit width colyli
                   ` (5 preceding siblings ...)
  2020-07-15 14:30 ` [PATCH v3 06/16] bcache: move bucket related code into read_super_common() colyli
@ 2020-07-15 14:30 ` colyli
  2020-07-15 18:21   ` Christoph Hellwig
  2020-07-15 14:30 ` [PATCH v3 08/16] bcache: introduce meta_bucket_pages() related helper routines colyli
                   ` (8 subsequent siblings)
  15 siblings, 1 reply; 33+ messages in thread
From: colyli @ 2020-07-15 14:30 UTC (permalink / raw)
  To: linux-bcache; +Cc: linux-block, Coly Li, Hannes Reinecke

From: Coly Li <colyli@suse.de>

We have struct cache_sb_disk for on-disk super block already, it is
unnecessary to keep the in-memory super block format exactly mapping
to the on-disk struct layout.

This patch adds code comments to notice that struct cache_sb is not
exactly mapping to cache_sb_disk, and removes the useless member csum
and pad[5].

Although struct cache_sb does not belong to uapi, but there are still
some on-disk format related macros reference it and it is unncessary to
get rid of such dependency now. So struct cache_sb will continue to stay
in include/uapi/linux/bache.h for now.

Signed-off-by: Coly Li <colyli@suse.de>
Reviewed-by: Hannes Reinecke <hare@suse.de>
---
 include/uapi/linux/bcache.h | 10 ++++++----
 1 file changed, 6 insertions(+), 4 deletions(-)

diff --git a/include/uapi/linux/bcache.h b/include/uapi/linux/bcache.h
index 47df2db2e727..0ef984ea515a 100644
--- a/include/uapi/linux/bcache.h
+++ b/include/uapi/linux/bcache.h
@@ -215,8 +215,13 @@ struct cache_sb_disk {
 	__le64			d[SB_JOURNAL_BUCKETS];	/* journal buckets */
 };
 
+/*
+ * This is for in-memory bcache super block.
+ * NOTE: cache_sb is NOT exactly mapping to cache_sb_disk, the member
+ *       size, ordering and even whole struct size may be different
+ *       from cache_sb_disk.
+ */
 struct cache_sb {
-	__u64			csum;
 	__u64			offset;	/* sector where this sb was written */
 	__u64			version;
 
@@ -236,8 +241,6 @@ struct cache_sb {
 	__u64			feature_incompat;
 	__u64			feature_ro_compat;
 
-	__u64			pad[5];
-
 	union {
 	struct {
 		/* Cache devices */
@@ -245,7 +248,6 @@ struct cache_sb {
 
 		__u16		block_size;	/* sectors */
 		__u16		bucket_size;	/* sectors */
-
 		__u16		nr_in_set;
 		__u16		nr_this_dev;
 	};
-- 
2.26.2


^ permalink raw reply	[flat|nested] 33+ messages in thread

* [PATCH v3 08/16] bcache: introduce meta_bucket_pages() related helper routines
  2020-07-15 14:29 [PATCH v3 00/16] bcache: extend bucket size to 32bit width colyli
                   ` (6 preceding siblings ...)
  2020-07-15 14:30 ` [PATCH v3 07/16] bcache: struct cache_sb is only for in-memory super block now colyli
@ 2020-07-15 14:30 ` colyli
  2020-07-15 15:36   ` Hannes Reinecke
  2020-07-15 14:30 ` [PATCH v3 09/16] bcache: handle c->uuids properly for bucket size > 8MB colyli
                   ` (7 subsequent siblings)
  15 siblings, 1 reply; 33+ messages in thread
From: colyli @ 2020-07-15 14:30 UTC (permalink / raw)
  To: linux-bcache; +Cc: linux-block, Coly Li

From: Coly Li <colyli@suse.de>

Currently the in-memory meta data like c->uuids or c->disk_buckets
are allocated by alloc_bucket_pages(). The macro alloc_bucket_pages()
calls __get_free_pages() to allocated continuous pages with order
indicated by ilog2(bucket_pages(c)),
 #define alloc_bucket_pages(gfp, c)                      \
     ((void *) __get_free_pages(__GFP_ZERO|gfp, ilog2(bucket_pages(c))))

The maximum order is defined as MAX_ORDER, the default value is 11 (and
can be overwritten by CONFIG_FORCE_MAX_ZONEORDER). In bcache code the
maximum bucket size width is 16bits, this is restricted both by KEY_SIZE
size and bucket_size size from struct cache_sb_disk. The maximum 16bits
width and power-of-2 value is (1<<15) in unit of sector (512byte). It
means the maximum value of bucket size in bytes is (1<<24) bytes a.k.a
4096 pages.

When the bucket size is set to maximum permitted value, ilog2(4096) is
12, which exceeds the default maximum order __get_free_pages() can
accepted, the failed pages allocation will fail cache set registration
procedure and print a kernel oops message for the exceeded pages order.

This patch introduces meta_bucket_pages(), meta_bucket_bytes(), and
alloc_bucket_pages() helper routines. meta_bucket_pages() indicates the
maximum pages can be allocated to meta data bucket, meta_bucket_bytes()
indicates the according maximum bytes, and alloc_bucket_pages() does
the pages allocation for meta bucket. Because meta_bucket_pages()
chooses the smaller value among the bucket size and MAX_ORDER_NR_PAGES,
it still works when MAX_ORDER overwritten by CONFIG_FORCE_MAX_ZONEORDER.

Following patches will use these helper routines to decide maximum pages
can be allocated for different meta data buckets. If the bucket size is
larger than meta_bucket_bytes(), the bcache registration can continue to
success, just the space more than meta_bucket_bytes() inside the bucket
is wasted. Comparing bcache failed for large bucket size, wasting some
space for meta data buckets is acceptable at this moment.

Signed-off-by: Coly Li <colyli@suse.de>
---
 drivers/md/bcache/bcache.h | 20 ++++++++++++++++++++
 drivers/md/bcache/super.c  |  3 +++
 2 files changed, 23 insertions(+)

diff --git a/drivers/md/bcache/bcache.h b/drivers/md/bcache/bcache.h
index 80e3c4813fb0..972f1aff0f70 100644
--- a/drivers/md/bcache/bcache.h
+++ b/drivers/md/bcache/bcache.h
@@ -762,6 +762,26 @@ struct bbio {
 #define bucket_bytes(c)		((c)->sb.bucket_size << 9)
 #define block_bytes(c)		((c)->sb.block_size << 9)
 
+static inline unsigned int meta_bucket_pages(struct cache_sb *sb)
+{
+	unsigned int n, max_pages;
+
+	max_pages = min_t(unsigned int,
+			  __rounddown_pow_of_two(USHRT_MAX) / PAGE_SECTORS,
+			  MAX_ORDER_NR_PAGES);
+
+	n = sb->bucket_size / PAGE_SECTORS;
+	if (n > max_pages)
+		n = max_pages;
+
+	return n;
+}
+
+static inline unsigned int meta_bucket_bytes(struct cache_sb *sb)
+{
+	return meta_bucket_pages(sb) << PAGE_SHIFT;
+}
+
 #define prios_per_bucket(c)				\
 	((bucket_bytes(c) - sizeof(struct prio_set)) /	\
 	 sizeof(struct bucket_disk))
diff --git a/drivers/md/bcache/super.c b/drivers/md/bcache/super.c
index e8a505ef766d..4baec6ac55c8 100644
--- a/drivers/md/bcache/super.c
+++ b/drivers/md/bcache/super.c
@@ -1821,6 +1821,9 @@ void bch_cache_set_unregister(struct cache_set *c)
 #define alloc_bucket_pages(gfp, c)			\
 	((void *) __get_free_pages(__GFP_ZERO|__GFP_COMP|gfp, ilog2(bucket_pages(c))))
 
+#define alloc_meta_bucket_pages(gfp, sb)		\
+	((void *) __get_free_pages(__GFP_ZERO|__GFP_COMP|gfp, ilog2(meta_bucket_pages(sb))))
+
 struct cache_set *bch_cache_set_alloc(struct cache_sb *sb)
 {
 	int iter_size;
-- 
2.26.2


^ permalink raw reply	[flat|nested] 33+ messages in thread

* [PATCH v3 09/16] bcache: handle c->uuids properly for bucket size > 8MB
  2020-07-15 14:29 [PATCH v3 00/16] bcache: extend bucket size to 32bit width colyli
                   ` (7 preceding siblings ...)
  2020-07-15 14:30 ` [PATCH v3 08/16] bcache: introduce meta_bucket_pages() related helper routines colyli
@ 2020-07-15 14:30 ` colyli
  2020-07-15 15:37   ` Hannes Reinecke
  2020-07-15 14:30 ` [PATCH v3 10/16] bcache: handle cache prio_buckets and disk_buckets " colyli
                   ` (6 subsequent siblings)
  15 siblings, 1 reply; 33+ messages in thread
From: colyli @ 2020-07-15 14:30 UTC (permalink / raw)
  To: linux-bcache; +Cc: linux-block, Coly Li

From: Coly Li <colyli@suse.de>

Bcache allocates a whole bucket to store c->uuids on cache device, and
allocates continuous pages to store it in-memory. When the bucket size
exceeds maximum allocable continuous pages, bch_cache_set_alloc() will
fail and cache device registration will fail.

This patch allocates c->uuids by alloc_meta_bucket_pages(), and uses
ilog2(meta_bucket_pages(c)) to indicate order of c->uuids pages when
free it. When writing c->uuids to cache device, its size is decided
by meta_bucket_pages(c) * PAGE_SECTORS. Now c->uuids is properly handled
for bucket size > 8MB.

Signed-off-by: Coly Li <colyli@suse.de>
---
 drivers/md/bcache/super.c | 10 ++++++----
 1 file changed, 6 insertions(+), 4 deletions(-)

diff --git a/drivers/md/bcache/super.c b/drivers/md/bcache/super.c
index 4baec6ac55c8..0c9153fdcc10 100644
--- a/drivers/md/bcache/super.c
+++ b/drivers/md/bcache/super.c
@@ -466,6 +466,7 @@ static int __uuid_write(struct cache_set *c)
 	BKEY_PADDED(key) k;
 	struct closure cl;
 	struct cache *ca;
+	unsigned int size;
 
 	closure_init_stack(&cl);
 	lockdep_assert_held(&bch_register_lock);
@@ -473,7 +474,8 @@ static int __uuid_write(struct cache_set *c)
 	if (bch_bucket_alloc_set(c, RESERVE_BTREE, &k.key, 1, true))
 		return 1;
 
-	SET_KEY_SIZE(&k.key, c->sb.bucket_size);
+	size =  meta_bucket_pages(&c->sb) * PAGE_SECTORS;
+	SET_KEY_SIZE(&k.key, size);
 	uuid_io(c, REQ_OP_WRITE, 0, &k.key, &cl);
 	closure_sync(&cl);
 
@@ -1656,7 +1658,7 @@ static void cache_set_free(struct closure *cl)
 		}
 
 	bch_bset_sort_state_free(&c->sort);
-	free_pages((unsigned long) c->uuids, ilog2(bucket_pages(c)));
+	free_pages((unsigned long) c->uuids, ilog2(meta_bucket_pages(&c->sb)));
 
 	if (c->moving_gc_wq)
 		destroy_workqueue(c->moving_gc_wq);
@@ -1862,7 +1864,7 @@ struct cache_set *bch_cache_set_alloc(struct cache_sb *sb)
 
 	c->bucket_bits		= ilog2(sb->bucket_size);
 	c->block_bits		= ilog2(sb->block_size);
-	c->nr_uuids		= bucket_bytes(c) / sizeof(struct uuid_entry);
+	c->nr_uuids		= meta_bucket_bytes(&c->sb) / sizeof(struct uuid_entry);
 	c->devices_max_used	= 0;
 	atomic_set(&c->attached_dev_nr, 0);
 	c->btree_pages		= bucket_pages(c);
@@ -1913,7 +1915,7 @@ struct cache_set *bch_cache_set_alloc(struct cache_sb *sb)
 			BIOSET_NEED_BVECS|BIOSET_NEED_RESCUER))
 		goto err;
 
-	c->uuids = alloc_bucket_pages(GFP_KERNEL, c);
+	c->uuids = alloc_meta_bucket_pages(GFP_KERNEL, &c->sb);
 	if (!c->uuids)
 		goto err;
 
-- 
2.26.2


^ permalink raw reply	[flat|nested] 33+ messages in thread

* [PATCH v3 10/16] bcache: handle cache prio_buckets and disk_buckets properly for bucket size > 8MB
  2020-07-15 14:29 [PATCH v3 00/16] bcache: extend bucket size to 32bit width colyli
                   ` (8 preceding siblings ...)
  2020-07-15 14:30 ` [PATCH v3 09/16] bcache: handle c->uuids properly for bucket size > 8MB colyli
@ 2020-07-15 14:30 ` colyli
  2020-07-15 15:38   ` Hannes Reinecke
  2020-07-15 14:30 ` [PATCH v3 11/16] bcache: handle cache set verify_ondisk " colyli
                   ` (5 subsequent siblings)
  15 siblings, 1 reply; 33+ messages in thread
From: colyli @ 2020-07-15 14:30 UTC (permalink / raw)
  To: linux-bcache; +Cc: linux-block, Coly Li

From: Coly Li <colyli@suse.de>

Similar to c->uuids, struct cache's prio_buckets and disk_buckets also
have the potential memory allocation failure during cache registration
if the bucket size > 8MB.

ca->prio_buckets can be stored on cache device in multiple buckets, its
in-memory space is allocated by kzalloc() interface but normally
allocated by alloc_pages() because the size > KMALLOC_MAX_CACHE_SIZE.

So allocation of ca->prio_buckets has the MAX_ORDER restriction too. If
the bucket size > 8MB, by default the page allocator will fail because
the page order > 11 (default MAX_ORDER value). ca->prio_buckets should
also use meta_bucket_bytes(), meta_bucket_pages() to decide its memory
size and use alloc_meta_bucket_pages() to allocate pages, to avoid the
allocation failure during cache set registration when bucket size > 8MB.

ca->disk_buckets is a single bucket size memory buffer, it is used to
iterate each bucket of ca->prio_buckets, and compose the bio based on
memory of ca->disk_buckets, then write ca->disk_buckets memory to cache
disk one-by-one for each bucket of ca->prio_buckets. ca->disk_buckets
should have in-memory size exact to the meta_bucket_pages(), this is the
size that ca->prio_buckets will be stored into each on-disk bucket.

This patch fixes the above issues and handle cache's prio_buckets and
disk_buckets properly for bucket size larger than 8MB.

Signed-off-by: Coly Li <colyli@suse.de>
---
 drivers/md/bcache/bcache.h |  9 +++++----
 drivers/md/bcache/super.c  | 10 +++++-----
 2 files changed, 10 insertions(+), 9 deletions(-)

diff --git a/drivers/md/bcache/bcache.h b/drivers/md/bcache/bcache.h
index 972f1aff0f70..0ebfda284866 100644
--- a/drivers/md/bcache/bcache.h
+++ b/drivers/md/bcache/bcache.h
@@ -782,11 +782,12 @@ static inline unsigned int meta_bucket_bytes(struct cache_sb *sb)
 	return meta_bucket_pages(sb) << PAGE_SHIFT;
 }
 
-#define prios_per_bucket(c)				\
-	((bucket_bytes(c) - sizeof(struct prio_set)) /	\
+#define prios_per_bucket(ca)						\
+	((meta_bucket_bytes(&(ca)->sb) - sizeof(struct prio_set)) /	\
 	 sizeof(struct bucket_disk))
-#define prio_buckets(c)					\
-	DIV_ROUND_UP((size_t) (c)->sb.nbuckets, prios_per_bucket(c))
+
+#define prio_buckets(ca)						\
+	DIV_ROUND_UP((size_t) (ca)->sb.nbuckets, prios_per_bucket(ca))
 
 static inline size_t sector_to_bucket(struct cache_set *c, sector_t s)
 {
diff --git a/drivers/md/bcache/super.c b/drivers/md/bcache/super.c
index 0c9153fdcc10..b1d8388ef57d 100644
--- a/drivers/md/bcache/super.c
+++ b/drivers/md/bcache/super.c
@@ -563,7 +563,7 @@ static void prio_io(struct cache *ca, uint64_t bucket, int op,
 
 	bio->bi_iter.bi_sector	= bucket * ca->sb.bucket_size;
 	bio_set_dev(bio, ca->bdev);
-	bio->bi_iter.bi_size	= bucket_bytes(ca);
+	bio->bi_iter.bi_size	= meta_bucket_bytes(&ca->sb);
 
 	bio->bi_end_io	= prio_endio;
 	bio->bi_private = ca;
@@ -621,7 +621,7 @@ int bch_prio_write(struct cache *ca, bool wait)
 
 		p->next_bucket	= ca->prio_buckets[i + 1];
 		p->magic	= pset_magic(&ca->sb);
-		p->csum		= bch_crc64(&p->magic, bucket_bytes(ca) - 8);
+		p->csum		= bch_crc64(&p->magic, meta_bucket_bytes(&ca->sb) - 8);
 
 		bucket = bch_bucket_alloc(ca, RESERVE_PRIO, wait);
 		BUG_ON(bucket == -1);
@@ -674,7 +674,7 @@ static int prio_read(struct cache *ca, uint64_t bucket)
 			prio_io(ca, bucket, REQ_OP_READ, 0);
 
 			if (p->csum !=
-			    bch_crc64(&p->magic, bucket_bytes(ca) - 8)) {
+			    bch_crc64(&p->magic, meta_bucket_bytes(&ca->sb) - 8)) {
 				pr_warn("bad csum reading priorities\n");
 				goto out;
 			}
@@ -2222,7 +2222,7 @@ void bch_cache_release(struct kobject *kobj)
 		ca->set->cache[ca->sb.nr_this_dev] = NULL;
 	}
 
-	free_pages((unsigned long) ca->disk_buckets, ilog2(bucket_pages(ca)));
+	free_pages((unsigned long) ca->disk_buckets, ilog2(meta_bucket_pages(&ca->sb)));
 	kfree(ca->prio_buckets);
 	vfree(ca->buckets);
 
@@ -2319,7 +2319,7 @@ static int cache_alloc(struct cache *ca)
 		goto err_prio_buckets_alloc;
 	}
 
-	ca->disk_buckets = alloc_bucket_pages(GFP_KERNEL, ca);
+	ca->disk_buckets = alloc_meta_bucket_pages(GFP_KERNEL, &ca->sb);
 	if (!ca->disk_buckets) {
 		err = "ca->disk_buckets alloc failed";
 		goto err_disk_buckets_alloc;
-- 
2.26.2


^ permalink raw reply	[flat|nested] 33+ messages in thread

* [PATCH v3 11/16] bcache: handle cache set verify_ondisk properly for bucket size > 8MB
  2020-07-15 14:29 [PATCH v3 00/16] bcache: extend bucket size to 32bit width colyli
                   ` (9 preceding siblings ...)
  2020-07-15 14:30 ` [PATCH v3 10/16] bcache: handle cache prio_buckets and disk_buckets " colyli
@ 2020-07-15 14:30 ` colyli
  2020-07-16  6:07   ` Hannes Reinecke
  2020-07-15 14:30 ` [PATCH v3 12/16] bcache: handle btree node memory allocation " colyli
                   ` (4 subsequent siblings)
  15 siblings, 1 reply; 33+ messages in thread
From: colyli @ 2020-07-15 14:30 UTC (permalink / raw)
  To: linux-bcache; +Cc: linux-block, Coly Li

From: Coly Li <colyli@suse.de>

In bch_btree_cache_alloc() when CONFIG_BCACHE_DEBUG is configured,
allocate memory for c->verify_ondisk may fail if the bucket size > 8MB,
which will require __get_free_pages() to allocate continuous pages
with order > 11 (the default MAX_ORDER of Linux buddy allocator). Such
over size allocation will fail, and cause 2 problems,
- When CONFIG_BCACHE_DEBUG is configured,  bch_btree_verify() does not
  work, because c->verify_ondisk is NULL and bch_btree_verify() returns
  immediately.
- bch_btree_cache_alloc() will fail due to c->verify_ondisk allocation
  failed, then the whole cache device registration fails. And because of
  this failure, the first problem of bch_btree_verify() has no chance to
  be triggered.

This patch fixes the above problem by two means,
1) If pages allocation of c->verify_ondisk fails, set it to NULL and
   returns bch_btree_cache_alloc() with -ENOMEM.
2) When calling __get_free_pages() to allocate c->verify_ondisk pages,
   use ilog2(meta_bucket_pages(&c->sb)) to make sure ilog2() will always
   generate a pages order <= MAX_ORDER (or CONFIG_FORCE_MAX_ZONEORDER).
   Then the buddy system won't directly reject the allocation request.

Signed-off-by: Coly Li <colyli@suse.de>
---
 drivers/md/bcache/btree.c | 12 ++++++++++--
 1 file changed, 10 insertions(+), 2 deletions(-)

diff --git a/drivers/md/bcache/btree.c b/drivers/md/bcache/btree.c
index dd116c83de80..79716ac9fb5d 100644
--- a/drivers/md/bcache/btree.c
+++ b/drivers/md/bcache/btree.c
@@ -738,7 +738,7 @@ void bch_btree_cache_free(struct cache_set *c)
 	if (c->verify_data)
 		list_move(&c->verify_data->list, &c->btree_cache);
 
-	free_pages((unsigned long) c->verify_ondisk, ilog2(bucket_pages(c)));
+	free_pages((unsigned long) c->verify_ondisk, ilog2(meta_bucket_pages(&c->sb)));
 #endif
 
 	list_splice(&c->btree_cache_freeable,
@@ -785,7 +785,15 @@ int bch_btree_cache_alloc(struct cache_set *c)
 	mutex_init(&c->verify_lock);
 
 	c->verify_ondisk = (void *)
-		__get_free_pages(GFP_KERNEL|__GFP_COMP, ilog2(bucket_pages(c)));
+		__get_free_pages(GFP_KERNEL|__GFP_COMP, ilog2(meta_bucket_pages(&c->sb)));
+	if (!c->verify_ondisk) {
+		/*
+		 * Don't worry about the mca_rereserve buckets
+		 * allocated in previous for-loop, they will be
+		 * handled properly in bch_cache_set_unregister().
+		 */
+		return -ENOMEM;
+	}
 
 	c->verify_data = mca_bucket_alloc(c, &ZERO_KEY, GFP_KERNEL);
 
-- 
2.26.2


^ permalink raw reply	[flat|nested] 33+ messages in thread

* [PATCH v3 12/16] bcache: handle btree node memory allocation properly for bucket size > 8MB
  2020-07-15 14:29 [PATCH v3 00/16] bcache: extend bucket size to 32bit width colyli
                   ` (10 preceding siblings ...)
  2020-07-15 14:30 ` [PATCH v3 11/16] bcache: handle cache set verify_ondisk " colyli
@ 2020-07-15 14:30 ` colyli
  2020-07-16  6:08   ` Hannes Reinecke
  2020-07-15 14:30 ` [PATCH v3 13/16] bcache: add bucket_size_hi into struct cache_sb_disk for large bucket colyli
                   ` (3 subsequent siblings)
  15 siblings, 1 reply; 33+ messages in thread
From: colyli @ 2020-07-15 14:30 UTC (permalink / raw)
  To: linux-bcache; +Cc: linux-block, Coly Li

From: Coly Li <colyli@suse.de>

Currently the bcache internal btree node occupies a whole bucket. When
loading the btree node from cache device into memory, mca_data_alloc()
will call bch_btree_keys_alloc() to allocate memory for the whole bucket
size, ilog2(b->c->btree_pages) is send to bch_btree_keys_alloc() as the
parameter 'page_order'.

c->btree_pages is set as bucket_pages() in bch_cache_set_alloc(), for
bucket size > 8MB, ilog2(b->c->btree_pages) is 12 for 4KB page size. By
default the maximum page order __get_free_pages() accepts is MAX_ORDER
(11), in this condition bch_btree_keys_alloc() will always fail.

Because of other over-page-order allocation failure fails the cache
device registration, such btree node allocation failure wasn't observed
during runtime. After other blocking page allocation failures for bucket
size > 8MB, this btree node allocation issue may trigger potentical risk
e.g. infinite dead-loop to retry btree node allocation after failure.

This patch fixes the potential problem by setting c->btree_pages to
meta_bucket_pages() in bch_cache_set_alloc(). In the condition that
bucket size > 8MB, meta_bucket_pages() will always return a number which
won't exceed the maximum page order of the buddy allocator.

Signed-off-by: Coly Li <colyli@suse.de>
---
 drivers/md/bcache/super.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/md/bcache/super.c b/drivers/md/bcache/super.c
index b1d8388ef57d..02901d0ae8e2 100644
--- a/drivers/md/bcache/super.c
+++ b/drivers/md/bcache/super.c
@@ -1867,7 +1867,7 @@ struct cache_set *bch_cache_set_alloc(struct cache_sb *sb)
 	c->nr_uuids		= meta_bucket_bytes(&c->sb) / sizeof(struct uuid_entry);
 	c->devices_max_used	= 0;
 	atomic_set(&c->attached_dev_nr, 0);
-	c->btree_pages		= bucket_pages(c);
+	c->btree_pages		= meta_bucket_pages(&c->sb);
 	if (c->btree_pages > BTREE_MAX_PAGES)
 		c->btree_pages = max_t(int, c->btree_pages / 4,
 				       BTREE_MAX_PAGES);
-- 
2.26.2


^ permalink raw reply	[flat|nested] 33+ messages in thread

* [PATCH v3 13/16] bcache: add bucket_size_hi into struct cache_sb_disk for large bucket
  2020-07-15 14:29 [PATCH v3 00/16] bcache: extend bucket size to 32bit width colyli
                   ` (11 preceding siblings ...)
  2020-07-15 14:30 ` [PATCH v3 12/16] bcache: handle btree node memory allocation " colyli
@ 2020-07-15 14:30 ` colyli
  2020-07-16  6:15   ` Hannes Reinecke
  2020-07-15 14:30 ` [PATCH v3 14/16] bcache: add sysfs file to display feature sets information of cache set colyli
                   ` (2 subsequent siblings)
  15 siblings, 1 reply; 33+ messages in thread
From: colyli @ 2020-07-15 14:30 UTC (permalink / raw)
  To: linux-bcache; +Cc: linux-block, Coly Li

From: Coly Li <colyli@suse.de>

The large bucket feature is to extend bucket_size from 16bit to 32bit.

When create cache device on zoned device (e.g. zoned NVMe SSD), making
a single bucket cover one or more zones of the zoned device is the
simplest way to support zoned device as cache by bcache.

But current maximum bucket size is 16MB and a typical zone size of zoned
device is 256MB, this is the major motiviation to extend bucket size to
a larger bit width.

This patch is the basic and first change to support large bucket size,
the major changes it makes are,
- Add BCH_FEATURE_INCOMPAT_LARGE_BUCKET for the large bucket feature,
  INCOMPAT means it introduces incompatible on-disk format change.
- Add BCH_FEATURE_INCOMPAT_FUNCS(large_bucket, LARGE_BUCKET) routines.
- Adds __le32 bucket_size_hi into struct cache_sb_disk at offset 0x8d0
  for the on-disk super block format.
- For the in-memory super block struct cache_sb, member bucket_size is
  extended from __u16 to __32.
- Add get_bucket_size() to combine the bucket_size and bucket_size_hi
  from struct cache_sb_disk into an unsigned int value.

Since we already have large bucket size helpers meta_bucket_pages(),
meta_bucket_bytes() and alloc_meta_bucket_pages(), they make sure when
bucket size > 8MB, the memory allocation for bcache meta data bucket
won't fail no matter how large the bucket size extended. So these meta
data buckets are handled properly when the bucket size width increase
from 16bit to 32bit, we don't need to worry about them.

Signed-off-by: Coly Li <colyli@suse.de>
---
 drivers/md/bcache/alloc.c    |  2 +-
 drivers/md/bcache/features.c | 22 ++++++++++++++++++++++
 drivers/md/bcache/features.h |  9 ++++++---
 drivers/md/bcache/movinggc.c |  4 ++--
 drivers/md/bcache/super.c    | 23 +++++++++++++++++++----
 include/uapi/linux/bcache.h  |  3 ++-
 6 files changed, 52 insertions(+), 11 deletions(-)
 create mode 100644 drivers/md/bcache/features.c

diff --git a/drivers/md/bcache/alloc.c b/drivers/md/bcache/alloc.c
index a1df0d95151c..52035a78d836 100644
--- a/drivers/md/bcache/alloc.c
+++ b/drivers/md/bcache/alloc.c
@@ -87,7 +87,7 @@ void bch_rescale_priorities(struct cache_set *c, int sectors)
 {
 	struct cache *ca;
 	struct bucket *b;
-	unsigned int next = c->nbuckets * c->sb.bucket_size / 1024;
+	unsigned long next = c->nbuckets * c->sb.bucket_size / 1024;
 	unsigned int i;
 	int r;
 
diff --git a/drivers/md/bcache/features.c b/drivers/md/bcache/features.c
new file mode 100644
index 000000000000..ba53944bb390
--- /dev/null
+++ b/drivers/md/bcache/features.c
@@ -0,0 +1,22 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * Feature set bits and string conversion.
+ * Inspired by ext4's features compat/incompat/ro_compat related code.
+ *
+ * Copyright 2020 Coly Li <colyli@suse.de>
+ *
+ */
+#include <linux/bcache.h>
+#include "bcache.h"
+
+struct feature {
+	int		compat;
+	unsigned int	mask;
+	const char	*string;
+};
+
+static struct feature feature_list[] = {
+	{BCH_FEATURE_INCOMPAT, BCH_FEATURE_INCOMPAT_LARGE_BUCKET,
+		"large_bucket"},
+	{0, 0, 0 },
+};
diff --git a/drivers/md/bcache/features.h b/drivers/md/bcache/features.h
index ae7df37b9862..dca052cf5203 100644
--- a/drivers/md/bcache/features.h
+++ b/drivers/md/bcache/features.h
@@ -11,9 +11,13 @@
 #define BCH_FEATURE_INCOMPAT		2
 #define BCH_FEATURE_TYPE_MASK		0x03
 
+/* Feature set definition */
+/* Incompat feature set */
+#define BCH_FEATURE_INCOMPAT_LARGE_BUCKET	0x0001 /* 32bit bucket size */
+
 #define BCH_FEATURE_COMPAT_SUUP		0
 #define BCH_FEATURE_RO_COMPAT_SUUP	0
-#define BCH_FEATURE_INCOMPAT_SUUP	0
+#define BCH_FEATURE_INCOMPAT_SUUP	BCH_FEATURE_INCOMPAT_LARGE_BUCKET
 
 #define BCH_HAS_COMPAT_FEATURE(sb, mask) \
 		((sb)->feature_compat & (mask))
@@ -22,8 +26,6 @@
 #define BCH_HAS_INCOMPAT_FEATURE(sb, mask) \
 		((sb)->feature_incompat & (mask))
 
-/* Feature set definition */
-
 #define BCH_FEATURE_COMPAT_FUNCS(name, flagname) \
 static inline int bch_has_feature_##name(struct cache_sb *sb) \
 { \
@@ -75,4 +77,5 @@ static inline void bch_clear_feature_##name(struct cache_sb *sb) \
 		~BCH##_FEATURE_INCOMPAT_##flagname; \
 }
 
+BCH_FEATURE_INCOMPAT_FUNCS(large_bucket, LARGE_BUCKET);
 #endif
diff --git a/drivers/md/bcache/movinggc.c b/drivers/md/bcache/movinggc.c
index b7dd2d75f58c..5872d6470470 100644
--- a/drivers/md/bcache/movinggc.c
+++ b/drivers/md/bcache/movinggc.c
@@ -206,8 +206,8 @@ void bch_moving_gc(struct cache_set *c)
 	mutex_lock(&c->bucket_lock);
 
 	for_each_cache(ca, c, i) {
-		unsigned int sectors_to_move = 0;
-		unsigned int reserve_sectors = ca->sb.bucket_size *
+		unsigned long sectors_to_move = 0;
+		unsigned long reserve_sectors = ca->sb.bucket_size *
 			     fifo_used(&ca->free[RESERVE_MOVINGGC]);
 
 		ca->heap.used = 0;
diff --git a/drivers/md/bcache/super.c b/drivers/md/bcache/super.c
index 02901d0ae8e2..e0da52f8e8c9 100644
--- a/drivers/md/bcache/super.c
+++ b/drivers/md/bcache/super.c
@@ -60,6 +60,17 @@ struct workqueue_struct *bch_journal_wq;
 
 /* Superblock */
 
+static unsigned int get_bucket_size(struct cache_sb *sb, struct cache_sb_disk *s)
+{
+	unsigned int bucket_size = le16_to_cpu(s->bucket_size);
+
+	if (sb->version >= BCACHE_SB_VERSION_CDEV_WITH_FEATURES &&
+	     bch_has_feature_large_bucket(sb))
+		bucket_size |= le32_to_cpu(s->bucket_size_hi) << 16;
+
+	return bucket_size;
+}
+
 static const char *read_super_common(struct cache_sb *sb,  struct block_device *bdev,
 				     struct cache_sb_disk *s)
 {
@@ -68,7 +79,7 @@ static const char *read_super_common(struct cache_sb *sb,  struct block_device *
 
 	sb->first_bucket= le16_to_cpu(s->first_bucket);
 	sb->nbuckets	= le64_to_cpu(s->nbuckets);
-	sb->bucket_size	= le16_to_cpu(s->bucket_size);
+	sb->bucket_size	= get_bucket_size(sb, s);
 
 	sb->nr_in_set	= le16_to_cpu(s->nr_in_set);
 	sb->nr_this_dev	= le16_to_cpu(s->nr_this_dev);
@@ -210,12 +221,16 @@ static const char *read_super(struct cache_sb *sb, struct block_device *bdev,
 			goto err;
 		break;
 	case BCACHE_SB_VERSION_CDEV_WITH_FEATURES:
-		err = read_super_common(sb, bdev, s);
-		if (err)
-			goto err;
+		/*
+		 * Feature bits are needed in read_super_common(),
+		 * convert them firstly.
+		 */
 		sb->feature_compat = le64_to_cpu(s->feature_compat);
 		sb->feature_incompat = le64_to_cpu(s->feature_incompat);
 		sb->feature_ro_compat = le64_to_cpu(s->feature_ro_compat);
+		err = read_super_common(sb, bdev, s);
+		if (err)
+			goto err;
 		break;
 	default:
 		err = "Unsupported superblock version";
diff --git a/include/uapi/linux/bcache.h b/include/uapi/linux/bcache.h
index 0ef984ea515a..3682b43fdf1e 100644
--- a/include/uapi/linux/bcache.h
+++ b/include/uapi/linux/bcache.h
@@ -213,6 +213,7 @@ struct cache_sb_disk {
 		__le16		keys;
 	};
 	__le64			d[SB_JOURNAL_BUCKETS];	/* journal buckets */
+	__le32			bucket_size_hi;
 };
 
 /*
@@ -247,9 +248,9 @@ struct cache_sb {
 		__u64		nbuckets;	/* device size */
 
 		__u16		block_size;	/* sectors */
-		__u16		bucket_size;	/* sectors */
 		__u16		nr_in_set;
 		__u16		nr_this_dev;
+		__u32		bucket_size;	/* sectors */
 	};
 	struct {
 		/* Backing devices */
-- 
2.26.2


^ permalink raw reply	[flat|nested] 33+ messages in thread

* [PATCH v3 14/16] bcache: add sysfs file to display feature sets information of cache set
  2020-07-15 14:29 [PATCH v3 00/16] bcache: extend bucket size to 32bit width colyli
                   ` (12 preceding siblings ...)
  2020-07-15 14:30 ` [PATCH v3 13/16] bcache: add bucket_size_hi into struct cache_sb_disk for large bucket colyli
@ 2020-07-15 14:30 ` colyli
  2020-07-16  6:17   ` Hannes Reinecke
  2020-07-15 14:30 ` [PATCH v3 15/16] bcache: avoid extra memory allocation from mempool c->fill_iter colyli
  2020-07-15 14:30 ` [PATCH v3 16/16] bcache: avoid extra memory consumption in struct bbio for large bucket size colyli
  15 siblings, 1 reply; 33+ messages in thread
From: colyli @ 2020-07-15 14:30 UTC (permalink / raw)
  To: linux-bcache; +Cc: linux-block, Coly Li

From: Coly Li <colyli@suse.de>

A new sysfs file /sys/fs/bcache/<cache set UUID>/internal/feature_sets
is added by this patch, to display feature sets information of the cache
set.

Now only an incompat feature 'large_bucket' added in bcache, the sysfs
file content is:
	feature_incompat: [large_bucket]
string large_bucket means the running bcache drive supports incompat
feature 'large_bucket', the wrapping [] means the 'large_bucket' feature
is currently enabled on this cache set.

This patch is ready to display compat and ro_compat features, in future
once bcache code implements such feature sets, the according feature
strings will be displayed in this sysfs file too.

Signed-off-by: Coly Li <colyli@suse.de>
---
 drivers/md/bcache/Makefile   |  2 +-
 drivers/md/bcache/features.c | 48 ++++++++++++++++++++++++++++++++++++
 drivers/md/bcache/features.h |  3 +++
 drivers/md/bcache/sysfs.c    |  6 +++++
 4 files changed, 58 insertions(+), 1 deletion(-)

diff --git a/drivers/md/bcache/Makefile b/drivers/md/bcache/Makefile
index fd714628da6a..5b87e59676b8 100644
--- a/drivers/md/bcache/Makefile
+++ b/drivers/md/bcache/Makefile
@@ -4,4 +4,4 @@ obj-$(CONFIG_BCACHE)	+= bcache.o
 
 bcache-y		:= alloc.o bset.o btree.o closure.o debug.o extents.o\
 	io.o journal.o movinggc.o request.o stats.o super.o sysfs.o trace.o\
-	util.o writeback.o
+	util.o writeback.o features.o
diff --git a/drivers/md/bcache/features.c b/drivers/md/bcache/features.c
index ba53944bb390..5c601635e11c 100644
--- a/drivers/md/bcache/features.c
+++ b/drivers/md/bcache/features.c
@@ -8,6 +8,7 @@
  */
 #include <linux/bcache.h>
 #include "bcache.h"
+#include "features.h"
 
 struct feature {
 	int		compat;
@@ -20,3 +21,50 @@ static struct feature feature_list[] = {
 		"large_bucket"},
 	{0, 0, 0 },
 };
+
+#define compose_feature_string(type, head)				\
+({									\
+	struct feature *f;						\
+	bool first = true;						\
+									\
+	for (f = &feature_list[0]; f->compat != 0; f++) {		\
+		if (f->compat != BCH_FEATURE_ ## type)			\
+			continue;					\
+		if (BCH_HAS_ ## type ## _FEATURE(&c->sb, f->mask)) {	\
+			if (first) {					\
+				out += snprintf(out, buf + size - out,	\
+						"%s%s", head, ": [");	\
+			} else {					\
+				out += snprintf(out, buf + size - out,	\
+						" [");			\
+			}						\
+		} else {						\
+			if (first)					\
+				out += snprintf(out, buf + size - out,	\
+						"%s%s", head, ": ");	\
+			else						\
+				out += snprintf(out, buf + size - out,	\
+						" ");			\
+		}							\
+									\
+		out += snprintf(out, buf + size - out, "%s", f->string);\
+									\
+		if (BCH_HAS_ ## type ## _FEATURE(&c->sb, f->mask))	\
+			out += snprintf(out, buf + size - out, "]");	\
+									\
+		first = false;						\
+	}								\
+	if (!first)							\
+		out += snprintf(out, buf + size - out, "\n");		\
+})
+
+int bch_print_cache_set_feature_set(struct cache_set *c, char *buf, int size)
+{
+	char *out = buf;
+
+	compose_feature_string(COMPAT, "feature_compat");
+	compose_feature_string(RO_COMPAT, "feature_ro_compat");
+	compose_feature_string(INCOMPAT, "feature_incompat");
+
+	return out - buf;
+}
diff --git a/drivers/md/bcache/features.h b/drivers/md/bcache/features.h
index dca052cf5203..350a4f413136 100644
--- a/drivers/md/bcache/features.h
+++ b/drivers/md/bcache/features.h
@@ -78,4 +78,7 @@ static inline void bch_clear_feature_##name(struct cache_sb *sb) \
 }
 
 BCH_FEATURE_INCOMPAT_FUNCS(large_bucket, LARGE_BUCKET);
+
+int bch_print_cache_set_feature_set(struct cache_set *c, char *buf, int size);
+
 #endif
diff --git a/drivers/md/bcache/sysfs.c b/drivers/md/bcache/sysfs.c
index 0dadec5a78f6..e3633f06d43b 100644
--- a/drivers/md/bcache/sysfs.c
+++ b/drivers/md/bcache/sysfs.c
@@ -11,6 +11,7 @@
 #include "btree.h"
 #include "request.h"
 #include "writeback.h"
+#include "features.h"
 
 #include <linux/blkdev.h>
 #include <linux/sort.h>
@@ -88,6 +89,7 @@ read_attribute(btree_used_percent);
 read_attribute(average_key_size);
 read_attribute(dirty_data);
 read_attribute(bset_tree_stats);
+read_attribute(feature_sets);
 
 read_attribute(state);
 read_attribute(cache_read_races);
@@ -779,6 +781,9 @@ SHOW(__bch_cache_set)
 	if (attr == &sysfs_bset_tree_stats)
 		return bch_bset_print_stats(c, buf);
 
+	if (attr == &sysfs_feature_sets)
+		return bch_print_cache_set_feature_set(c, buf, PAGE_SIZE);
+
 	return 0;
 }
 SHOW_LOCKED(bch_cache_set)
@@ -987,6 +992,7 @@ static struct attribute *bch_cache_set_internal_files[] = {
 	&sysfs_io_disable,
 	&sysfs_cutoff_writeback,
 	&sysfs_cutoff_writeback_sync,
+	&sysfs_feature_sets,
 	NULL
 };
 KTYPE(bch_cache_set_internal);
-- 
2.26.2


^ permalink raw reply	[flat|nested] 33+ messages in thread

* [PATCH v3 15/16] bcache: avoid extra memory allocation from mempool c->fill_iter
  2020-07-15 14:29 [PATCH v3 00/16] bcache: extend bucket size to 32bit width colyli
                   ` (13 preceding siblings ...)
  2020-07-15 14:30 ` [PATCH v3 14/16] bcache: add sysfs file to display feature sets information of cache set colyli
@ 2020-07-15 14:30 ` colyli
  2020-07-16  6:18   ` Hannes Reinecke
  2020-07-15 14:30 ` [PATCH v3 16/16] bcache: avoid extra memory consumption in struct bbio for large bucket size colyli
  15 siblings, 1 reply; 33+ messages in thread
From: colyli @ 2020-07-15 14:30 UTC (permalink / raw)
  To: linux-bcache; +Cc: linux-block, Coly Li

From: Coly Li <colyli@suse.de>

Mempool c->fill_iter is used to allocate memory for struct btree_iter in
bch_btree_node_read_done() to iterate all keys of a read-in btree node.

The allocation size is defined in bch_cache_set_alloc() by,
  mempool_init_kmalloc_pool(&c->fill_iter, 1, iter_size))
where iter_size is defined by a calculation,
  (sb->bucket_size / sb->block_size + 1) * sizeof(struct btree_iter_set)

For 16bit width bucket_size the calculation is OK, but now the bucket
size is extended to 32bit, the bucket size can be 2GB. By the above
calculation, iter_size can be 2048 pages (order 11 is still accepted by
buddy allocator).

But the actual size holds the bkeys in meta data bucket is limited to
meta_bucket_pages() already, which is 16MB. By the above calculation,
if replace sb->bucket_size by meta_bucket_pages() * PAGE_SECTORS, the
result is 16 pages. This is the size large enough for the mempool
allocation to struct btree_iter.

Therefore in worst case every time mempool c->fill_iter allocates, at
most 4080 pages are wasted and won't be used. Therefore this patch uses
meta_bucket_pages() * PAGE_SECTORS to calculate the iter size in
bch_cache_set_alloc(), to avoid extra memory allocation from mempool
c->fill_iter.

Signed-off-by: Coly Li <colyli@suse.de>
---
 drivers/md/bcache/super.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/md/bcache/super.c b/drivers/md/bcache/super.c
index e0da52f8e8c9..90494c7dead8 100644
--- a/drivers/md/bcache/super.c
+++ b/drivers/md/bcache/super.c
@@ -1908,7 +1908,7 @@ struct cache_set *bch_cache_set_alloc(struct cache_sb *sb)
 	INIT_LIST_HEAD(&c->btree_cache_freed);
 	INIT_LIST_HEAD(&c->data_buckets);
 
-	iter_size = (sb->bucket_size / sb->block_size + 1) *
+	iter_size = ((meta_bucket_pages(sb) * PAGE_SECTORS) / sb->block_size + 1) *
 		sizeof(struct btree_iter_set);
 
 	c->devices = kcalloc(c->nr_uuids, sizeof(void *), GFP_KERNEL);
-- 
2.26.2


^ permalink raw reply	[flat|nested] 33+ messages in thread

* [PATCH v3 16/16] bcache: avoid extra memory consumption in struct bbio for large bucket size
  2020-07-15 14:29 [PATCH v3 00/16] bcache: extend bucket size to 32bit width colyli
                   ` (14 preceding siblings ...)
  2020-07-15 14:30 ` [PATCH v3 15/16] bcache: avoid extra memory allocation from mempool c->fill_iter colyli
@ 2020-07-15 14:30 ` colyli
  2020-07-16  6:18   ` Hannes Reinecke
  15 siblings, 1 reply; 33+ messages in thread
From: colyli @ 2020-07-15 14:30 UTC (permalink / raw)
  To: linux-bcache; +Cc: linux-block, Coly Li

From: Coly Li <colyli@suse.de>

Bcache uses struct bbio to do I/Os for meta data pages like uuids,
disk_buckets, prio_buckets, and btree nodes.

Example writing a btree node onto cache device, the process is,
- Allocate a struct bbio from mempool c->bio_meta.
- Inside struct bbio embedded a struct bio, initialize bi_inline_vecs
  for this embedded bio.
- Call bch_bio_map() to map each meta data page to each bv from the
  inlined  bi_io_vec table.
- Call bch_submit_bbio() to submit the bio into underlying block layer.
- When the I/O completed, only release the struct bbio, don't touch the
  reference counter of the meta data pages.

The struct bbio is defined as,
738 struct bbio {
739     unsigned int            submit_time_us;
	[snipped]
748     struct bio              bio;
749 };

Because struct bio is embedded at the end of struct bbio, therefore the
actual size of struct bbio is sizeof(struct bio) + size of the embedded
bio->bi_inline_vecs.

Now all the meta data bucket size are limited to meta_bucket_pages(), if
the bucket size is large than meta_bucket_pages()*PAGE_SECTORS, rested
space in the bucket is unused. Therefore the most used space in meta
bucket is (1<<MAX_ORDER) pages, or (1<<CONFIG_FORCE_MAX_ZONEORDER) if it
is configured.

Therefore for large bucket size, it is unnecessary to calculate the
allocation size of mempool c->bio_meta as,
	mempool_init_kmalloc_pool(&c->bio_meta, 2,
			sizeof(struct bbio) +
			sizeof(struct bio_vec) * bucket_pages(c))
It is too large, neither the Linux buddy allocator cannot allocate so
much continuous pages, nor the extra allocated pages are wasted.

This patch replace bucket_pages() to meta_bucket_pages() in two places,
- In bch_cache_set_alloc(), when initialize mempool c->bio_meta, uses
  sizeof(struct bbio) + sizeof(struct bio_vec) * bucket_pages(c) to set
  the allocating object size.
- In bch_bbio_alloc(), when calling bio_init() to set inline bvec talbe
  bi_inline_bvecs, uses meta_bucket_pages() to indicate number of the
  inline bio vencs number.

Now the maximum size of embedded bio inside struct bbio exactly matches
the limit of meta_bucket_pages(), no extra page wasted.

Signed-off-by: Coly Li <colyli@suse.de>
---
 drivers/md/bcache/io.c    | 2 +-
 drivers/md/bcache/super.c | 2 +-
 2 files changed, 2 insertions(+), 2 deletions(-)

diff --git a/drivers/md/bcache/io.c b/drivers/md/bcache/io.c
index b25ee33b0d0b..a14a445618b4 100644
--- a/drivers/md/bcache/io.c
+++ b/drivers/md/bcache/io.c
@@ -26,7 +26,7 @@ struct bio *bch_bbio_alloc(struct cache_set *c)
 	struct bbio *b = mempool_alloc(&c->bio_meta, GFP_NOIO);
 	struct bio *bio = &b->bio;
 
-	bio_init(bio, bio->bi_inline_vecs, bucket_pages(c));
+	bio_init(bio, bio->bi_inline_vecs, meta_bucket_pages(&c->sb));
 
 	return bio;
 }
diff --git a/drivers/md/bcache/super.c b/drivers/md/bcache/super.c
index 90494c7dead8..cade3f09661d 100644
--- a/drivers/md/bcache/super.c
+++ b/drivers/md/bcache/super.c
@@ -1920,7 +1920,7 @@ struct cache_set *bch_cache_set_alloc(struct cache_sb *sb)
 
 	if (mempool_init_kmalloc_pool(&c->bio_meta, 2,
 			sizeof(struct bbio) +
-			sizeof(struct bio_vec) * bucket_pages(c)))
+			sizeof(struct bio_vec) * meta_bucket_pages(&c->sb)))
 		goto err;
 
 	if (mempool_init_kmalloc_pool(&c->fill_iter, 1, iter_size))
-- 
2.26.2


^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [PATCH v3 08/16] bcache: introduce meta_bucket_pages() related helper routines
  2020-07-15 14:30 ` [PATCH v3 08/16] bcache: introduce meta_bucket_pages() related helper routines colyli
@ 2020-07-15 15:36   ` Hannes Reinecke
  2020-07-15 16:00     ` Coly Li
  0 siblings, 1 reply; 33+ messages in thread
From: Hannes Reinecke @ 2020-07-15 15:36 UTC (permalink / raw)
  To: colyli, linux-bcache; +Cc: linux-block

On 7/15/20 4:30 PM, colyli@suse.de wrote:
> From: Coly Li <colyli@suse.de>
> 
> Currently the in-memory meta data like c->uuids or c->disk_buckets
> are allocated by alloc_bucket_pages(). The macro alloc_bucket_pages()
> calls __get_free_pages() to allocated continuous pages with order
> indicated by ilog2(bucket_pages(c)),
>   #define alloc_bucket_pages(gfp, c)                      \
>       ((void *) __get_free_pages(__GFP_ZERO|gfp, ilog2(bucket_pages(c))))
> 
> The maximum order is defined as MAX_ORDER, the default value is 11 (and
> can be overwritten by CONFIG_FORCE_MAX_ZONEORDER). In bcache code the
> maximum bucket size width is 16bits, this is restricted both by KEY_SIZE
> size and bucket_size size from struct cache_sb_disk. The maximum 16bits
> width and power-of-2 value is (1<<15) in unit of sector (512byte). It
> means the maximum value of bucket size in bytes is (1<<24) bytes a.k.a
> 4096 pages.
> 
> When the bucket size is set to maximum permitted value, ilog2(4096) is
> 12, which exceeds the default maximum order __get_free_pages() can
> accepted, the failed pages allocation will fail cache set registration
> procedure and print a kernel oops message for the exceeded pages order.
> 
> This patch introduces meta_bucket_pages(), meta_bucket_bytes(), and
> alloc_bucket_pages() helper routines. meta_bucket_pages() indicates the
> maximum pages can be allocated to meta data bucket, meta_bucket_bytes()
> indicates the according maximum bytes, and alloc_bucket_pages() does
> the pages allocation for meta bucket. Because meta_bucket_pages()
> chooses the smaller value among the bucket size and MAX_ORDER_NR_PAGES,
> it still works when MAX_ORDER overwritten by CONFIG_FORCE_MAX_ZONEORDER.
> 
> Following patches will use these helper routines to decide maximum pages
> can be allocated for different meta data buckets. If the bucket size is
> larger than meta_bucket_bytes(), the bcache registration can continue to
> success, just the space more than meta_bucket_bytes() inside the bucket
> is wasted. Comparing bcache failed for large bucket size, wasting some
> space for meta data buckets is acceptable at this moment.
> 
> Signed-off-by: Coly Li <colyli@suse.de>
> ---
>   drivers/md/bcache/bcache.h | 20 ++++++++++++++++++++
>   drivers/md/bcache/super.c  |  3 +++
>   2 files changed, 23 insertions(+)
> 
> diff --git a/drivers/md/bcache/bcache.h b/drivers/md/bcache/bcache.h
> index 80e3c4813fb0..972f1aff0f70 100644
> --- a/drivers/md/bcache/bcache.h
> +++ b/drivers/md/bcache/bcache.h
> @@ -762,6 +762,26 @@ struct bbio {
>   #define bucket_bytes(c)		((c)->sb.bucket_size << 9)
>   #define block_bytes(c)		((c)->sb.block_size << 9)
>   
> +static inline unsigned int meta_bucket_pages(struct cache_sb *sb)
> +{
> +	unsigned int n, max_pages;
> +
> +	max_pages = min_t(unsigned int,
> +			  __rounddown_pow_of_two(USHRT_MAX) / PAGE_SECTORS,
> +			  MAX_ORDER_NR_PAGES);
> +
> +	n = sb->bucket_size / PAGE_SECTORS;
> +	if (n > max_pages)
> +		n = max_pages;
> +
> +	return n;
> +}
> +
> +static inline unsigned int meta_bucket_bytes(struct cache_sb *sb)
> +{
> +	return meta_bucket_pages(sb) << PAGE_SHIFT;
> +}
> +
>   #define prios_per_bucket(c)				\
>   	((bucket_bytes(c) - sizeof(struct prio_set)) /	\
>   	 sizeof(struct bucket_disk))
I'm not particular happy with the division followed by a shift; might be 
an idea to replace the division by a shift.

But that's minor, so:

Reviewed-by: Hannes Reinecke <hare@suse.de>

Cheers,

Hannes
-- 
Dr. Hannes Reinecke            Teamlead Storage & Networking
hare@suse.de                               +49 911 74053 688
SUSE Software Solutions GmbH, Maxfeldstr. 5, 90409 Nürnberg
HRB 36809 (AG Nürnberg), Geschäftsführer: Felix Imendörffer

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [PATCH v3 09/16] bcache: handle c->uuids properly for bucket size > 8MB
  2020-07-15 14:30 ` [PATCH v3 09/16] bcache: handle c->uuids properly for bucket size > 8MB colyli
@ 2020-07-15 15:37   ` Hannes Reinecke
  0 siblings, 0 replies; 33+ messages in thread
From: Hannes Reinecke @ 2020-07-15 15:37 UTC (permalink / raw)
  To: colyli, linux-bcache; +Cc: linux-block

On 7/15/20 4:30 PM, colyli@suse.de wrote:
> From: Coly Li <colyli@suse.de>
> 
> Bcache allocates a whole bucket to store c->uuids on cache device, and
> allocates continuous pages to store it in-memory. When the bucket size
> exceeds maximum allocable continuous pages, bch_cache_set_alloc() will
> fail and cache device registration will fail.
> 
> This patch allocates c->uuids by alloc_meta_bucket_pages(), and uses
> ilog2(meta_bucket_pages(c)) to indicate order of c->uuids pages when
> free it. When writing c->uuids to cache device, its size is decided
> by meta_bucket_pages(c) * PAGE_SECTORS. Now c->uuids is properly handled
> for bucket size > 8MB.
> 
> Signed-off-by: Coly Li <colyli@suse.de>
> ---
>   drivers/md/bcache/super.c | 10 ++++++----
>   1 file changed, 6 insertions(+), 4 deletions(-)
> 
Reviewed-by: Hannes Reinecke <hare@suse.de>

Cheers,

Hannes
-- 
Dr. Hannes Reinecke            Teamlead Storage & Networking
hare@suse.de                               +49 911 74053 688
SUSE Software Solutions GmbH, Maxfeldstr. 5, 90409 Nürnberg
HRB 36809 (AG Nürnberg), Geschäftsführer: Felix Imendörffer

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [PATCH v3 10/16] bcache: handle cache prio_buckets and disk_buckets properly for bucket size > 8MB
  2020-07-15 14:30 ` [PATCH v3 10/16] bcache: handle cache prio_buckets and disk_buckets " colyli
@ 2020-07-15 15:38   ` Hannes Reinecke
  0 siblings, 0 replies; 33+ messages in thread
From: Hannes Reinecke @ 2020-07-15 15:38 UTC (permalink / raw)
  To: colyli, linux-bcache; +Cc: linux-block

On 7/15/20 4:30 PM, colyli@suse.de wrote:
> From: Coly Li <colyli@suse.de>
> 
> Similar to c->uuids, struct cache's prio_buckets and disk_buckets also
> have the potential memory allocation failure during cache registration
> if the bucket size > 8MB.
> 
> ca->prio_buckets can be stored on cache device in multiple buckets, its
> in-memory space is allocated by kzalloc() interface but normally
> allocated by alloc_pages() because the size > KMALLOC_MAX_CACHE_SIZE.
> 
> So allocation of ca->prio_buckets has the MAX_ORDER restriction too. If
> the bucket size > 8MB, by default the page allocator will fail because
> the page order > 11 (default MAX_ORDER value). ca->prio_buckets should
> also use meta_bucket_bytes(), meta_bucket_pages() to decide its memory
> size and use alloc_meta_bucket_pages() to allocate pages, to avoid the
> allocation failure during cache set registration when bucket size > 8MB.
> 
> ca->disk_buckets is a single bucket size memory buffer, it is used to
> iterate each bucket of ca->prio_buckets, and compose the bio based on
> memory of ca->disk_buckets, then write ca->disk_buckets memory to cache
> disk one-by-one for each bucket of ca->prio_buckets. ca->disk_buckets
> should have in-memory size exact to the meta_bucket_pages(), this is the
> size that ca->prio_buckets will be stored into each on-disk bucket.
> 
> This patch fixes the above issues and handle cache's prio_buckets and
> disk_buckets properly for bucket size larger than 8MB.
> 
> Signed-off-by: Coly Li <colyli@suse.de>
> ---
>   drivers/md/bcache/bcache.h |  9 +++++----
>   drivers/md/bcache/super.c  | 10 +++++-----
>   2 files changed, 10 insertions(+), 9 deletions(-)
> 
Reviewed-by: Hannes Reinecke <hare@suse.de>

Cheers,

Hannes
-- 
Dr. Hannes Reinecke            Teamlead Storage & Networking
hare@suse.de                               +49 911 74053 688
SUSE Software Solutions GmbH, Maxfeldstr. 5, 90409 Nürnberg
HRB 36809 (AG Nürnberg), Geschäftsführer: Felix Imendörffer

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [PATCH v3 08/16] bcache: introduce meta_bucket_pages() related helper routines
  2020-07-15 15:36   ` Hannes Reinecke
@ 2020-07-15 16:00     ` Coly Li
  0 siblings, 0 replies; 33+ messages in thread
From: Coly Li @ 2020-07-15 16:00 UTC (permalink / raw)
  To: Hannes Reinecke, linux-bcache; +Cc: linux-block

On 2020/7/15 23:36, Hannes Reinecke wrote:
> On 7/15/20 4:30 PM, colyli@suse.de wrote:
>> From: Coly Li <colyli@suse.de>
>>
>> Currently the in-memory meta data like c->uuids or c->disk_buckets
>> are allocated by alloc_bucket_pages(). The macro alloc_bucket_pages()
>> calls __get_free_pages() to allocated continuous pages with order
>> indicated by ilog2(bucket_pages(c)),
>>   #define alloc_bucket_pages(gfp, c)                      \
>>       ((void *) __get_free_pages(__GFP_ZERO|gfp, ilog2(bucket_pages(c))))
>>
>> The maximum order is defined as MAX_ORDER, the default value is 11 (and
>> can be overwritten by CONFIG_FORCE_MAX_ZONEORDER). In bcache code the
>> maximum bucket size width is 16bits, this is restricted both by KEY_SIZE
>> size and bucket_size size from struct cache_sb_disk. The maximum 16bits
>> width and power-of-2 value is (1<<15) in unit of sector (512byte). It
>> means the maximum value of bucket size in bytes is (1<<24) bytes a.k.a
>> 4096 pages.
>>
>> When the bucket size is set to maximum permitted value, ilog2(4096) is
>> 12, which exceeds the default maximum order __get_free_pages() can
>> accepted, the failed pages allocation will fail cache set registration
>> procedure and print a kernel oops message for the exceeded pages order.
>>
>> This patch introduces meta_bucket_pages(), meta_bucket_bytes(), and
>> alloc_bucket_pages() helper routines. meta_bucket_pages() indicates the
>> maximum pages can be allocated to meta data bucket, meta_bucket_bytes()
>> indicates the according maximum bytes, and alloc_bucket_pages() does
>> the pages allocation for meta bucket. Because meta_bucket_pages()
>> chooses the smaller value among the bucket size and MAX_ORDER_NR_PAGES,
>> it still works when MAX_ORDER overwritten by CONFIG_FORCE_MAX_ZONEORDER.
>>
>> Following patches will use these helper routines to decide maximum pages
>> can be allocated for different meta data buckets. If the bucket size is
>> larger than meta_bucket_bytes(), the bcache registration can continue to
>> success, just the space more than meta_bucket_bytes() inside the bucket
>> is wasted. Comparing bcache failed for large bucket size, wasting some
>> space for meta data buckets is acceptable at this moment.
>>
>> Signed-off-by: Coly Li <colyli@suse.de>
>> ---
>>   drivers/md/bcache/bcache.h | 20 ++++++++++++++++++++
>>   drivers/md/bcache/super.c  |  3 +++
>>   2 files changed, 23 insertions(+)
>>
>> diff --git a/drivers/md/bcache/bcache.h b/drivers/md/bcache/bcache.h
>> index 80e3c4813fb0..972f1aff0f70 100644
>> --- a/drivers/md/bcache/bcache.h
>> +++ b/drivers/md/bcache/bcache.h
>> @@ -762,6 +762,26 @@ struct bbio {
>>   #define bucket_bytes(c)        ((c)->sb.bucket_size << 9)
>>   #define block_bytes(c)        ((c)->sb.block_size << 9)
>>   +static inline unsigned int meta_bucket_pages(struct cache_sb *sb)
>> +{
>> +    unsigned int n, max_pages;
>> +
>> +    max_pages = min_t(unsigned int,
>> +              __rounddown_pow_of_two(USHRT_MAX) / PAGE_SECTORS,
>> +              MAX_ORDER_NR_PAGES);
>> +
>> +    n = sb->bucket_size / PAGE_SECTORS;
>> +    if (n > max_pages)
>> +        n = max_pages;
>> +
>> +    return n;
>> +}
>> +
>> +static inline unsigned int meta_bucket_bytes(struct cache_sb *sb)
>> +{
>> +    return meta_bucket_pages(sb) << PAGE_SHIFT;
>> +}
>> +
>>   #define prios_per_bucket(c)                \
>>       ((bucket_bytes(c) - sizeof(struct prio_set)) /    \
>>        sizeof(struct bucket_disk))
> I'm not particular happy with the division followed by a shift; might be
> an idea to replace the division by a shift.
> 
> But that's minor, so:
> 
> Reviewed-by: Hannes Reinecke <hare@suse.de>

This is from existing bcache code, let me think how to improve in future.

Thanks.

Coly Li

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [PATCH v3 07/16] bcache: struct cache_sb is only for in-memory super block now
  2020-07-15 14:30 ` [PATCH v3 07/16] bcache: struct cache_sb is only for in-memory super block now colyli
@ 2020-07-15 18:21   ` Christoph Hellwig
  2020-07-16  3:31     ` Coly Li
  0 siblings, 1 reply; 33+ messages in thread
From: Christoph Hellwig @ 2020-07-15 18:21 UTC (permalink / raw)
  To: colyli; +Cc: linux-bcache, linux-block, Hannes Reinecke

On Wed, Jul 15, 2020 at 10:30:06PM +0800, colyli@suse.de wrote:
> From: Coly Li <colyli@suse.de>
> 
> We have struct cache_sb_disk for on-disk super block already, it is
> unnecessary to keep the in-memory super block format exactly mapping
> to the on-disk struct layout.
> 
> This patch adds code comments to notice that struct cache_sb is not
> exactly mapping to cache_sb_disk, and removes the useless member csum
> and pad[5].
> 
> Although struct cache_sb does not belong to uapi, but there are still
> some on-disk format related macros reference it and it is unncessary to
> get rid of such dependency now. So struct cache_sb will continue to stay
> in include/uapi/linux/bache.h for now.
> 
> Signed-off-by: Coly Li <colyli@suse.de>
> Reviewed-by: Hannes Reinecke <hare@suse.de>

If you change this it really needs to move out of the uapi header.

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [PATCH v3 07/16] bcache: struct cache_sb is only for in-memory super block now
  2020-07-15 18:21   ` Christoph Hellwig
@ 2020-07-16  3:31     ` Coly Li
  0 siblings, 0 replies; 33+ messages in thread
From: Coly Li @ 2020-07-16  3:31 UTC (permalink / raw)
  To: Christoph Hellwig; +Cc: linux-bcache, linux-block, Hannes Reinecke

On 2020/7/16 02:21, Christoph Hellwig wrote:
> On Wed, Jul 15, 2020 at 10:30:06PM +0800, colyli@suse.de wrote:
>> From: Coly Li <colyli@suse.de>
>>
>> We have struct cache_sb_disk for on-disk super block already, it is
>> unnecessary to keep the in-memory super block format exactly mapping
>> to the on-disk struct layout.
>>
>> This patch adds code comments to notice that struct cache_sb is not
>> exactly mapping to cache_sb_disk, and removes the useless member csum
>> and pad[5].
>>
>> Although struct cache_sb does not belong to uapi, but there are still
>> some on-disk format related macros reference it and it is unncessary to
>> get rid of such dependency now. So struct cache_sb will continue to stay
>> in include/uapi/linux/bache.h for now.
>>
>> Signed-off-by: Coly Li <colyli@suse.de>
>> Reviewed-by: Hannes Reinecke <hare@suse.de>
> 
> If you change this it really needs to move out of the uapi header.
> 
I see and I tried. It needs more work to cleanup the dependence, it will
be finished later.

Thanks.

Coly Li


^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [PATCH v3 11/16] bcache: handle cache set verify_ondisk properly for bucket size > 8MB
  2020-07-15 14:30 ` [PATCH v3 11/16] bcache: handle cache set verify_ondisk " colyli
@ 2020-07-16  6:07   ` Hannes Reinecke
  0 siblings, 0 replies; 33+ messages in thread
From: Hannes Reinecke @ 2020-07-16  6:07 UTC (permalink / raw)
  To: colyli, linux-bcache; +Cc: linux-block

On 7/15/20 4:30 PM, colyli@suse.de wrote:
> From: Coly Li <colyli@suse.de>
> 
> In bch_btree_cache_alloc() when CONFIG_BCACHE_DEBUG is configured,
> allocate memory for c->verify_ondisk may fail if the bucket size > 8MB,
> which will require __get_free_pages() to allocate continuous pages
> with order > 11 (the default MAX_ORDER of Linux buddy allocator). Such
> over size allocation will fail, and cause 2 problems,
> - When CONFIG_BCACHE_DEBUG is configured,  bch_btree_verify() does not
>    work, because c->verify_ondisk is NULL and bch_btree_verify() returns
>    immediately.
> - bch_btree_cache_alloc() will fail due to c->verify_ondisk allocation
>    failed, then the whole cache device registration fails. And because of
>    this failure, the first problem of bch_btree_verify() has no chance to
>    be triggered.
> 
> This patch fixes the above problem by two means,
> 1) If pages allocation of c->verify_ondisk fails, set it to NULL and
>     returns bch_btree_cache_alloc() with -ENOMEM.
> 2) When calling __get_free_pages() to allocate c->verify_ondisk pages,
>     use ilog2(meta_bucket_pages(&c->sb)) to make sure ilog2() will always
>     generate a pages order <= MAX_ORDER (or CONFIG_FORCE_MAX_ZONEORDER).
>     Then the buddy system won't directly reject the allocation request.
> 
> Signed-off-by: Coly Li <colyli@suse.de>
> ---
>   drivers/md/bcache/btree.c | 12 ++++++++++--
>   1 file changed, 10 insertions(+), 2 deletions(-)
> 
Reviewed-by: Hannes Reinecke <hare@suse.de>

Cheers,

Hannes
-- 
Dr. Hannes Reinecke            Teamlead Storage & Networking
hare@suse.de                               +49 911 74053 688
SUSE Software Solutions GmbH, Maxfeldstr. 5, 90409 Nürnberg
HRB 36809 (AG Nürnberg), Geschäftsführer: Felix Imendörffer

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [PATCH v3 12/16] bcache: handle btree node memory allocation properly for bucket size > 8MB
  2020-07-15 14:30 ` [PATCH v3 12/16] bcache: handle btree node memory allocation " colyli
@ 2020-07-16  6:08   ` Hannes Reinecke
  0 siblings, 0 replies; 33+ messages in thread
From: Hannes Reinecke @ 2020-07-16  6:08 UTC (permalink / raw)
  To: colyli, linux-bcache; +Cc: linux-block

On 7/15/20 4:30 PM, colyli@suse.de wrote:
> From: Coly Li <colyli@suse.de>
> 
> Currently the bcache internal btree node occupies a whole bucket. When
> loading the btree node from cache device into memory, mca_data_alloc()
> will call bch_btree_keys_alloc() to allocate memory for the whole bucket
> size, ilog2(b->c->btree_pages) is send to bch_btree_keys_alloc() as the
> parameter 'page_order'.
> 
> c->btree_pages is set as bucket_pages() in bch_cache_set_alloc(), for
> bucket size > 8MB, ilog2(b->c->btree_pages) is 12 for 4KB page size. By
> default the maximum page order __get_free_pages() accepts is MAX_ORDER
> (11), in this condition bch_btree_keys_alloc() will always fail.
> 
> Because of other over-page-order allocation failure fails the cache
> device registration, such btree node allocation failure wasn't observed
> during runtime. After other blocking page allocation failures for bucket
> size > 8MB, this btree node allocation issue may trigger potentical risk
> e.g. infinite dead-loop to retry btree node allocation after failure.
> 
> This patch fixes the potential problem by setting c->btree_pages to
> meta_bucket_pages() in bch_cache_set_alloc(). In the condition that
> bucket size > 8MB, meta_bucket_pages() will always return a number which
> won't exceed the maximum page order of the buddy allocator.
> 
> Signed-off-by: Coly Li <colyli@suse.de>
> ---
>   drivers/md/bcache/super.c | 2 +-
>   1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/drivers/md/bcache/super.c b/drivers/md/bcache/super.c
> index b1d8388ef57d..02901d0ae8e2 100644
> --- a/drivers/md/bcache/super.c
> +++ b/drivers/md/bcache/super.c
> @@ -1867,7 +1867,7 @@ struct cache_set *bch_cache_set_alloc(struct cache_sb *sb)
>   	c->nr_uuids		= meta_bucket_bytes(&c->sb) / sizeof(struct uuid_entry);
>   	c->devices_max_used	= 0;
>   	atomic_set(&c->attached_dev_nr, 0);
> -	c->btree_pages		= bucket_pages(c);
> +	c->btree_pages		= meta_bucket_pages(&c->sb);
>   	if (c->btree_pages > BTREE_MAX_PAGES)
>   		c->btree_pages = max_t(int, c->btree_pages / 4,
>   				       BTREE_MAX_PAGES);
> 
Reviewed-by: Hannes Reinecke <hare@suse.de>

Cheers,

Hannes
-- 
Dr. Hannes Reinecke            Teamlead Storage & Networking
hare@suse.de                               +49 911 74053 688
SUSE Software Solutions GmbH, Maxfeldstr. 5, 90409 Nürnberg
HRB 36809 (AG Nürnberg), Geschäftsführer: Felix Imendörffer

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [PATCH v3 13/16] bcache: add bucket_size_hi into struct cache_sb_disk for large bucket
  2020-07-15 14:30 ` [PATCH v3 13/16] bcache: add bucket_size_hi into struct cache_sb_disk for large bucket colyli
@ 2020-07-16  6:15   ` Hannes Reinecke
  2020-07-16  6:41     ` Coly Li
  0 siblings, 1 reply; 33+ messages in thread
From: Hannes Reinecke @ 2020-07-16  6:15 UTC (permalink / raw)
  To: colyli, linux-bcache; +Cc: linux-block

On 7/15/20 4:30 PM, colyli@suse.de wrote:
> From: Coly Li <colyli@suse.de>
> 
> The large bucket feature is to extend bucket_size from 16bit to 32bit.
> 
> When create cache device on zoned device (e.g. zoned NVMe SSD), making
> a single bucket cover one or more zones of the zoned device is the
> simplest way to support zoned device as cache by bcache.
> 
> But current maximum bucket size is 16MB and a typical zone size of zoned
> device is 256MB, this is the major motiviation to extend bucket size to
> a larger bit width.
> 
> This patch is the basic and first change to support large bucket size,
> the major changes it makes are,
> - Add BCH_FEATURE_INCOMPAT_LARGE_BUCKET for the large bucket feature,
>    INCOMPAT means it introduces incompatible on-disk format change.
> - Add BCH_FEATURE_INCOMPAT_FUNCS(large_bucket, LARGE_BUCKET) routines.
> - Adds __le32 bucket_size_hi into struct cache_sb_disk at offset 0x8d0
>    for the on-disk super block format.
> - For the in-memory super block struct cache_sb, member bucket_size is
>    extended from __u16 to __32.
> - Add get_bucket_size() to combine the bucket_size and bucket_size_hi
>    from struct cache_sb_disk into an unsigned int value.
> 
> Since we already have large bucket size helpers meta_bucket_pages(),
> meta_bucket_bytes() and alloc_meta_bucket_pages(), they make sure when
> bucket size > 8MB, the memory allocation for bcache meta data bucket
> won't fail no matter how large the bucket size extended. So these meta
> data buckets are handled properly when the bucket size width increase
> from 16bit to 32bit, we don't need to worry about them.
> 
> Signed-off-by: Coly Li <colyli@suse.de>
> ---
>   drivers/md/bcache/alloc.c    |  2 +-
>   drivers/md/bcache/features.c | 22 ++++++++++++++++++++++
>   drivers/md/bcache/features.h |  9 ++++++---
>   drivers/md/bcache/movinggc.c |  4 ++--
>   drivers/md/bcache/super.c    | 23 +++++++++++++++++++----
>   include/uapi/linux/bcache.h  |  3 ++-
>   6 files changed, 52 insertions(+), 11 deletions(-)
>   create mode 100644 drivers/md/bcache/features.c
> 
> diff --git a/drivers/md/bcache/alloc.c b/drivers/md/bcache/alloc.c
> index a1df0d95151c..52035a78d836 100644
> --- a/drivers/md/bcache/alloc.c
> +++ b/drivers/md/bcache/alloc.c
> @@ -87,7 +87,7 @@ void bch_rescale_priorities(struct cache_set *c, int sectors)
>   {
>   	struct cache *ca;
>   	struct bucket *b;
> -	unsigned int next = c->nbuckets * c->sb.bucket_size / 1024;
> +	unsigned long next = c->nbuckets * c->sb.bucket_size / 1024;
>   	unsigned int i;
>   	int r;
>   
> diff --git a/drivers/md/bcache/features.c b/drivers/md/bcache/features.c
> new file mode 100644
> index 000000000000..ba53944bb390
> --- /dev/null
> +++ b/drivers/md/bcache/features.c
> @@ -0,0 +1,22 @@
> +// SPDX-License-Identifier: GPL-2.0
> +/*
> + * Feature set bits and string conversion.
> + * Inspired by ext4's features compat/incompat/ro_compat related code.
> + *
> + * Copyright 2020 Coly Li <colyli@suse.de>
> + *
> + */
> +#include <linux/bcache.h>
> +#include "bcache.h"
> +
> +struct feature {
> +	int		compat;
> +	unsigned int	mask;
> +	const char	*string;
> +};
> +
> +static struct feature feature_list[] = {
> +	{BCH_FEATURE_INCOMPAT, BCH_FEATURE_INCOMPAT_LARGE_BUCKET,
> +		"large_bucket"},
> +	{0, 0, 0 },
> +};
> diff --git a/drivers/md/bcache/features.h b/drivers/md/bcache/features.h
> index ae7df37b9862..dca052cf5203 100644
> --- a/drivers/md/bcache/features.h
> +++ b/drivers/md/bcache/features.h
> @@ -11,9 +11,13 @@
>   #define BCH_FEATURE_INCOMPAT		2
>   #define BCH_FEATURE_TYPE_MASK		0x03
>   
> +/* Feature set definition */
> +/* Incompat feature set */
> +#define BCH_FEATURE_INCOMPAT_LARGE_BUCKET	0x0001 /* 32bit bucket size */
> +
>   #define BCH_FEATURE_COMPAT_SUUP		0
>   #define BCH_FEATURE_RO_COMPAT_SUUP	0
> -#define BCH_FEATURE_INCOMPAT_SUUP	0
> +#define BCH_FEATURE_INCOMPAT_SUUP	BCH_FEATURE_INCOMPAT_LARGE_BUCKET
>   
>   #define BCH_HAS_COMPAT_FEATURE(sb, mask) \
>   		((sb)->feature_compat & (mask))
> @@ -22,8 +26,6 @@
>   #define BCH_HAS_INCOMPAT_FEATURE(sb, mask) \
>   		((sb)->feature_incompat & (mask))
>   
> -/* Feature set definition */
> -
>   #define BCH_FEATURE_COMPAT_FUNCS(name, flagname) \
>   static inline int bch_has_feature_##name(struct cache_sb *sb) \
>   { \
> @@ -75,4 +77,5 @@ static inline void bch_clear_feature_##name(struct cache_sb *sb) \
>   		~BCH##_FEATURE_INCOMPAT_##flagname; \
>   }
>   
> +BCH_FEATURE_INCOMPAT_FUNCS(large_bucket, LARGE_BUCKET);
>   #endif
> diff --git a/drivers/md/bcache/movinggc.c b/drivers/md/bcache/movinggc.c
> index b7dd2d75f58c..5872d6470470 100644
> --- a/drivers/md/bcache/movinggc.c
> +++ b/drivers/md/bcache/movinggc.c
> @@ -206,8 +206,8 @@ void bch_moving_gc(struct cache_set *c)
>   	mutex_lock(&c->bucket_lock);
>   
>   	for_each_cache(ca, c, i) {
> -		unsigned int sectors_to_move = 0;
> -		unsigned int reserve_sectors = ca->sb.bucket_size *
> +		unsigned long sectors_to_move = 0;
> +		unsigned long reserve_sectors = ca->sb.bucket_size *
>   			     fifo_used(&ca->free[RESERVE_MOVINGGC]);
>   
>   		ca->heap.used = 0;
> diff --git a/drivers/md/bcache/super.c b/drivers/md/bcache/super.c
> index 02901d0ae8e2..e0da52f8e8c9 100644
> --- a/drivers/md/bcache/super.c
> +++ b/drivers/md/bcache/super.c
> @@ -60,6 +60,17 @@ struct workqueue_struct *bch_journal_wq;
>   
>   /* Superblock */
>   
> +static unsigned int get_bucket_size(struct cache_sb *sb, struct cache_sb_disk *s)
> +{
> +	unsigned int bucket_size = le16_to_cpu(s->bucket_size);
> +
> +	if (sb->version >= BCACHE_SB_VERSION_CDEV_WITH_FEATURES &&
> +	     bch_has_feature_large_bucket(sb))
> +		bucket_size |= le32_to_cpu(s->bucket_size_hi) << 16;
> +
> +	return bucket_size;
> +}
> +
>   static const char *read_super_common(struct cache_sb *sb,  struct block_device *bdev,
>   				     struct cache_sb_disk *s)
>   {
That is a bit unfortunate; bucket_size_hi is 32 bits, so we might end up 
with an overflow here if bucket_size_hi is larger that USHRT_MAX.

So to avoid overflow either make bucket_size_hi 16 bit, too, or define 
this feature such that the original bucket_size field is ignored and 
just the new size field is used.

Cheers,

Hannes
-- 
Dr. Hannes Reinecke            Teamlead Storage & Networking
hare@suse.de                               +49 911 74053 688
SUSE Software Solutions GmbH, Maxfeldstr. 5, 90409 Nürnberg
HRB 36809 (AG Nürnberg), Geschäftsführer: Felix Imendörffer

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [PATCH v3 14/16] bcache: add sysfs file to display feature sets information of cache set
  2020-07-15 14:30 ` [PATCH v3 14/16] bcache: add sysfs file to display feature sets information of cache set colyli
@ 2020-07-16  6:17   ` Hannes Reinecke
  2020-07-16  6:20     ` Coly Li
  0 siblings, 1 reply; 33+ messages in thread
From: Hannes Reinecke @ 2020-07-16  6:17 UTC (permalink / raw)
  To: colyli, linux-bcache; +Cc: linux-block

On 7/15/20 4:30 PM, colyli@suse.de wrote:
> From: Coly Li <colyli@suse.de>
> 
> A new sysfs file /sys/fs/bcache/<cache set UUID>/internal/feature_sets
> is added by this patch, to display feature sets information of the cache
> set.
> 
> Now only an incompat feature 'large_bucket' added in bcache, the sysfs
> file content is:
> 	feature_incompat: [large_bucket]
> string large_bucket means the running bcache drive supports incompat
> feature 'large_bucket', the wrapping [] means the 'large_bucket' feature
> is currently enabled on this cache set.
> 
> This patch is ready to display compat and ro_compat features, in future
> once bcache code implements such feature sets, the according feature
> strings will be displayed in this sysfs file too.
> 
> Signed-off-by: Coly Li <colyli@suse.de>
> ---
>   drivers/md/bcache/Makefile   |  2 +-
>   drivers/md/bcache/features.c | 48 ++++++++++++++++++++++++++++++++++++
>   drivers/md/bcache/features.h |  3 +++
>   drivers/md/bcache/sysfs.c    |  6 +++++
>   4 files changed, 58 insertions(+), 1 deletion(-)
> 
> diff --git a/drivers/md/bcache/Makefile b/drivers/md/bcache/Makefile
> index fd714628da6a..5b87e59676b8 100644
> --- a/drivers/md/bcache/Makefile
> +++ b/drivers/md/bcache/Makefile
> @@ -4,4 +4,4 @@ obj-$(CONFIG_BCACHE)	+= bcache.o
>   
>   bcache-y		:= alloc.o bset.o btree.o closure.o debug.o extents.o\
>   	io.o journal.o movinggc.o request.o stats.o super.o sysfs.o trace.o\
> -	util.o writeback.o
> +	util.o writeback.o features.o
> diff --git a/drivers/md/bcache/features.c b/drivers/md/bcache/features.c
> index ba53944bb390..5c601635e11c 100644
> --- a/drivers/md/bcache/features.c
> +++ b/drivers/md/bcache/features.c
> @@ -8,6 +8,7 @@
>    */
>   #include <linux/bcache.h>
>   #include "bcache.h"
> +#include "features.h"
>   
>   struct feature {
>   	int		compat;
> @@ -20,3 +21,50 @@ static struct feature feature_list[] = {
>   		"large_bucket"},
>   	{0, 0, 0 },
>   };
> +
> +#define compose_feature_string(type, head)				\
> +({									\
> +	struct feature *f;						\
> +	bool first = true;						\
> +									\
> +	for (f = &feature_list[0]; f->compat != 0; f++) {		\
> +		if (f->compat != BCH_FEATURE_ ## type)			\
> +			continue;					\
> +		if (BCH_HAS_ ## type ## _FEATURE(&c->sb, f->mask)) {	\
> +			if (first) {					\
> +				out += snprintf(out, buf + size - out,	\
> +						"%s%s", head, ": [");	\
> +			} else {					\
> +				out += snprintf(out, buf + size - out,	\
> +						" [");			\
> +			}						\
> +		} else {						\
> +			if (first)					\
> +				out += snprintf(out, buf + size - out,	\
> +						"%s%s", head, ": ");	\
> +			else						\
> +				out += snprintf(out, buf + size - out,	\
> +						" ");			\
> +		}							\
> +									\
> +		out += snprintf(out, buf + size - out, "%s", f->string);\
> +									\
> +		if (BCH_HAS_ ## type ## _FEATURE(&c->sb, f->mask))	\
> +			out += snprintf(out, buf + size - out, "]");	\
> +									\
> +		first = false;						\
> +	}								\
> +	if (!first)							\
> +		out += snprintf(out, buf + size - out, "\n");		\
> +})
> +
> +int bch_print_cache_set_feature_set(struct cache_set *c, char *buf, int size)
> +{
> +	char *out = buf;
> +
> +	compose_feature_string(COMPAT, "feature_compat");
> +	compose_feature_string(RO_COMPAT, "feature_ro_compat");
> +	compose_feature_string(INCOMPAT, "feature_incompat");
> +
> +	return out - buf;
> +}
> diff --git a/drivers/md/bcache/features.h b/drivers/md/bcache/features.h
> index dca052cf5203..350a4f413136 100644
> --- a/drivers/md/bcache/features.h
> +++ b/drivers/md/bcache/features.h
> @@ -78,4 +78,7 @@ static inline void bch_clear_feature_##name(struct cache_sb *sb) \
>   }
>   
>   BCH_FEATURE_INCOMPAT_FUNCS(large_bucket, LARGE_BUCKET);
> +
> +int bch_print_cache_set_feature_set(struct cache_set *c, char *buf, int size);
> +
>   #endif
> diff --git a/drivers/md/bcache/sysfs.c b/drivers/md/bcache/sysfs.c
> index 0dadec5a78f6..e3633f06d43b 100644
> --- a/drivers/md/bcache/sysfs.c
> +++ b/drivers/md/bcache/sysfs.c
> @@ -11,6 +11,7 @@
>   #include "btree.h"
>   #include "request.h"
>   #include "writeback.h"
> +#include "features.h"
>   
>   #include <linux/blkdev.h>
>   #include <linux/sort.h>
> @@ -88,6 +89,7 @@ read_attribute(btree_used_percent);
>   read_attribute(average_key_size);
>   read_attribute(dirty_data);
>   read_attribute(bset_tree_stats);
> +read_attribute(feature_sets);
>   
>   read_attribute(state);
>   read_attribute(cache_read_races);
> @@ -779,6 +781,9 @@ SHOW(__bch_cache_set)
>   	if (attr == &sysfs_bset_tree_stats)
>   		return bch_bset_print_stats(c, buf);
>   
> +	if (attr == &sysfs_feature_sets)
> +		return bch_print_cache_set_feature_set(c, buf, PAGE_SIZE);
> +
>   	return 0;
>   }
>   SHOW_LOCKED(bch_cache_set)
> @@ -987,6 +992,7 @@ static struct attribute *bch_cache_set_internal_files[] = {
>   	&sysfs_io_disable,
>   	&sysfs_cutoff_writeback,
>   	&sysfs_cutoff_writeback_sync,
> +	&sysfs_feature_sets,
>   	NULL
>   };
>   KTYPE(bch_cache_set_internal);
> 
Tsk.

sysfs attributes should be one value only, not some string declaring 
several things at once.
Why not adding two attributes (features_compat and features_incompat), 
listing each feature?
That would even easier to implement.

Cheers,

Hannes
-- 
Dr. Hannes Reinecke            Teamlead Storage & Networking
hare@suse.de                               +49 911 74053 688
SUSE Software Solutions GmbH, Maxfeldstr. 5, 90409 Nürnberg
HRB 36809 (AG Nürnberg), Geschäftsführer: Felix Imendörffer

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [PATCH v3 15/16] bcache: avoid extra memory allocation from mempool c->fill_iter
  2020-07-15 14:30 ` [PATCH v3 15/16] bcache: avoid extra memory allocation from mempool c->fill_iter colyli
@ 2020-07-16  6:18   ` Hannes Reinecke
  0 siblings, 0 replies; 33+ messages in thread
From: Hannes Reinecke @ 2020-07-16  6:18 UTC (permalink / raw)
  To: colyli, linux-bcache; +Cc: linux-block

On 7/15/20 4:30 PM, colyli@suse.de wrote:
> From: Coly Li <colyli@suse.de>
> 
> Mempool c->fill_iter is used to allocate memory for struct btree_iter in
> bch_btree_node_read_done() to iterate all keys of a read-in btree node.
> 
> The allocation size is defined in bch_cache_set_alloc() by,
>    mempool_init_kmalloc_pool(&c->fill_iter, 1, iter_size))
> where iter_size is defined by a calculation,
>    (sb->bucket_size / sb->block_size + 1) * sizeof(struct btree_iter_set)
> 
> For 16bit width bucket_size the calculation is OK, but now the bucket
> size is extended to 32bit, the bucket size can be 2GB. By the above
> calculation, iter_size can be 2048 pages (order 11 is still accepted by
> buddy allocator).
> 
> But the actual size holds the bkeys in meta data bucket is limited to
> meta_bucket_pages() already, which is 16MB. By the above calculation,
> if replace sb->bucket_size by meta_bucket_pages() * PAGE_SECTORS, the
> result is 16 pages. This is the size large enough for the mempool
> allocation to struct btree_iter.
> 
> Therefore in worst case every time mempool c->fill_iter allocates, at
> most 4080 pages are wasted and won't be used. Therefore this patch uses
> meta_bucket_pages() * PAGE_SECTORS to calculate the iter size in
> bch_cache_set_alloc(), to avoid extra memory allocation from mempool
> c->fill_iter.
> 
> Signed-off-by: Coly Li <colyli@suse.de>
> ---
>   drivers/md/bcache/super.c | 2 +-
>   1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/drivers/md/bcache/super.c b/drivers/md/bcache/super.c
> index e0da52f8e8c9..90494c7dead8 100644
> --- a/drivers/md/bcache/super.c
> +++ b/drivers/md/bcache/super.c
> @@ -1908,7 +1908,7 @@ struct cache_set *bch_cache_set_alloc(struct cache_sb *sb)
>   	INIT_LIST_HEAD(&c->btree_cache_freed);
>   	INIT_LIST_HEAD(&c->data_buckets);
>   
> -	iter_size = (sb->bucket_size / sb->block_size + 1) *
> +	iter_size = ((meta_bucket_pages(sb) * PAGE_SECTORS) / sb->block_size + 1) *
>   		sizeof(struct btree_iter_set);
>   
>   	c->devices = kcalloc(c->nr_uuids, sizeof(void *), GFP_KERNEL);
> 
Reviewed-by: Hannes Reinecke <hare@suse.de>

Cheers,

Hannes
-- 
Dr. Hannes Reinecke            Teamlead Storage & Networking
hare@suse.de                               +49 911 74053 688
SUSE Software Solutions GmbH, Maxfeldstr. 5, 90409 Nürnberg
HRB 36809 (AG Nürnberg), Geschäftsführer: Felix Imendörffer

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [PATCH v3 16/16] bcache: avoid extra memory consumption in struct bbio for large bucket size
  2020-07-15 14:30 ` [PATCH v3 16/16] bcache: avoid extra memory consumption in struct bbio for large bucket size colyli
@ 2020-07-16  6:18   ` Hannes Reinecke
  0 siblings, 0 replies; 33+ messages in thread
From: Hannes Reinecke @ 2020-07-16  6:18 UTC (permalink / raw)
  To: colyli, linux-bcache; +Cc: linux-block

On 7/15/20 4:30 PM, colyli@suse.de wrote:
> From: Coly Li <colyli@suse.de>
> 
> Bcache uses struct bbio to do I/Os for meta data pages like uuids,
> disk_buckets, prio_buckets, and btree nodes.
> 
> Example writing a btree node onto cache device, the process is,
> - Allocate a struct bbio from mempool c->bio_meta.
> - Inside struct bbio embedded a struct bio, initialize bi_inline_vecs
>    for this embedded bio.
> - Call bch_bio_map() to map each meta data page to each bv from the
>    inlined  bi_io_vec table.
> - Call bch_submit_bbio() to submit the bio into underlying block layer.
> - When the I/O completed, only release the struct bbio, don't touch the
>    reference counter of the meta data pages.
> 
> The struct bbio is defined as,
> 738 struct bbio {
> 739     unsigned int            submit_time_us;
> 	[snipped]
> 748     struct bio              bio;
> 749 };
> 
> Because struct bio is embedded at the end of struct bbio, therefore the
> actual size of struct bbio is sizeof(struct bio) + size of the embedded
> bio->bi_inline_vecs.
> 
> Now all the meta data bucket size are limited to meta_bucket_pages(), if
> the bucket size is large than meta_bucket_pages()*PAGE_SECTORS, rested
> space in the bucket is unused. Therefore the most used space in meta
> bucket is (1<<MAX_ORDER) pages, or (1<<CONFIG_FORCE_MAX_ZONEORDER) if it
> is configured.
> 
> Therefore for large bucket size, it is unnecessary to calculate the
> allocation size of mempool c->bio_meta as,
> 	mempool_init_kmalloc_pool(&c->bio_meta, 2,
> 			sizeof(struct bbio) +
> 			sizeof(struct bio_vec) * bucket_pages(c))
> It is too large, neither the Linux buddy allocator cannot allocate so
> much continuous pages, nor the extra allocated pages are wasted.
> 
> This patch replace bucket_pages() to meta_bucket_pages() in two places,
> - In bch_cache_set_alloc(), when initialize mempool c->bio_meta, uses
>    sizeof(struct bbio) + sizeof(struct bio_vec) * bucket_pages(c) to set
>    the allocating object size.
> - In bch_bbio_alloc(), when calling bio_init() to set inline bvec talbe
>    bi_inline_bvecs, uses meta_bucket_pages() to indicate number of the
>    inline bio vencs number.
> 
> Now the maximum size of embedded bio inside struct bbio exactly matches
> the limit of meta_bucket_pages(), no extra page wasted.
> 
> Signed-off-by: Coly Li <colyli@suse.de>
> ---
>   drivers/md/bcache/io.c    | 2 +-
>   drivers/md/bcache/super.c | 2 +-
>   2 files changed, 2 insertions(+), 2 deletions(-)
> 
> diff --git a/drivers/md/bcache/io.c b/drivers/md/bcache/io.c
> index b25ee33b0d0b..a14a445618b4 100644
> --- a/drivers/md/bcache/io.c
> +++ b/drivers/md/bcache/io.c
> @@ -26,7 +26,7 @@ struct bio *bch_bbio_alloc(struct cache_set *c)
>   	struct bbio *b = mempool_alloc(&c->bio_meta, GFP_NOIO);
>   	struct bio *bio = &b->bio;
>   
> -	bio_init(bio, bio->bi_inline_vecs, bucket_pages(c));
> +	bio_init(bio, bio->bi_inline_vecs, meta_bucket_pages(&c->sb));
>   
>   	return bio;
>   }
> diff --git a/drivers/md/bcache/super.c b/drivers/md/bcache/super.c
> index 90494c7dead8..cade3f09661d 100644
> --- a/drivers/md/bcache/super.c
> +++ b/drivers/md/bcache/super.c
> @@ -1920,7 +1920,7 @@ struct cache_set *bch_cache_set_alloc(struct cache_sb *sb)
>   
>   	if (mempool_init_kmalloc_pool(&c->bio_meta, 2,
>   			sizeof(struct bbio) +
> -			sizeof(struct bio_vec) * bucket_pages(c)))
> +			sizeof(struct bio_vec) * meta_bucket_pages(&c->sb)))
>   		goto err;
>   
>   	if (mempool_init_kmalloc_pool(&c->fill_iter, 1, iter_size))
> 
Reviewed-by: Hannes Reinecke <hare@suse.de>

Cheers,

Hannes
-- 
Dr. Hannes Reinecke            Teamlead Storage & Networking
hare@suse.de                               +49 911 74053 688
SUSE Software Solutions GmbH, Maxfeldstr. 5, 90409 Nürnberg
HRB 36809 (AG Nürnberg), Geschäftsführer: Felix Imendörffer

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [PATCH v3 14/16] bcache: add sysfs file to display feature sets information of cache set
  2020-07-16  6:17   ` Hannes Reinecke
@ 2020-07-16  6:20     ` Coly Li
  0 siblings, 0 replies; 33+ messages in thread
From: Coly Li @ 2020-07-16  6:20 UTC (permalink / raw)
  To: Hannes Reinecke, linux-bcache; +Cc: linux-block

On 2020/7/16 14:17, Hannes Reinecke wrote:
> On 7/15/20 4:30 PM, colyli@suse.de wrote:
>> From: Coly Li <colyli@suse.de>
>>
>> A new sysfs file /sys/fs/bcache/<cache set UUID>/internal/feature_sets
>> is added by this patch, to display feature sets information of the cache
>> set.
>>
>> Now only an incompat feature 'large_bucket' added in bcache, the sysfs
>> file content is:
>>     feature_incompat: [large_bucket]
>> string large_bucket means the running bcache drive supports incompat
>> feature 'large_bucket', the wrapping [] means the 'large_bucket' feature
>> is currently enabled on this cache set.
>>
>> This patch is ready to display compat and ro_compat features, in future
>> once bcache code implements such feature sets, the according feature
>> strings will be displayed in this sysfs file too.
>>
>> Signed-off-by: Coly Li <colyli@suse.de>
>> ---
>>   drivers/md/bcache/Makefile   |  2 +-
>>   drivers/md/bcache/features.c | 48 ++++++++++++++++++++++++++++++++++++
>>   drivers/md/bcache/features.h |  3 +++
>>   drivers/md/bcache/sysfs.c    |  6 +++++
>>   4 files changed, 58 insertions(+), 1 deletion(-)
>>
>> diff --git a/drivers/md/bcache/Makefile b/drivers/md/bcache/Makefile
>> index fd714628da6a..5b87e59676b8 100644
>> --- a/drivers/md/bcache/Makefile
>> +++ b/drivers/md/bcache/Makefile
>> @@ -4,4 +4,4 @@ obj-$(CONFIG_BCACHE)    += bcache.o
>>     bcache-y        := alloc.o bset.o btree.o closure.o debug.o
>> extents.o\
>>       io.o journal.o movinggc.o request.o stats.o super.o sysfs.o
>> trace.o\
>> -    util.o writeback.o
>> +    util.o writeback.o features.o
>> diff --git a/drivers/md/bcache/features.c b/drivers/md/bcache/features.c
>> index ba53944bb390..5c601635e11c 100644
>> --- a/drivers/md/bcache/features.c
>> +++ b/drivers/md/bcache/features.c
>> @@ -8,6 +8,7 @@
>>    */
>>   #include <linux/bcache.h>
>>   #include "bcache.h"
>> +#include "features.h"
>>     struct feature {
>>       int        compat;
>> @@ -20,3 +21,50 @@ static struct feature feature_list[] = {
>>           "large_bucket"},
>>       {0, 0, 0 },
>>   };
>> +
>> +#define compose_feature_string(type, head)                \
>> +({                                    \
>> +    struct feature *f;                        \
>> +    bool first = true;                        \
>> +                                    \
>> +    for (f = &feature_list[0]; f->compat != 0; f++) {        \
>> +        if (f->compat != BCH_FEATURE_ ## type)            \
>> +            continue;                    \
>> +        if (BCH_HAS_ ## type ## _FEATURE(&c->sb, f->mask)) {    \
>> +            if (first) {                    \
>> +                out += snprintf(out, buf + size - out,    \
>> +                        "%s%s", head, ": [");    \
>> +            } else {                    \
>> +                out += snprintf(out, buf + size - out,    \
>> +                        " [");            \
>> +            }                        \
>> +        } else {                        \
>> +            if (first)                    \
>> +                out += snprintf(out, buf + size - out,    \
>> +                        "%s%s", head, ": ");    \
>> +            else                        \
>> +                out += snprintf(out, buf + size - out,    \
>> +                        " ");            \
>> +        }                            \
>> +                                    \
>> +        out += snprintf(out, buf + size - out, "%s", f->string);\
>> +                                    \
>> +        if (BCH_HAS_ ## type ## _FEATURE(&c->sb, f->mask))    \
>> +            out += snprintf(out, buf + size - out, "]");    \
>> +                                    \
>> +        first = false;                        \
>> +    }                                \
>> +    if (!first)                            \
>> +        out += snprintf(out, buf + size - out, "\n");        \
>> +})
>> +
>> +int bch_print_cache_set_feature_set(struct cache_set *c, char *buf,
>> int size)
>> +{
>> +    char *out = buf;
>> +
>> +    compose_feature_string(COMPAT, "feature_compat");
>> +    compose_feature_string(RO_COMPAT, "feature_ro_compat");
>> +    compose_feature_string(INCOMPAT, "feature_incompat");
>> +
>> +    return out - buf;
>> +}
>> diff --git a/drivers/md/bcache/features.h b/drivers/md/bcache/features.h
>> index dca052cf5203..350a4f413136 100644
>> --- a/drivers/md/bcache/features.h
>> +++ b/drivers/md/bcache/features.h
>> @@ -78,4 +78,7 @@ static inline void bch_clear_feature_##name(struct
>> cache_sb *sb) \
>>   }
>>     BCH_FEATURE_INCOMPAT_FUNCS(large_bucket, LARGE_BUCKET);
>> +
>> +int bch_print_cache_set_feature_set(struct cache_set *c, char *buf,
>> int size);
>> +
>>   #endif
>> diff --git a/drivers/md/bcache/sysfs.c b/drivers/md/bcache/sysfs.c
>> index 0dadec5a78f6..e3633f06d43b 100644
>> --- a/drivers/md/bcache/sysfs.c
>> +++ b/drivers/md/bcache/sysfs.c
>> @@ -11,6 +11,7 @@
>>   #include "btree.h"
>>   #include "request.h"
>>   #include "writeback.h"
>> +#include "features.h"
>>     #include <linux/blkdev.h>
>>   #include <linux/sort.h>
>> @@ -88,6 +89,7 @@ read_attribute(btree_used_percent);
>>   read_attribute(average_key_size);
>>   read_attribute(dirty_data);
>>   read_attribute(bset_tree_stats);
>> +read_attribute(feature_sets);
>>     read_attribute(state);
>>   read_attribute(cache_read_races);
>> @@ -779,6 +781,9 @@ SHOW(__bch_cache_set)
>>       if (attr == &sysfs_bset_tree_stats)
>>           return bch_bset_print_stats(c, buf);
>>   +    if (attr == &sysfs_feature_sets)
>> +        return bch_print_cache_set_feature_set(c, buf, PAGE_SIZE);
>> +
>>       return 0;
>>   }
>>   SHOW_LOCKED(bch_cache_set)
>> @@ -987,6 +992,7 @@ static struct attribute
>> *bch_cache_set_internal_files[] = {
>>       &sysfs_io_disable,
>>       &sysfs_cutoff_writeback,
>>       &sysfs_cutoff_writeback_sync,
>> +    &sysfs_feature_sets,
>>       NULL
>>   };
>>   KTYPE(bch_cache_set_internal);
>>
> Tsk.
> 
> sysfs attributes should be one value only, not some string declaring
> several things at once.
> Why not adding two attributes (features_compat and features_incompat),
> listing each feature?
> That would even easier to implement.

Copied, will fix it in next version. Thanks.

Coly Li



^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [PATCH v3 13/16] bcache: add bucket_size_hi into struct cache_sb_disk for large bucket
  2020-07-16  6:15   ` Hannes Reinecke
@ 2020-07-16  6:41     ` Coly Li
  2020-07-16  7:02       ` Hannes Reinecke
  0 siblings, 1 reply; 33+ messages in thread
From: Coly Li @ 2020-07-16  6:41 UTC (permalink / raw)
  To: Hannes Reinecke, linux-bcache; +Cc: linux-block

On 2020/7/16 14:15, Hannes Reinecke wrote:
> On 7/15/20 4:30 PM, colyli@suse.de wrote:
>> From: Coly Li <colyli@suse.de>
>>
>> The large bucket feature is to extend bucket_size from 16bit to 32bit.
>>
>> When create cache device on zoned device (e.g. zoned NVMe SSD), making
>> a single bucket cover one or more zones of the zoned device is the
>> simplest way to support zoned device as cache by bcache.
>>
>> But current maximum bucket size is 16MB and a typical zone size of zoned
>> device is 256MB, this is the major motiviation to extend bucket size to
>> a larger bit width.
>>
>> This patch is the basic and first change to support large bucket size,
>> the major changes it makes are,
>> - Add BCH_FEATURE_INCOMPAT_LARGE_BUCKET for the large bucket feature,
>>    INCOMPAT means it introduces incompatible on-disk format change.
>> - Add BCH_FEATURE_INCOMPAT_FUNCS(large_bucket, LARGE_BUCKET) routines.
>> - Adds __le32 bucket_size_hi into struct cache_sb_disk at offset 0x8d0
>>    for the on-disk super block format.
>> - For the in-memory super block struct cache_sb, member bucket_size is
>>    extended from __u16 to __32.
>> - Add get_bucket_size() to combine the bucket_size and bucket_size_hi
>>    from struct cache_sb_disk into an unsigned int value.
>>
>> Since we already have large bucket size helpers meta_bucket_pages(),
>> meta_bucket_bytes() and alloc_meta_bucket_pages(), they make sure when
>> bucket size > 8MB, the memory allocation for bcache meta data bucket
>> won't fail no matter how large the bucket size extended. So these meta
>> data buckets are handled properly when the bucket size width increase
>> from 16bit to 32bit, we don't need to worry about them.
>>
>> Signed-off-by: Coly Li <colyli@suse.de>
>> ---
>>   drivers/md/bcache/alloc.c    |  2 +-
>>   drivers/md/bcache/features.c | 22 ++++++++++++++++++++++
>>   drivers/md/bcache/features.h |  9 ++++++---
>>   drivers/md/bcache/movinggc.c |  4 ++--
>>   drivers/md/bcache/super.c    | 23 +++++++++++++++++++----
>>   include/uapi/linux/bcache.h  |  3 ++-
>>   6 files changed, 52 insertions(+), 11 deletions(-)
>>   create mode 100644 drivers/md/bcache/features.c
>>
>> diff --git a/drivers/md/bcache/alloc.c b/drivers/md/bcache/alloc.c
>> index a1df0d95151c..52035a78d836 100644
>> --- a/drivers/md/bcache/alloc.c
>> +++ b/drivers/md/bcache/alloc.c
>> @@ -87,7 +87,7 @@ void bch_rescale_priorities(struct cache_set *c, int
>> sectors)
>>   {
>>       struct cache *ca;
>>       struct bucket *b;
>> -    unsigned int next = c->nbuckets * c->sb.bucket_size / 1024;
>> +    unsigned long next = c->nbuckets * c->sb.bucket_size / 1024;
>>       unsigned int i;
>>       int r;
>>   diff --git a/drivers/md/bcache/features.c
>> b/drivers/md/bcache/features.c
>> new file mode 100644
>> index 000000000000..ba53944bb390
>> --- /dev/null
>> +++ b/drivers/md/bcache/features.c
>> @@ -0,0 +1,22 @@
>> +// SPDX-License-Identifier: GPL-2.0
>> +/*
>> + * Feature set bits and string conversion.
>> + * Inspired by ext4's features compat/incompat/ro_compat related code.
>> + *
>> + * Copyright 2020 Coly Li <colyli@suse.de>
>> + *
>> + */
>> +#include <linux/bcache.h>
>> +#include "bcache.h"
>> +
>> +struct feature {
>> +    int        compat;
>> +    unsigned int    mask;
>> +    const char    *string;
>> +};
>> +
>> +static struct feature feature_list[] = {
>> +    {BCH_FEATURE_INCOMPAT, BCH_FEATURE_INCOMPAT_LARGE_BUCKET,
>> +        "large_bucket"},
>> +    {0, 0, 0 },
>> +};
>> diff --git a/drivers/md/bcache/features.h b/drivers/md/bcache/features.h
>> index ae7df37b9862..dca052cf5203 100644
>> --- a/drivers/md/bcache/features.h
>> +++ b/drivers/md/bcache/features.h
>> @@ -11,9 +11,13 @@
>>   #define BCH_FEATURE_INCOMPAT        2
>>   #define BCH_FEATURE_TYPE_MASK        0x03
>>   +/* Feature set definition */
>> +/* Incompat feature set */
>> +#define BCH_FEATURE_INCOMPAT_LARGE_BUCKET    0x0001 /* 32bit bucket
>> size */
>> +
>>   #define BCH_FEATURE_COMPAT_SUUP        0
>>   #define BCH_FEATURE_RO_COMPAT_SUUP    0
>> -#define BCH_FEATURE_INCOMPAT_SUUP    0
>> +#define BCH_FEATURE_INCOMPAT_SUUP    BCH_FEATURE_INCOMPAT_LARGE_BUCKET
>>     #define BCH_HAS_COMPAT_FEATURE(sb, mask) \
>>           ((sb)->feature_compat & (mask))
>> @@ -22,8 +26,6 @@
>>   #define BCH_HAS_INCOMPAT_FEATURE(sb, mask) \
>>           ((sb)->feature_incompat & (mask))
>>   -/* Feature set definition */
>> -
>>   #define BCH_FEATURE_COMPAT_FUNCS(name, flagname) \
>>   static inline int bch_has_feature_##name(struct cache_sb *sb) \
>>   { \
>> @@ -75,4 +77,5 @@ static inline void bch_clear_feature_##name(struct
>> cache_sb *sb) \
>>           ~BCH##_FEATURE_INCOMPAT_##flagname; \
>>   }
>>   +BCH_FEATURE_INCOMPAT_FUNCS(large_bucket, LARGE_BUCKET);
>>   #endif
>> diff --git a/drivers/md/bcache/movinggc.c b/drivers/md/bcache/movinggc.c
>> index b7dd2d75f58c..5872d6470470 100644
>> --- a/drivers/md/bcache/movinggc.c
>> +++ b/drivers/md/bcache/movinggc.c
>> @@ -206,8 +206,8 @@ void bch_moving_gc(struct cache_set *c)
>>       mutex_lock(&c->bucket_lock);
>>         for_each_cache(ca, c, i) {
>> -        unsigned int sectors_to_move = 0;
>> -        unsigned int reserve_sectors = ca->sb.bucket_size *
>> +        unsigned long sectors_to_move = 0;
>> +        unsigned long reserve_sectors = ca->sb.bucket_size *
>>                    fifo_used(&ca->free[RESERVE_MOVINGGC]);
>>             ca->heap.used = 0;
>> diff --git a/drivers/md/bcache/super.c b/drivers/md/bcache/super.c
>> index 02901d0ae8e2..e0da52f8e8c9 100644
>> --- a/drivers/md/bcache/super.c
>> +++ b/drivers/md/bcache/super.c
>> @@ -60,6 +60,17 @@ struct workqueue_struct *bch_journal_wq;
>>     /* Superblock */
>>   +static unsigned int get_bucket_size(struct cache_sb *sb, struct
>> cache_sb_disk *s)
>> +{
>> +    unsigned int bucket_size = le16_to_cpu(s->bucket_size);
>> +
>> +    if (sb->version >= BCACHE_SB_VERSION_CDEV_WITH_FEATURES &&
>> +         bch_has_feature_large_bucket(sb))
>> +        bucket_size |= le32_to_cpu(s->bucket_size_hi) << 16;
>> +
>> +    return bucket_size;
>> +}
>> +
>>   static const char *read_super_common(struct cache_sb *sb,  struct
>> block_device *bdev,
>>                        struct cache_sb_disk *s)
>>   {
> That is a bit unfortunate; bucket_size_hi is 32 bits, so we might end up
> with an overflow here if bucket_size_hi is larger that USHRT_MAX.
> 

In bcache-tools, the maximum value of bucket_size is restricted to
UINT_MAX in make_bcache(),
	case 'b':
		bucket_size =
			hatoi_validate(optarg, "bucket size", UINT_MAX);
		break;

So the overflow won't happen if people use bcache program to make the
cache device.

If people want to modify bucket_size_hi themselves, in
read_super_common(), there are several checks to make sure the hacked
cache_sb->bucket_size composed from the hacked bucket_size_hi won't mess
up whole kernel,
1) should be power of 2
	if (!is_power_of_2(sb->bucket_size))
2) should large than page size
	if (sb->bucket_size < PAGE_SECTORS)
3) should match sb->nbuckets
	if (get_capacity(bdev->bd_disk) <
	    sb->bucket_size * sb->nbuckets)
4) no overlap with super block
	if (sb->first_bucket * sb->bucket_size < 16)

Therefore the overflow won't happen, and even happen by hacking it won't
panic kernel.

> So to avoid overflow either make bucket_size_hi 16 bit, too, or define
> this feature such that the original bucket_size field is ignored and
> just the new size field is used.
> 

Now cache_sb_disk is for on-disk format, it contains 16bit bucket_size
and 32bit bucket_size_hi. cache_sb is for in-memory object only, it
simply has a 32bit bucket_size.

Indeed I planed to set cache_sb->bucket_size to be uint64_t, but I feel
people won't use bucket size > 1TB, and finally make it to be a 32bit value.

I will add more code comments to explain why there won't be overflow in
get_bucket_size() because user space tool limits the maximum size.

Thanks.

Coly Li

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [PATCH v3 13/16] bcache: add bucket_size_hi into struct cache_sb_disk for large bucket
  2020-07-16  6:41     ` Coly Li
@ 2020-07-16  7:02       ` Hannes Reinecke
  2020-07-16  7:08         ` Coly Li
  0 siblings, 1 reply; 33+ messages in thread
From: Hannes Reinecke @ 2020-07-16  7:02 UTC (permalink / raw)
  To: Coly Li, linux-bcache; +Cc: linux-block

On 7/16/20 8:41 AM, Coly Li wrote:
> On 2020/7/16 14:15, Hannes Reinecke wrote:
>> On 7/15/20 4:30 PM, colyli@suse.de wrote:
>>> From: Coly Li <colyli@suse.de>
>>>
>>> The large bucket feature is to extend bucket_size from 16bit to 32bit.
>>>
>>> When create cache device on zoned device (e.g. zoned NVMe SSD), making
>>> a single bucket cover one or more zones of the zoned device is the
>>> simplest way to support zoned device as cache by bcache.
>>>
>>> But current maximum bucket size is 16MB and a typical zone size of zoned
>>> device is 256MB, this is the major motiviation to extend bucket size to
>>> a larger bit width.
>>>
>>> This patch is the basic and first change to support large bucket size,
>>> the major changes it makes are,
>>> - Add BCH_FEATURE_INCOMPAT_LARGE_BUCKET for the large bucket feature,
>>>     INCOMPAT means it introduces incompatible on-disk format change.
>>> - Add BCH_FEATURE_INCOMPAT_FUNCS(large_bucket, LARGE_BUCKET) routines.
>>> - Adds __le32 bucket_size_hi into struct cache_sb_disk at offset 0x8d0
>>>     for the on-disk super block format.
>>> - For the in-memory super block struct cache_sb, member bucket_size is
>>>     extended from __u16 to __32.
>>> - Add get_bucket_size() to combine the bucket_size and bucket_size_hi
>>>     from struct cache_sb_disk into an unsigned int value.
>>>
>>> Since we already have large bucket size helpers meta_bucket_pages(),
>>> meta_bucket_bytes() and alloc_meta_bucket_pages(), they make sure when
>>> bucket size > 8MB, the memory allocation for bcache meta data bucket
>>> won't fail no matter how large the bucket size extended. So these meta
>>> data buckets are handled properly when the bucket size width increase
>>> from 16bit to 32bit, we don't need to worry about them.
>>>
>>> Signed-off-by: Coly Li <colyli@suse.de>
>>> ---
>>>    drivers/md/bcache/alloc.c    |  2 +-
>>>    drivers/md/bcache/features.c | 22 ++++++++++++++++++++++
>>>    drivers/md/bcache/features.h |  9 ++++++---
>>>    drivers/md/bcache/movinggc.c |  4 ++--
>>>    drivers/md/bcache/super.c    | 23 +++++++++++++++++++----
>>>    include/uapi/linux/bcache.h  |  3 ++-
>>>    6 files changed, 52 insertions(+), 11 deletions(-)
>>>    create mode 100644 drivers/md/bcache/features.c
>>>
>>> diff --git a/drivers/md/bcache/alloc.c b/drivers/md/bcache/alloc.c
>>> index a1df0d95151c..52035a78d836 100644
>>> --- a/drivers/md/bcache/alloc.c
>>> +++ b/drivers/md/bcache/alloc.c
>>> @@ -87,7 +87,7 @@ void bch_rescale_priorities(struct cache_set *c, int
>>> sectors)
>>>    {
>>>        struct cache *ca;
>>>        struct bucket *b;
>>> -    unsigned int next = c->nbuckets * c->sb.bucket_size / 1024;
>>> +    unsigned long next = c->nbuckets * c->sb.bucket_size / 1024;
>>>        unsigned int i;
>>>        int r;
>>>    diff --git a/drivers/md/bcache/features.c
>>> b/drivers/md/bcache/features.c
>>> new file mode 100644
>>> index 000000000000..ba53944bb390
>>> --- /dev/null
>>> +++ b/drivers/md/bcache/features.c
>>> @@ -0,0 +1,22 @@
>>> +// SPDX-License-Identifier: GPL-2.0
>>> +/*
>>> + * Feature set bits and string conversion.
>>> + * Inspired by ext4's features compat/incompat/ro_compat related code.
>>> + *
>>> + * Copyright 2020 Coly Li <colyli@suse.de>
>>> + *
>>> + */
>>> +#include <linux/bcache.h>
>>> +#include "bcache.h"
>>> +
>>> +struct feature {
>>> +    int        compat;
>>> +    unsigned int    mask;
>>> +    const char    *string;
>>> +};
>>> +
>>> +static struct feature feature_list[] = {
>>> +    {BCH_FEATURE_INCOMPAT, BCH_FEATURE_INCOMPAT_LARGE_BUCKET,
>>> +        "large_bucket"},
>>> +    {0, 0, 0 },
>>> +};
>>> diff --git a/drivers/md/bcache/features.h b/drivers/md/bcache/features.h
>>> index ae7df37b9862..dca052cf5203 100644
>>> --- a/drivers/md/bcache/features.h
>>> +++ b/drivers/md/bcache/features.h
>>> @@ -11,9 +11,13 @@
>>>    #define BCH_FEATURE_INCOMPAT        2
>>>    #define BCH_FEATURE_TYPE_MASK        0x03
>>>    +/* Feature set definition */
>>> +/* Incompat feature set */
>>> +#define BCH_FEATURE_INCOMPAT_LARGE_BUCKET    0x0001 /* 32bit bucket
>>> size */
>>> +
>>>    #define BCH_FEATURE_COMPAT_SUUP        0
>>>    #define BCH_FEATURE_RO_COMPAT_SUUP    0
>>> -#define BCH_FEATURE_INCOMPAT_SUUP    0
>>> +#define BCH_FEATURE_INCOMPAT_SUUP    BCH_FEATURE_INCOMPAT_LARGE_BUCKET
>>>      #define BCH_HAS_COMPAT_FEATURE(sb, mask) \
>>>            ((sb)->feature_compat & (mask))
>>> @@ -22,8 +26,6 @@
>>>    #define BCH_HAS_INCOMPAT_FEATURE(sb, mask) \
>>>            ((sb)->feature_incompat & (mask))
>>>    -/* Feature set definition */
>>> -
>>>    #define BCH_FEATURE_COMPAT_FUNCS(name, flagname) \
>>>    static inline int bch_has_feature_##name(struct cache_sb *sb) \
>>>    { \
>>> @@ -75,4 +77,5 @@ static inline void bch_clear_feature_##name(struct
>>> cache_sb *sb) \
>>>            ~BCH##_FEATURE_INCOMPAT_##flagname; \
>>>    }
>>>    +BCH_FEATURE_INCOMPAT_FUNCS(large_bucket, LARGE_BUCKET);
>>>    #endif
>>> diff --git a/drivers/md/bcache/movinggc.c b/drivers/md/bcache/movinggc.c
>>> index b7dd2d75f58c..5872d6470470 100644
>>> --- a/drivers/md/bcache/movinggc.c
>>> +++ b/drivers/md/bcache/movinggc.c
>>> @@ -206,8 +206,8 @@ void bch_moving_gc(struct cache_set *c)
>>>        mutex_lock(&c->bucket_lock);
>>>          for_each_cache(ca, c, i) {
>>> -        unsigned int sectors_to_move = 0;
>>> -        unsigned int reserve_sectors = ca->sb.bucket_size *
>>> +        unsigned long sectors_to_move = 0;
>>> +        unsigned long reserve_sectors = ca->sb.bucket_size *
>>>                     fifo_used(&ca->free[RESERVE_MOVINGGC]);
>>>              ca->heap.used = 0;
>>> diff --git a/drivers/md/bcache/super.c b/drivers/md/bcache/super.c
>>> index 02901d0ae8e2..e0da52f8e8c9 100644
>>> --- a/drivers/md/bcache/super.c
>>> +++ b/drivers/md/bcache/super.c
>>> @@ -60,6 +60,17 @@ struct workqueue_struct *bch_journal_wq;
>>>      /* Superblock */
>>>    +static unsigned int get_bucket_size(struct cache_sb *sb, struct
>>> cache_sb_disk *s)
>>> +{
>>> +    unsigned int bucket_size = le16_to_cpu(s->bucket_size);
>>> +
>>> +    if (sb->version >= BCACHE_SB_VERSION_CDEV_WITH_FEATURES &&
>>> +         bch_has_feature_large_bucket(sb))
>>> +        bucket_size |= le32_to_cpu(s->bucket_size_hi) << 16;
>>> +
>>> +    return bucket_size;
>>> +}
>>> +
>>>    static const char *read_super_common(struct cache_sb *sb,  struct
>>> block_device *bdev,
>>>                         struct cache_sb_disk *s)
>>>    {
>> That is a bit unfortunate; bucket_size_hi is 32 bits, so we might end up
>> with an overflow here if bucket_size_hi is larger that USHRT_MAX.
>>
> 
> In bcache-tools, the maximum value of bucket_size is restricted to
> UINT_MAX in make_bcache(),
> 	case 'b':
> 		bucket_size =
> 			hatoi_validate(optarg, "bucket size", UINT_MAX);
> 		break;
> 
> So the overflow won't happen if people use bcache program to make the
> cache device.
> 
> If people want to modify bucket_size_hi themselves, in
> read_super_common(), there are several checks to make sure the hacked
> cache_sb->bucket_size composed from the hacked bucket_size_hi won't mess
> up whole kernel,
> 1) should be power of 2
> 	if (!is_power_of_2(sb->bucket_size))
> 2) should large than page size
> 	if (sb->bucket_size < PAGE_SECTORS)
> 3) should match sb->nbuckets
> 	if (get_capacity(bdev->bd_disk) <
> 	    sb->bucket_size * sb->nbuckets)
> 4) no overlap with super block
> 	if (sb->first_bucket * sb->bucket_size < 16)
> 
> Therefore the overflow won't happen, and even happen by hacking it won't
> panic kernel.
> 
>> So to avoid overflow either make bucket_size_hi 16 bit, too, or define
>> this feature such that the original bucket_size field is ignored and
>> just the new size field is used.
>>
> 
> Now cache_sb_disk is for on-disk format, it contains 16bit bucket_size
> and 32bit bucket_size_hi. cache_sb is for in-memory object only, it
> simply has a 32bit bucket_size.
> 
> Indeed I planed to set cache_sb->bucket_size to be uint64_t, but I feel
> people won't use bucket size > 1TB, and finally make it to be a 32bit value.
> 
> I will add more code comments to explain why there won't be overflow in
> get_bucket_size() because user space tool limits the maximum size.
> 
So why not make bucket_size_hi 16-bit?
That would clear up the ambiguity, and you wouldn't need to add comments ...

Cheers,

Hannes
-- 
Dr. Hannes Reinecke            Teamlead Storage & Networking
hare@suse.de                               +49 911 74053 688
SUSE Software Solutions GmbH, Maxfeldstr. 5, 90409 Nürnberg
HRB 36809 (AG Nürnberg), Geschäftsführer: Felix Imendörffer

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [PATCH v3 13/16] bcache: add bucket_size_hi into struct cache_sb_disk for large bucket
  2020-07-16  7:02       ` Hannes Reinecke
@ 2020-07-16  7:08         ` Coly Li
  0 siblings, 0 replies; 33+ messages in thread
From: Coly Li @ 2020-07-16  7:08 UTC (permalink / raw)
  To: Hannes Reinecke, linux-bcache; +Cc: linux-block

On 2020/7/16 15:02, Hannes Reinecke wrote:
> On 7/16/20 8:41 AM, Coly Li wrote:
>> On 2020/7/16 14:15, Hannes Reinecke wrote:
>>> On 7/15/20 4:30 PM, colyli@suse.de wrote:
>>>> From: Coly Li <colyli@suse.de>
>>>>
>>>> The large bucket feature is to extend bucket_size from 16bit to 32bit.
>>>>
>>>> When create cache device on zoned device (e.g. zoned NVMe SSD), making
>>>> a single bucket cover one or more zones of the zoned device is the
>>>> simplest way to support zoned device as cache by bcache.
>>>>
>>>> But current maximum bucket size is 16MB and a typical zone size of
>>>> zoned
>>>> device is 256MB, this is the major motiviation to extend bucket size to
>>>> a larger bit width.
>>>>
>>>> This patch is the basic and first change to support large bucket size,
>>>> the major changes it makes are,
>>>> - Add BCH_FEATURE_INCOMPAT_LARGE_BUCKET for the large bucket feature,
>>>>     INCOMPAT means it introduces incompatible on-disk format change.
>>>> - Add BCH_FEATURE_INCOMPAT_FUNCS(large_bucket, LARGE_BUCKET) routines.
>>>> - Adds __le32 bucket_size_hi into struct cache_sb_disk at offset 0x8d0
>>>>     for the on-disk super block format.
>>>> - For the in-memory super block struct cache_sb, member bucket_size is
>>>>     extended from __u16 to __32.
>>>> - Add get_bucket_size() to combine the bucket_size and bucket_size_hi
>>>>     from struct cache_sb_disk into an unsigned int value.
>>>>
>>>> Since we already have large bucket size helpers meta_bucket_pages(),
>>>> meta_bucket_bytes() and alloc_meta_bucket_pages(), they make sure when
>>>> bucket size > 8MB, the memory allocation for bcache meta data bucket
>>>> won't fail no matter how large the bucket size extended. So these meta
>>>> data buckets are handled properly when the bucket size width increase
>>>> from 16bit to 32bit, we don't need to worry about them.
>>>>
>>>> Signed-off-by: Coly Li <colyli@suse.de>
>>>> ---
>>>>    drivers/md/bcache/alloc.c    |  2 +-
>>>>    drivers/md/bcache/features.c | 22 ++++++++++++++++++++++
>>>>    drivers/md/bcache/features.h |  9 ++++++---
>>>>    drivers/md/bcache/movinggc.c |  4 ++--
>>>>    drivers/md/bcache/super.c    | 23 +++++++++++++++++++----
>>>>    include/uapi/linux/bcache.h  |  3 ++-
>>>>    6 files changed, 52 insertions(+), 11 deletions(-)
>>>>    create mode 100644 drivers/md/bcache/features.c
>>>>
>>>> diff --git a/drivers/md/bcache/alloc.c b/drivers/md/bcache/alloc.c
>>>> index a1df0d95151c..52035a78d836 100644
>>>> --- a/drivers/md/bcache/alloc.c
>>>> +++ b/drivers/md/bcache/alloc.c
>>>> @@ -87,7 +87,7 @@ void bch_rescale_priorities(struct cache_set *c, int
>>>> sectors)
>>>>    {
>>>>        struct cache *ca;
>>>>        struct bucket *b;
>>>> -    unsigned int next = c->nbuckets * c->sb.bucket_size / 1024;
>>>> +    unsigned long next = c->nbuckets * c->sb.bucket_size / 1024;
>>>>        unsigned int i;
>>>>        int r;
>>>>    diff --git a/drivers/md/bcache/features.c
>>>> b/drivers/md/bcache/features.c
>>>> new file mode 100644
>>>> index 000000000000..ba53944bb390
>>>> --- /dev/null
>>>> +++ b/drivers/md/bcache/features.c
>>>> @@ -0,0 +1,22 @@
>>>> +// SPDX-License-Identifier: GPL-2.0
>>>> +/*
>>>> + * Feature set bits and string conversion.
>>>> + * Inspired by ext4's features compat/incompat/ro_compat related code.
>>>> + *
>>>> + * Copyright 2020 Coly Li <colyli@suse.de>
>>>> + *
>>>> + */
>>>> +#include <linux/bcache.h>
>>>> +#include "bcache.h"
>>>> +
>>>> +struct feature {
>>>> +    int        compat;
>>>> +    unsigned int    mask;
>>>> +    const char    *string;
>>>> +};
>>>> +
>>>> +static struct feature feature_list[] = {
>>>> +    {BCH_FEATURE_INCOMPAT, BCH_FEATURE_INCOMPAT_LARGE_BUCKET,
>>>> +        "large_bucket"},
>>>> +    {0, 0, 0 },
>>>> +};
>>>> diff --git a/drivers/md/bcache/features.h
>>>> b/drivers/md/bcache/features.h
>>>> index ae7df37b9862..dca052cf5203 100644
>>>> --- a/drivers/md/bcache/features.h
>>>> +++ b/drivers/md/bcache/features.h
>>>> @@ -11,9 +11,13 @@
>>>>    #define BCH_FEATURE_INCOMPAT        2
>>>>    #define BCH_FEATURE_TYPE_MASK        0x03
>>>>    +/* Feature set definition */
>>>> +/* Incompat feature set */
>>>> +#define BCH_FEATURE_INCOMPAT_LARGE_BUCKET    0x0001 /* 32bit bucket
>>>> size */
>>>> +
>>>>    #define BCH_FEATURE_COMPAT_SUUP        0
>>>>    #define BCH_FEATURE_RO_COMPAT_SUUP    0
>>>> -#define BCH_FEATURE_INCOMPAT_SUUP    0
>>>> +#define BCH_FEATURE_INCOMPAT_SUUP    BCH_FEATURE_INCOMPAT_LARGE_BUCKET
>>>>      #define BCH_HAS_COMPAT_FEATURE(sb, mask) \
>>>>            ((sb)->feature_compat & (mask))
>>>> @@ -22,8 +26,6 @@
>>>>    #define BCH_HAS_INCOMPAT_FEATURE(sb, mask) \
>>>>            ((sb)->feature_incompat & (mask))
>>>>    -/* Feature set definition */
>>>> -
>>>>    #define BCH_FEATURE_COMPAT_FUNCS(name, flagname) \
>>>>    static inline int bch_has_feature_##name(struct cache_sb *sb) \
>>>>    { \
>>>> @@ -75,4 +77,5 @@ static inline void bch_clear_feature_##name(struct
>>>> cache_sb *sb) \
>>>>            ~BCH##_FEATURE_INCOMPAT_##flagname; \
>>>>    }
>>>>    +BCH_FEATURE_INCOMPAT_FUNCS(large_bucket, LARGE_BUCKET);
>>>>    #endif
>>>> diff --git a/drivers/md/bcache/movinggc.c
>>>> b/drivers/md/bcache/movinggc.c
>>>> index b7dd2d75f58c..5872d6470470 100644
>>>> --- a/drivers/md/bcache/movinggc.c
>>>> +++ b/drivers/md/bcache/movinggc.c
>>>> @@ -206,8 +206,8 @@ void bch_moving_gc(struct cache_set *c)
>>>>        mutex_lock(&c->bucket_lock);
>>>>          for_each_cache(ca, c, i) {
>>>> -        unsigned int sectors_to_move = 0;
>>>> -        unsigned int reserve_sectors = ca->sb.bucket_size *
>>>> +        unsigned long sectors_to_move = 0;
>>>> +        unsigned long reserve_sectors = ca->sb.bucket_size *
>>>>                     fifo_used(&ca->free[RESERVE_MOVINGGC]);
>>>>              ca->heap.used = 0;
>>>> diff --git a/drivers/md/bcache/super.c b/drivers/md/bcache/super.c
>>>> index 02901d0ae8e2..e0da52f8e8c9 100644
>>>> --- a/drivers/md/bcache/super.c
>>>> +++ b/drivers/md/bcache/super.c
>>>> @@ -60,6 +60,17 @@ struct workqueue_struct *bch_journal_wq;
>>>>      /* Superblock */
>>>>    +static unsigned int get_bucket_size(struct cache_sb *sb, struct
>>>> cache_sb_disk *s)
>>>> +{
>>>> +    unsigned int bucket_size = le16_to_cpu(s->bucket_size);
>>>> +
>>>> +    if (sb->version >= BCACHE_SB_VERSION_CDEV_WITH_FEATURES &&
>>>> +         bch_has_feature_large_bucket(sb))
>>>> +        bucket_size |= le32_to_cpu(s->bucket_size_hi) << 16;
>>>> +
>>>> +    return bucket_size;
>>>> +}
>>>> +
>>>>    static const char *read_super_common(struct cache_sb *sb,  struct
>>>> block_device *bdev,
>>>>                         struct cache_sb_disk *s)
>>>>    {
>>> That is a bit unfortunate; bucket_size_hi is 32 bits, so we might end up
>>> with an overflow here if bucket_size_hi is larger that USHRT_MAX.
>>>
>>
>> In bcache-tools, the maximum value of bucket_size is restricted to
>> UINT_MAX in make_bcache(),
>>     case 'b':
>>         bucket_size =
>>             hatoi_validate(optarg, "bucket size", UINT_MAX);
>>         break;
>>
>> So the overflow won't happen if people use bcache program to make the
>> cache device.
>>
>> If people want to modify bucket_size_hi themselves, in
>> read_super_common(), there are several checks to make sure the hacked
>> cache_sb->bucket_size composed from the hacked bucket_size_hi won't mess
>> up whole kernel,
>> 1) should be power of 2
>>     if (!is_power_of_2(sb->bucket_size))
>> 2) should large than page size
>>     if (sb->bucket_size < PAGE_SECTORS)
>> 3) should match sb->nbuckets
>>     if (get_capacity(bdev->bd_disk) <
>>         sb->bucket_size * sb->nbuckets)
>> 4) no overlap with super block
>>     if (sb->first_bucket * sb->bucket_size < 16)
>>
>> Therefore the overflow won't happen, and even happen by hacking it won't
>> panic kernel.
>>
>>> So to avoid overflow either make bucket_size_hi 16 bit, too, or define
>>> this feature such that the original bucket_size field is ignored and
>>> just the new size field is used.
>>>
>>
>> Now cache_sb_disk is for on-disk format, it contains 16bit bucket_size
>> and 32bit bucket_size_hi. cache_sb is for in-memory object only, it
>> simply has a 32bit bucket_size.
>>
>> Indeed I planed to set cache_sb->bucket_size to be uint64_t, but I feel
>> people won't use bucket size > 1TB, and finally make it to be a 32bit
>> value.
>>
>> I will add more code comments to explain why there won't be overflow in
>> get_bucket_size() because user space tool limits the maximum size.
>>
> So why not make bucket_size_hi 16-bit?
> That would clear up the ambiguity, and you wouldn't need to add comments
> ...

I did seriously think maybe many years later for some crazy reason
people do want >4TB bucket size.... also I though 48bit bucket size
should be large enough in whole Linux life cycle....

Maybe I can set bucket_size_hi as 16bit and pad the following 2 bytes
for the on-disk format, or just make buket_size_hi as 16bit if I was
over thoughtful.

Coly Li

^ permalink raw reply	[flat|nested] 33+ messages in thread

end of thread, back to index

Thread overview: 33+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2020-07-15 14:29 [PATCH v3 00/16] bcache: extend bucket size to 32bit width colyli
2020-07-15 14:30 ` [PATCH v3 01/16] bcache: add read_super_common() to read major part of super block colyli
2020-07-15 14:30 ` [PATCH v3 02/16] bcache: add more accurate error information in read_super_common() colyli
2020-07-15 14:30 ` [PATCH v3 03/16] bcache: disassemble the big if() checks in bch_cache_set_alloc() colyli
2020-07-15 14:30 ` [PATCH v3 04/16] bcache: fix super block seq numbers comparision in register_cache_set() colyli
2020-07-15 14:30 ` [PATCH v3 05/16] bcache: increase super block version for cache device and backing device colyli
2020-07-15 14:30 ` [PATCH v3 06/16] bcache: move bucket related code into read_super_common() colyli
2020-07-15 14:30 ` [PATCH v3 07/16] bcache: struct cache_sb is only for in-memory super block now colyli
2020-07-15 18:21   ` Christoph Hellwig
2020-07-16  3:31     ` Coly Li
2020-07-15 14:30 ` [PATCH v3 08/16] bcache: introduce meta_bucket_pages() related helper routines colyli
2020-07-15 15:36   ` Hannes Reinecke
2020-07-15 16:00     ` Coly Li
2020-07-15 14:30 ` [PATCH v3 09/16] bcache: handle c->uuids properly for bucket size > 8MB colyli
2020-07-15 15:37   ` Hannes Reinecke
2020-07-15 14:30 ` [PATCH v3 10/16] bcache: handle cache prio_buckets and disk_buckets " colyli
2020-07-15 15:38   ` Hannes Reinecke
2020-07-15 14:30 ` [PATCH v3 11/16] bcache: handle cache set verify_ondisk " colyli
2020-07-16  6:07   ` Hannes Reinecke
2020-07-15 14:30 ` [PATCH v3 12/16] bcache: handle btree node memory allocation " colyli
2020-07-16  6:08   ` Hannes Reinecke
2020-07-15 14:30 ` [PATCH v3 13/16] bcache: add bucket_size_hi into struct cache_sb_disk for large bucket colyli
2020-07-16  6:15   ` Hannes Reinecke
2020-07-16  6:41     ` Coly Li
2020-07-16  7:02       ` Hannes Reinecke
2020-07-16  7:08         ` Coly Li
2020-07-15 14:30 ` [PATCH v3 14/16] bcache: add sysfs file to display feature sets information of cache set colyli
2020-07-16  6:17   ` Hannes Reinecke
2020-07-16  6:20     ` Coly Li
2020-07-15 14:30 ` [PATCH v3 15/16] bcache: avoid extra memory allocation from mempool c->fill_iter colyli
2020-07-16  6:18   ` Hannes Reinecke
2020-07-15 14:30 ` [PATCH v3 16/16] bcache: avoid extra memory consumption in struct bbio for large bucket size colyli
2020-07-16  6:18   ` Hannes Reinecke

Linux-bcache Archive on lore.kernel.org

Archives are clonable:
	git clone --mirror https://lore.kernel.org/linux-bcache/0 linux-bcache/git/0.git

	# If you have public-inbox 1.1+ installed, you may
	# initialize and index your mirror using the following commands:
	public-inbox-init -V2 linux-bcache linux-bcache/ https://lore.kernel.org/linux-bcache \
		linux-bcache@vger.kernel.org
	public-inbox-index linux-bcache

Example config snippet for mirrors

Newsgroup available over NNTP:
	nntp://nntp.lore.kernel.org/org.kernel.vger.linux-bcache


AGPL code for this site: git clone https://public-inbox.org/public-inbox.git