* [PATCH 01/20] bcache: consider the fragmentation when update the writeback rate
2021-02-10 5:07 [PATCH 00/20] bcache patches for Linux v5.12 Coly Li
@ 2021-02-10 5:07 ` Coly Li
2021-02-10 5:07 ` [PATCH 02/20] bcache: Fix register_device_aync typo Coly Li
` (19 subsequent siblings)
20 siblings, 0 replies; 26+ messages in thread
From: Coly Li @ 2021-02-10 5:07 UTC (permalink / raw)
To: axboe; +Cc: linux-bcache, linux-block, dongdong tao, Coly Li
From: dongdong tao <dongdong.tao@canonical.com>
The current way to calculate the writeback rate only considers the
dirty sectors. This usually works fine when the fragmentation is not
high, but it gives an unreasonably small rate when very few dirty
sectors consume a lot of dirty buckets. In some cases the dirty
buckets can reach CUTOFF_WRITEBACK_SYNC while the dirty data (sectors)
has not even reached writeback_percent; the writeback rate will then
still be the minimum value (4k), causing all writes to be stuck in a
non-writeback mode because of the slow writeback.
We accelerate the rate in 3 stages with different aggressiveness:
the first stage starts when the dirty bucket percentage rises above
BCH_WRITEBACK_FRAGMENT_THRESHOLD_LOW (50), the second at
BCH_WRITEBACK_FRAGMENT_THRESHOLD_MID (57), and the third at
BCH_WRITEBACK_FRAGMENT_THRESHOLD_HIGH (64). By default the first
stage tries to write back the amount of dirty data in one bucket (on
average) in (1 / (dirty_buckets_percent - 50)) second, the second
stage in (1 / (dirty_buckets_percent - 57)) * 100 millisecond, and
the third stage in (1 / (dirty_buckets_percent - 64)) millisecond.
The initial rate at each stage can be controlled by 3 configurable
parameters, writeback_rate_fp_term_{low|mid|high}, which default to
1, 10 and 1000 respectively. The IO throughput these values try to
achieve is described in the paragraph above; the reason I chose those
defaults is based on testing and production data. Some details:
A. When it comes to the low stage, we are still a fair way from the
70 threshold, so we only want to give it a small push by setting the
term to 1. This means the initial rate will be 170 if the fragment is
6; it is calculated as bucket_size/fragment. This rate is very small,
but still much more reasonable than the minimum of 8.
For a production bcache with a light workload, if the cache device is
bigger than 1 TB, it may take hours to consume 1% of the buckets, so
it is very possible to reclaim enough dirty buckets in this stage and
thus avoid entering the next stage.
B. If the dirty bucket ratio didn't turn around during the first
stage, we come to the mid stage. The mid stage needs to be more
aggressive than the low stage, so I chose an initial rate 10 times
that of the low stage, which means 1700 as the initial rate if the
fragment is 6. This is a normal rate we usually see for an ordinary
workload when writeback happens because of writeback_percent.
C. If the dirty bucket ratio didn't turn around during the low and
mid stages, we come to the third stage, which is the last chance to
turn around and avoid the horrible cutoff writeback sync issue. So we
choose to be 100 times more aggressive than the mid stage, which
means 170000 as the initial rate if the fragment is 6. This is also
inferred from production bcache data: I collected one week of
writeback rate data from a production bcache with quite heavy
workloads (again, the writeback was triggered by writeback_percent),
and the highest rate area is around 100000 to 240000, so I believe
this kind of aggressiveness is reasonable for production at this
stage. It should also be mostly enough, because the hint tries to
reclaim 1000 buckets per second, while that heavy production
environment consumed 50 buckets per second on average over the week.
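The per-stage arithmetic above can be sketched in plain C. This is a
standalone illustration, not the kernel code: the function name and
parameters are hypothetical, and it hard-codes the default fp terms
(1/10/1000), the 50/57/64 thresholds, and a bucket size of 1024
sectors (which is how 170 = 1024/6 arises for fragment 6):

```c
#include <assert.h>
#include <stdint.h>

/*
 * Sketch of the fragment-penalty term: fps equals the average number
 * of dirty sectors per dirty bucket multiplied by a per-stage term.
 * All names here are illustrative, not the kernel symbols.
 */
static int64_t fragment_fps(int64_t in_use_percent,
			    int64_t dirty_sectors, int64_t dirty_buckets)
{
	/* bucket_size / fragment, in sectors */
	int64_t per_bucket = dirty_sectors / dirty_buckets;
	int64_t fp_term;

	if (in_use_percent <= 50)
		return 0;	/* below the low threshold: not engaged */
	else if (in_use_percent <= 57)
		fp_term = 1 * (in_use_percent - 50);	/* low stage */
	else if (in_use_percent <= 64)
		fp_term = 10 * (in_use_percent - 57);	/* mid stage */
	else
		fp_term = 1000 * (in_use_percent - 64);	/* high stage */

	return per_bucket * fp_term;
}
```

With fragment 6 (e.g. 1024 dirty sectors spread over 6 dirty
buckets), entering each stage yields 170, 1700 and 170000, matching
the initial rates quoted above.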
Option writeback_consider_fragment controls whether this feature is
on or off; it is on by default.
Lastly, below is the performance data for all the test results,
including the data from the production environment:
https://docs.google.com/document/d/1AmbIEa_2MhB9bqhC3rfga9tp7n9YX9PLn0jSUxscVW0/edit?usp=sharing
Signed-off-by: dongdong tao <dongdong.tao@canonical.com>
Signed-off-by: Coly Li <colyli@suse.de>
---
drivers/md/bcache/bcache.h | 4 ++++
drivers/md/bcache/sysfs.c | 23 +++++++++++++++++++
drivers/md/bcache/writeback.c | 42 +++++++++++++++++++++++++++++++++++
drivers/md/bcache/writeback.h | 4 ++++
4 files changed, 73 insertions(+)
diff --git a/drivers/md/bcache/bcache.h b/drivers/md/bcache/bcache.h
index 1d57f48307e6..d7a84327b7f1 100644
--- a/drivers/md/bcache/bcache.h
+++ b/drivers/md/bcache/bcache.h
@@ -373,6 +373,7 @@ struct cached_dev {
unsigned int partial_stripes_expensive:1;
unsigned int writeback_metadata:1;
unsigned int writeback_running:1;
+ unsigned int writeback_consider_fragment:1;
unsigned char writeback_percent;
unsigned int writeback_delay;
@@ -385,6 +386,9 @@ struct cached_dev {
unsigned int writeback_rate_update_seconds;
unsigned int writeback_rate_i_term_inverse;
unsigned int writeback_rate_p_term_inverse;
+ unsigned int writeback_rate_fp_term_low;
+ unsigned int writeback_rate_fp_term_mid;
+ unsigned int writeback_rate_fp_term_high;
unsigned int writeback_rate_minimum;
enum stop_on_failure stop_when_cache_set_failed;
diff --git a/drivers/md/bcache/sysfs.c b/drivers/md/bcache/sysfs.c
index 00a520c03f41..eef15f8022ba 100644
--- a/drivers/md/bcache/sysfs.c
+++ b/drivers/md/bcache/sysfs.c
@@ -117,10 +117,14 @@ rw_attribute(writeback_running);
rw_attribute(writeback_percent);
rw_attribute(writeback_delay);
rw_attribute(writeback_rate);
+rw_attribute(writeback_consider_fragment);
rw_attribute(writeback_rate_update_seconds);
rw_attribute(writeback_rate_i_term_inverse);
rw_attribute(writeback_rate_p_term_inverse);
+rw_attribute(writeback_rate_fp_term_low);
+rw_attribute(writeback_rate_fp_term_mid);
+rw_attribute(writeback_rate_fp_term_high);
rw_attribute(writeback_rate_minimum);
read_attribute(writeback_rate_debug);
@@ -195,6 +199,7 @@ SHOW(__bch_cached_dev)
var_printf(bypass_torture_test, "%i");
var_printf(writeback_metadata, "%i");
var_printf(writeback_running, "%i");
+ var_printf(writeback_consider_fragment, "%i");
var_print(writeback_delay);
var_print(writeback_percent);
sysfs_hprint(writeback_rate,
@@ -205,6 +210,9 @@ SHOW(__bch_cached_dev)
var_print(writeback_rate_update_seconds);
var_print(writeback_rate_i_term_inverse);
var_print(writeback_rate_p_term_inverse);
+ var_print(writeback_rate_fp_term_low);
+ var_print(writeback_rate_fp_term_mid);
+ var_print(writeback_rate_fp_term_high);
var_print(writeback_rate_minimum);
if (attr == &sysfs_writeback_rate_debug) {
@@ -303,6 +311,7 @@ STORE(__cached_dev)
sysfs_strtoul_bool(bypass_torture_test, dc->bypass_torture_test);
sysfs_strtoul_bool(writeback_metadata, dc->writeback_metadata);
sysfs_strtoul_bool(writeback_running, dc->writeback_running);
+ sysfs_strtoul_bool(writeback_consider_fragment, dc->writeback_consider_fragment);
sysfs_strtoul_clamp(writeback_delay, dc->writeback_delay, 0, UINT_MAX);
sysfs_strtoul_clamp(writeback_percent, dc->writeback_percent,
@@ -331,6 +340,16 @@ STORE(__cached_dev)
sysfs_strtoul_clamp(writeback_rate_p_term_inverse,
dc->writeback_rate_p_term_inverse,
1, UINT_MAX);
+ sysfs_strtoul_clamp(writeback_rate_fp_term_low,
+ dc->writeback_rate_fp_term_low,
+ 1, dc->writeback_rate_fp_term_mid - 1);
+ sysfs_strtoul_clamp(writeback_rate_fp_term_mid,
+ dc->writeback_rate_fp_term_mid,
+ dc->writeback_rate_fp_term_low + 1,
+ dc->writeback_rate_fp_term_high - 1);
+ sysfs_strtoul_clamp(writeback_rate_fp_term_high,
+ dc->writeback_rate_fp_term_high,
+ dc->writeback_rate_fp_term_mid + 1, UINT_MAX);
sysfs_strtoul_clamp(writeback_rate_minimum,
dc->writeback_rate_minimum,
1, UINT_MAX);
@@ -499,9 +518,13 @@ static struct attribute *bch_cached_dev_files[] = {
&sysfs_writeback_delay,
&sysfs_writeback_percent,
&sysfs_writeback_rate,
+ &sysfs_writeback_consider_fragment,
&sysfs_writeback_rate_update_seconds,
&sysfs_writeback_rate_i_term_inverse,
&sysfs_writeback_rate_p_term_inverse,
+ &sysfs_writeback_rate_fp_term_low,
+ &sysfs_writeback_rate_fp_term_mid,
+ &sysfs_writeback_rate_fp_term_high,
&sysfs_writeback_rate_minimum,
&sysfs_writeback_rate_debug,
&sysfs_io_errors,
diff --git a/drivers/md/bcache/writeback.c b/drivers/md/bcache/writeback.c
index a129e4d2707c..82d4e0880a99 100644
--- a/drivers/md/bcache/writeback.c
+++ b/drivers/md/bcache/writeback.c
@@ -88,6 +88,44 @@ static void __update_writeback_rate(struct cached_dev *dc)
int64_t integral_scaled;
uint32_t new_rate;
+ /*
+ * We need to consider the number of dirty buckets as well
+ * when calculating proportional_scaled. Otherwise we might have
+ * an unreasonably small writeback rate in a highly fragmented
+ * situation where very few dirty sectors consumed a lot of dirty
+ * buckets. The worst case is when the dirty buckets reached
+ * cutoff_writeback_sync while the dirty data has not even reached
+ * writeback_percent; then the rate would still be at the minimum
+ * value, causing writes to be stuck in a non-writeback mode.
+ */
+ struct cache_set *c = dc->disk.c;
+
+ int64_t dirty_buckets = c->nbuckets - c->avail_nbuckets;
+
+ if (dc->writeback_consider_fragment &&
+ c->gc_stats.in_use > BCH_WRITEBACK_FRAGMENT_THRESHOLD_LOW && dirty > 0) {
+ int64_t fragment =
+ div_s64((dirty_buckets * c->cache->sb.bucket_size), dirty);
+ int64_t fp_term;
+ int64_t fps;
+
+ if (c->gc_stats.in_use <= BCH_WRITEBACK_FRAGMENT_THRESHOLD_MID) {
+ fp_term = dc->writeback_rate_fp_term_low *
+ (c->gc_stats.in_use - BCH_WRITEBACK_FRAGMENT_THRESHOLD_LOW);
+ } else if (c->gc_stats.in_use <= BCH_WRITEBACK_FRAGMENT_THRESHOLD_HIGH) {
+ fp_term = dc->writeback_rate_fp_term_mid *
+ (c->gc_stats.in_use - BCH_WRITEBACK_FRAGMENT_THRESHOLD_MID);
+ } else {
+ fp_term = dc->writeback_rate_fp_term_high *
+ (c->gc_stats.in_use - BCH_WRITEBACK_FRAGMENT_THRESHOLD_HIGH);
+ }
+ fps = div_s64(dirty, dirty_buckets) * fp_term;
+ if (fragment > 3 && fps > proportional_scaled) {
+ /* Only overwrite the p term when fragment > 3 */
+ proportional_scaled = fps;
+ }
+ }
+
if ((error < 0 && dc->writeback_rate_integral > 0) ||
(error > 0 && time_before64(local_clock(),
dc->writeback_rate.next + NSEC_PER_MSEC))) {
@@ -977,6 +1015,7 @@ void bch_cached_dev_writeback_init(struct cached_dev *dc)
dc->writeback_metadata = true;
dc->writeback_running = false;
+ dc->writeback_consider_fragment = true;
dc->writeback_percent = 10;
dc->writeback_delay = 30;
atomic_long_set(&dc->writeback_rate.rate, 1024);
@@ -984,6 +1023,9 @@ void bch_cached_dev_writeback_init(struct cached_dev *dc)
dc->writeback_rate_update_seconds = WRITEBACK_RATE_UPDATE_SECS_DEFAULT;
dc->writeback_rate_p_term_inverse = 40;
+ dc->writeback_rate_fp_term_low = 1;
+ dc->writeback_rate_fp_term_mid = 10;
+ dc->writeback_rate_fp_term_high = 1000;
dc->writeback_rate_i_term_inverse = 10000;
WARN_ON(test_and_clear_bit(BCACHE_DEV_WB_RUNNING, &dc->disk.flags));
diff --git a/drivers/md/bcache/writeback.h b/drivers/md/bcache/writeback.h
index 3f1230e22de0..02b2f9df73f6 100644
--- a/drivers/md/bcache/writeback.h
+++ b/drivers/md/bcache/writeback.h
@@ -16,6 +16,10 @@
#define BCH_AUTO_GC_DIRTY_THRESHOLD 50
+#define BCH_WRITEBACK_FRAGMENT_THRESHOLD_LOW 50
+#define BCH_WRITEBACK_FRAGMENT_THRESHOLD_MID 57
+#define BCH_WRITEBACK_FRAGMENT_THRESHOLD_HIGH 64
+
#define BCH_DIRTY_INIT_THRD_MAX 64
/*
* 14 (16384ths) is chosen here as something that each backing device
--
2.26.2
^ permalink raw reply related [flat|nested] 26+ messages in thread
* [PATCH 02/20] bcache: Fix register_device_aync typo
2021-02-10 5:07 [PATCH 00/20] bcache patches for Linux v5.12 Coly Li
2021-02-10 5:07 ` [PATCH 01/20] bcache: consider the fragmentation when update the writeback rate Coly Li
@ 2021-02-10 5:07 ` Coly Li
2021-02-10 5:07 ` [PATCH 03/20] Revert "bcache: Kill btree_io_wq" Coly Li
` (18 subsequent siblings)
20 siblings, 0 replies; 26+ messages in thread
From: Coly Li @ 2021-02-10 5:07 UTC (permalink / raw)
To: axboe; +Cc: linux-bcache, linux-block, Kai Krakow, Coly Li
From: Kai Krakow <kai@kaishome.de>
Should be `register_device_async`.
Cc: Coly Li <colyli@suse.de>
Signed-off-by: Kai Krakow <kai@kaishome.de>
Signed-off-by: Coly Li <colyli@suse.de>
---
drivers/md/bcache/super.c | 4 ++--
1 file changed, 2 insertions(+), 2 deletions(-)
diff --git a/drivers/md/bcache/super.c b/drivers/md/bcache/super.c
index 2047a9cccdb5..e7d1b52c5cc8 100644
--- a/drivers/md/bcache/super.c
+++ b/drivers/md/bcache/super.c
@@ -2517,7 +2517,7 @@ static void register_cache_worker(struct work_struct *work)
module_put(THIS_MODULE);
}
-static void register_device_aync(struct async_reg_args *args)
+static void register_device_async(struct async_reg_args *args)
{
if (SB_IS_BDEV(args->sb))
INIT_DELAYED_WORK(&args->reg_work, register_bdev_worker);
@@ -2611,7 +2611,7 @@ static ssize_t register_bcache(struct kobject *k, struct kobj_attribute *attr,
args->sb = sb;
args->sb_disk = sb_disk;
args->bdev = bdev;
- register_device_aync(args);
+ register_device_async(args);
/* No wait and returns to user space */
goto async_done;
}
--
2.26.2
* [PATCH 03/20] Revert "bcache: Kill btree_io_wq"
2021-02-10 5:07 [PATCH 00/20] bcache patches for Linux v5.12 Coly Li
2021-02-10 5:07 ` [PATCH 01/20] bcache: consider the fragmentation when update the writeback rate Coly Li
2021-02-10 5:07 ` [PATCH 02/20] bcache: Fix register_device_aync typo Coly Li
@ 2021-02-10 5:07 ` Coly Li
2021-02-10 5:07 ` [PATCH 04/20] bcache: Give btree_io_wq correct semantics again Coly Li
` (17 subsequent siblings)
20 siblings, 0 replies; 26+ messages in thread
From: Coly Li @ 2021-02-10 5:07 UTC (permalink / raw)
To: axboe; +Cc: linux-bcache, linux-block, Kai Krakow, Coly Li, stable
From: Kai Krakow <kai@kaishome.de>
This reverts commit 56b30770b27d54d68ad51eccc6d888282b568cee.
With the btree using the `system_wq`, I seem to see a lot more desktop
latency than I should.
After some more investigation, it looks like the original assumption
of 56b3077 is no longer true, and bcache has a very high potential of
congesting the `system_wq`. In turn, this introduces laggy desktop
performance, IO stalls (at least with btrfs), and delayed input
events.
So let's revert this. It's important to note that the semantics of
using `system_wq` previously meant that `btree_io_wq` should be
created before and destroyed after the other bcache wqs to keep the
same assumptions.
Cc: Coly Li <colyli@suse.de>
Cc: stable@vger.kernel.org # 5.4+
Signed-off-by: Kai Krakow <kai@kaishome.de>
Signed-off-by: Coly Li <colyli@suse.de>
---
drivers/md/bcache/bcache.h | 2 ++
drivers/md/bcache/btree.c | 21 +++++++++++++++++++--
drivers/md/bcache/super.c | 4 ++++
3 files changed, 25 insertions(+), 2 deletions(-)
diff --git a/drivers/md/bcache/bcache.h b/drivers/md/bcache/bcache.h
index d7a84327b7f1..2b8c7dd2cfae 100644
--- a/drivers/md/bcache/bcache.h
+++ b/drivers/md/bcache/bcache.h
@@ -1046,5 +1046,7 @@ void bch_debug_exit(void);
void bch_debug_init(void);
void bch_request_exit(void);
int bch_request_init(void);
+void bch_btree_exit(void);
+int bch_btree_init(void);
#endif /* _BCACHE_H */
diff --git a/drivers/md/bcache/btree.c b/drivers/md/bcache/btree.c
index 910df242c83d..952f022db5a5 100644
--- a/drivers/md/bcache/btree.c
+++ b/drivers/md/bcache/btree.c
@@ -99,6 +99,8 @@
#define PTR_HASH(c, k) \
(((k)->ptr[0] >> c->bucket_bits) | PTR_GEN(k, 0))
+static struct workqueue_struct *btree_io_wq;
+
#define insert_lock(s, b) ((b)->level <= (s)->lock)
@@ -308,7 +310,7 @@ static void __btree_node_write_done(struct closure *cl)
btree_complete_write(b, w);
if (btree_node_dirty(b))
- schedule_delayed_work(&b->work, 30 * HZ);
+ queue_delayed_work(btree_io_wq, &b->work, 30 * HZ);
closure_return_with_destructor(cl, btree_node_write_unlock);
}
@@ -481,7 +483,7 @@ static void bch_btree_leaf_dirty(struct btree *b, atomic_t *journal_ref)
BUG_ON(!i->keys);
if (!btree_node_dirty(b))
- schedule_delayed_work(&b->work, 30 * HZ);
+ queue_delayed_work(btree_io_wq, &b->work, 30 * HZ);
set_btree_node_dirty(b);
@@ -2764,3 +2766,18 @@ void bch_keybuf_init(struct keybuf *buf)
spin_lock_init(&buf->lock);
array_allocator_init(&buf->freelist);
}
+
+void bch_btree_exit(void)
+{
+ if (btree_io_wq)
+ destroy_workqueue(btree_io_wq);
+}
+
+int __init bch_btree_init(void)
+{
+ btree_io_wq = create_singlethread_workqueue("bch_btree_io");
+ if (!btree_io_wq)
+ return -ENOMEM;
+
+ return 0;
+}
diff --git a/drivers/md/bcache/super.c b/drivers/md/bcache/super.c
index e7d1b52c5cc8..85a44a0cffe0 100644
--- a/drivers/md/bcache/super.c
+++ b/drivers/md/bcache/super.c
@@ -2821,6 +2821,7 @@ static void bcache_exit(void)
destroy_workqueue(bcache_wq);
if (bch_journal_wq)
destroy_workqueue(bch_journal_wq);
+ bch_btree_exit();
if (bcache_major)
unregister_blkdev(bcache_major, "bcache");
@@ -2876,6 +2877,9 @@ static int __init bcache_init(void)
return bcache_major;
}
+ if (bch_btree_init())
+ goto err;
+
bcache_wq = alloc_workqueue("bcache", WQ_MEM_RECLAIM, 0);
if (!bcache_wq)
goto err;
--
2.26.2
* [PATCH 04/20] bcache: Give btree_io_wq correct semantics again
2021-02-10 5:07 [PATCH 00/20] bcache patches for Linux v5.12 Coly Li
` (2 preceding siblings ...)
2021-02-10 5:07 ` [PATCH 03/20] Revert "bcache: Kill btree_io_wq" Coly Li
@ 2021-02-10 5:07 ` Coly Li
2021-02-10 5:07 ` [PATCH 05/20] bcache: Move journal work to new flush wq Coly Li
` (16 subsequent siblings)
20 siblings, 0 replies; 26+ messages in thread
From: Coly Li @ 2021-02-10 5:07 UTC (permalink / raw)
To: axboe; +Cc: linux-bcache, linux-block, Kai Krakow, Coly Li, stable
From: Kai Krakow <kai@kaishome.de>
Before killing `btree_io_wq`, the queue was allocated using
`create_singlethread_workqueue()`, which has `WQ_MEM_RECLAIM`. After
killing it, the work lost this property, while `system_wq` is also
not single-threaded. Let's combine both worlds and make it
multi-threaded but able to reclaim memory.
Cc: Coly Li <colyli@suse.de>
Cc: stable@vger.kernel.org # 5.4+
Signed-off-by: Kai Krakow <kai@kaishome.de>
Signed-off-by: Coly Li <colyli@suse.de>
---
drivers/md/bcache/btree.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/drivers/md/bcache/btree.c b/drivers/md/bcache/btree.c
index 952f022db5a5..fe6dce125aba 100644
--- a/drivers/md/bcache/btree.c
+++ b/drivers/md/bcache/btree.c
@@ -2775,7 +2775,7 @@ void bch_btree_exit(void)
int __init bch_btree_init(void)
{
- btree_io_wq = create_singlethread_workqueue("bch_btree_io");
+ btree_io_wq = alloc_workqueue("bch_btree_io", WQ_MEM_RECLAIM, 0);
if (!btree_io_wq)
return -ENOMEM;
--
2.26.2
* [PATCH 05/20] bcache: Move journal work to new flush wq
2021-02-10 5:07 [PATCH 00/20] bcache patches for Linux v5.12 Coly Li
` (3 preceding siblings ...)
2021-02-10 5:07 ` [PATCH 04/20] bcache: Give btree_io_wq correct semantics again Coly Li
@ 2021-02-10 5:07 ` Coly Li
2021-02-10 5:07 ` [PATCH 06/20] bcache: Avoid comma separated statements Coly Li
` (15 subsequent siblings)
20 siblings, 0 replies; 26+ messages in thread
From: Coly Li @ 2021-02-10 5:07 UTC (permalink / raw)
To: axboe; +Cc: linux-bcache, linux-block, Kai Krakow, Coly Li, stable
From: Kai Krakow <kai@kaishome.de>
This is potentially long-running and not latency-sensitive; let's
get it out of the way of other latency-sensitive events.
As observed in the previous commit, the `system_wq` is easily
congested by bcache, and this fixes a few more stalls I was observing
every once in a while.
Let's not make this `WQ_MEM_RECLAIM`, as that was shown to reduce the
performance of boot and file system operations in my tests. Also,
without `WQ_MEM_RECLAIM`, I no longer see desktop stalls. This
matches the previous behavior, as `system_wq` also does no memory
reclaim:
> // workqueue.c:
> system_wq = alloc_workqueue("events", 0, 0);
Cc: Coly Li <colyli@suse.de>
Cc: stable@vger.kernel.org # 5.4+
Signed-off-by: Kai Krakow <kai@kaishome.de>
Signed-off-by: Coly Li <colyli@suse.de>
---
drivers/md/bcache/bcache.h | 1 +
drivers/md/bcache/journal.c | 4 ++--
drivers/md/bcache/super.c | 16 ++++++++++++++++
3 files changed, 19 insertions(+), 2 deletions(-)
diff --git a/drivers/md/bcache/bcache.h b/drivers/md/bcache/bcache.h
index 2b8c7dd2cfae..848dd4db1659 100644
--- a/drivers/md/bcache/bcache.h
+++ b/drivers/md/bcache/bcache.h
@@ -1005,6 +1005,7 @@ void bch_write_bdev_super(struct cached_dev *dc, struct closure *parent);
extern struct workqueue_struct *bcache_wq;
extern struct workqueue_struct *bch_journal_wq;
+extern struct workqueue_struct *bch_flush_wq;
extern struct mutex bch_register_lock;
extern struct list_head bch_cache_sets;
diff --git a/drivers/md/bcache/journal.c b/drivers/md/bcache/journal.c
index aefbdb7e003b..c6613e817333 100644
--- a/drivers/md/bcache/journal.c
+++ b/drivers/md/bcache/journal.c
@@ -932,8 +932,8 @@ atomic_t *bch_journal(struct cache_set *c,
journal_try_write(c);
} else if (!w->dirty) {
w->dirty = true;
- schedule_delayed_work(&c->journal.work,
- msecs_to_jiffies(c->journal_delay_ms));
+ queue_delayed_work(bch_flush_wq, &c->journal.work,
+ msecs_to_jiffies(c->journal_delay_ms));
spin_unlock(&c->journal.lock);
} else {
spin_unlock(&c->journal.lock);
diff --git a/drivers/md/bcache/super.c b/drivers/md/bcache/super.c
index 85a44a0cffe0..0228ccb293fc 100644
--- a/drivers/md/bcache/super.c
+++ b/drivers/md/bcache/super.c
@@ -49,6 +49,7 @@ static int bcache_major;
static DEFINE_IDA(bcache_device_idx);
static wait_queue_head_t unregister_wait;
struct workqueue_struct *bcache_wq;
+struct workqueue_struct *bch_flush_wq;
struct workqueue_struct *bch_journal_wq;
@@ -2821,6 +2822,8 @@ static void bcache_exit(void)
destroy_workqueue(bcache_wq);
if (bch_journal_wq)
destroy_workqueue(bch_journal_wq);
+ if (bch_flush_wq)
+ destroy_workqueue(bch_flush_wq);
bch_btree_exit();
if (bcache_major)
@@ -2884,6 +2887,19 @@ static int __init bcache_init(void)
if (!bcache_wq)
goto err;
+ /*
+ * Let's not make this `WQ_MEM_RECLAIM` for the following reasons:
+ *
+ * 1. It used `system_wq` before which also does no memory reclaim.
+ * 2. With `WQ_MEM_RECLAIM` desktop stalls, increased boot times, and
+ * reduced throughput can be observed.
+ *
+ * We still want to use our own queue so as not to congest the `system_wq`.
+ */
+ bch_flush_wq = alloc_workqueue("bch_flush", 0, 0);
+ if (!bch_flush_wq)
+ goto err;
+
bch_journal_wq = alloc_workqueue("bch_journal", WQ_MEM_RECLAIM, 0);
if (!bch_journal_wq)
goto err;
--
2.26.2
* [PATCH 06/20] bcache: Avoid comma separated statements
2021-02-10 5:07 [PATCH 00/20] bcache patches for Linux v5.12 Coly Li
` (4 preceding siblings ...)
2021-02-10 5:07 ` [PATCH 05/20] bcache: Move journal work to new flush wq Coly Li
@ 2021-02-10 5:07 ` Coly Li
2021-02-10 5:07 ` [PATCH 07/20] bcache: add initial data structures for nvm pages Coly Li
` (14 subsequent siblings)
20 siblings, 0 replies; 26+ messages in thread
From: Coly Li @ 2021-02-10 5:07 UTC (permalink / raw)
To: axboe; +Cc: linux-bcache, linux-block, Joe Perches, Coly Li
From: Joe Perches <joe@perches.com>
Use semicolons and braces.
Signed-off-by: Joe Perches <joe@perches.com>
Signed-off-by: Coly Li <colyli@suse.de>
---
drivers/md/bcache/bset.c | 12 ++++++++----
drivers/md/bcache/sysfs.c | 6 ++++--
2 files changed, 12 insertions(+), 6 deletions(-)
diff --git a/drivers/md/bcache/bset.c b/drivers/md/bcache/bset.c
index 67a2c47f4201..94d38e8a59b3 100644
--- a/drivers/md/bcache/bset.c
+++ b/drivers/md/bcache/bset.c
@@ -712,8 +712,10 @@ void bch_bset_build_written_tree(struct btree_keys *b)
for (j = inorder_next(0, t->size);
j;
j = inorder_next(j, t->size)) {
- while (bkey_to_cacheline(t, k) < cacheline)
- prev = k, k = bkey_next(k);
+ while (bkey_to_cacheline(t, k) < cacheline) {
+ prev = k;
+ k = bkey_next(k);
+ }
t->prev[j] = bkey_u64s(prev);
t->tree[j].m = bkey_to_cacheline_offset(t, cacheline++, k);
@@ -901,8 +903,10 @@ unsigned int bch_btree_insert_key(struct btree_keys *b, struct bkey *k,
status = BTREE_INSERT_STATUS_INSERT;
while (m != bset_bkey_last(i) &&
- bkey_cmp(k, b->ops->is_extents ? &START_KEY(m) : m) > 0)
- prev = m, m = bkey_next(m);
+ bkey_cmp(k, b->ops->is_extents ? &START_KEY(m) : m) > 0) {
+ prev = m;
+ m = bkey_next(m);
+ }
/* prev is in the tree, if we merge we're done */
status = BTREE_INSERT_STATUS_BACK_MERGE;
diff --git a/drivers/md/bcache/sysfs.c b/drivers/md/bcache/sysfs.c
index eef15f8022ba..cc89f3156d1a 100644
--- a/drivers/md/bcache/sysfs.c
+++ b/drivers/md/bcache/sysfs.c
@@ -1094,8 +1094,10 @@ SHOW(__bch_cache)
--n;
while (cached < p + n &&
- *cached == BTREE_PRIO)
- cached++, n--;
+ *cached == BTREE_PRIO) {
+ cached++;
+ n--;
+ }
for (i = 0; i < n; i++)
sum += INITIAL_PRIO - cached[i];
--
2.26.2
* [PATCH 07/20] bcache: add initial data structures for nvm pages
2021-02-10 5:07 [PATCH 00/20] bcache patches for Linux v5.12 Coly Li
` (5 preceding siblings ...)
2021-02-10 5:07 ` [PATCH 06/20] bcache: Avoid comma separated statements Coly Li
@ 2021-02-10 5:07 ` Coly Li
2021-02-10 15:09 ` Jens Axboe
2021-02-10 5:07 ` [PATCH 08/20] bcache: initialize the nvm pages allocator Coly Li
` (13 subsequent siblings)
20 siblings, 1 reply; 26+ messages in thread
From: Coly Li @ 2021-02-10 5:07 UTC (permalink / raw)
To: axboe; +Cc: linux-bcache, linux-block, Coly Li, Jianpeng Ma, Qiaowei Ren
This patch initializes the prototype data structures for the nvm
pages allocator:
- struct bch_nvm_pages_sb
This is the super block allocated on each nvdimm namespace. An nvdimm
set may have multiple namespaces; bch_nvm_pages_sb->set_uuid is used
to mark which nvdimm set this namespace belongs to. Normally we will
use the bcache cache set UUID to initialize this uuid, to connect
this nvdimm set to a specified bcache cache set.
- struct bch_owner_list_head
This is a table for the heads of all owner lists. An owner list
records which page(s) are allocated to which owner. After reboot from
power failure, the owner may find all its requested and allocated
pages from the owner list by a handler which is converted from a
UUID.
- struct bch_nvm_pages_owner_head
This is the head of an owner list. Each owner has only one owner
list, and an nvm page only belongs to a specific owner. uuid[] will
be set to the owner's uuid; for bcache it is the bcache cache set
uuid. label is not mandatory; it is a human-readable string for debug
purposes. The pointer *recs references a separate nvm page which
holds the table of struct bch_nvm_pgalloc_rec.
- struct bch_nvm_pgalloc_recs
This struct occupies a whole page; owner_uuid should match the uuid
in struct bch_nvm_pages_owner_head. recs[] is the real table
containing all allocated records.
- struct bch_nvm_pgalloc_rec
Each structure records a range of allocated nvm pages. pgoff is the
offset, in units of page size, of this allocated nvm page range.
Adjacent page ranges of the same owner can be merged into a larger
one, therefore pages_nr is NOT always a power of 2.
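As a sanity check on these layouts, the one-page (8 KB) record union
can be reproduced in a standalone C sketch. All names here are
hypothetical mock-ups, not the uapi header itself, and the sizes
assume a 64-bit build, since the embedded raw pointers make the
layout architecture-dependent:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Mock-up of struct bch_pgalloc_rec: one allocated nvm page range. */
struct rec {
	uint32_t pgoff;
	uint32_t nr;
};

/* Mock-up of struct bch_nvm_pgalloc_recs: one full page of records. */
struct pgalloc_recs {
	union {
		struct {
			void *owner;		/* 8 bytes on a 64-bit build */
			void *next;		/* 8 bytes on a 64-bit build */
			uint8_t magic[16];
			uint8_t owner_uuid[16];
			uint32_t size;
			uint32_t used;
			uint64_t _pad[4];
			struct rec recs[];	/* table of allocation records */
		};
		uint8_t pad[8192];	/* forces the whole-page footprint */
	};
};

/* How many records fit after the header, as BCH_MAX_RECS computes. */
#define MAX_RECS \
	((sizeof(struct pgalloc_recs) - offsetof(struct pgalloc_recs, recs)) / \
	 sizeof(struct rec))
```

On such a build the header occupies 88 bytes, leaving room for 1013
8-byte records per page. Note that the in-struct pointers are exactly
what makes the sizes differ between 32-bit and 64-bit builds.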
Signed-off-by: Coly Li <colyli@suse.de>
Cc: Jianpeng Ma <jianpeng.ma@intel.com>
Cc: Qiaowei Ren <qiaowei.ren@intel.com>
---
include/uapi/linux/bcache-nvm.h | 195 ++++++++++++++++++++++++++++++++
1 file changed, 195 insertions(+)
create mode 100644 include/uapi/linux/bcache-nvm.h
diff --git a/include/uapi/linux/bcache-nvm.h b/include/uapi/linux/bcache-nvm.h
new file mode 100644
index 000000000000..61108bf2a63e
--- /dev/null
+++ b/include/uapi/linux/bcache-nvm.h
@@ -0,0 +1,195 @@
+/* SPDX-License-Identifier: GPL-2.0 WITH Linux-syscall-note */
+
+#ifndef _UAPI_BCACHE_NVM_H
+#define _UAPI_BCACHE_NVM_H
+
+/*
+ * Bcache on NVDIMM data structures
+ */
+
+/*
+ * - struct bch_nvm_pages_sb
+ * This is the super block allocated on each nvdimm namespace. A nvdimm
+ * set may have multiple namespaces, bch_nvm_pages_sb->set_uuid is used to mark
+ * which nvdimm set this name space belongs to. Normally we will use the
+ * bcache's cache set UUID to initialize this uuid, to connect this nvdimm
+ * set to a specified bcache cache set.
+ *
+ * - struct bch_owner_list_head
+ * This is a table for all heads of all owner lists. An owner list records
+ * which page(s) are allocated to which owner. After reboot from power failure,
+ * the owner may find all its requested and allocated pages from the owner
+ * list by a handler which is converted by a UUID.
+ *
+ * - struct bch_nvm_pages_owner_head
+ * This is a head of an owner list. Each owner only has one owner list,
+ * and an nvm page only belongs to a specific owner. uuid[] will be set to
+ * owner's uuid, for bcache it is the bcache's cache set uuid. label is not
+ * mandatory, it is a human-readable string for debug purpose. The pointer
+ * recs references a separate nvm page which holds the table of struct
+ * bch_pgalloc_rec.
+ *
+ *- struct bch_nvm_pgalloc_recs
+ * This structure occupies a whole page, owner_uuid should match the uuid
+ * in struct bch_nvm_pages_owner_head. recs[] is the real table contains all
+ * allocated records.
+ *
+ * - struct bch_pgalloc_rec
+ * Each structure records a range of allocated nvm pages. pgoff is offset
+ * in unit of page size of this allocated nvm page range. The adjoint page
+ * ranges of same owner can be merged into a larger one, therefore pages_nr
+ * is NOT always power of 2.
+ *
+ *
+ * Memory layout on nvdimm namespace 0
+ *
+ * 0 +---------------------------------+
+ * | |
+ * 4KB +---------------------------------+
+ * | bch_nvm_pages_sb |
+ * 8KB +---------------------------------+ <--- bch_nvm_pages_sb.bch_owner_list_head
+ * | bch_owner_list_head |
+ * | |
+ * 16KB +---------------------------------+ <--- bch_owner_list_head.heads[0].recs[0]
+ * | bch_nvm_pgalloc_recs |
+ * | (nvm pages internal usage) |
+ * 24KB +---------------------------------+
+ * | |
+ * | |
+ * 16MB +---------------------------------+
+ * | allocable nvm pages |
+ * | for buddy allocator |
+ * end +---------------------------------+
+ *
+ *
+ *
+ * Memory layout on nvdimm namespace N
+ * (doesn't have owner list)
+ *
+ * 0 +---------------------------------+
+ * | |
+ * 4KB +---------------------------------+
+ * | bch_nvm_pages_sb |
+ * 8KB +---------------------------------+
+ * | |
+ * | |
+ * | |
+ * | |
+ * | |
+ * | |
+ * 16MB +---------------------------------+
+ * | allocable nvm pages |
+ * | for buddy allocator |
+ * end +---------------------------------+
+ *
+ */
+
+#include <linux/types.h>
+
+/* In sectors */
+#define BCH_NVM_PAGES_SB_OFFSET 4096
+#define BCH_NVM_PAGES_OFFSET (16 << 20)
+
+#define BCH_NVM_PAGES_LABEL_SIZE 32
+#define BCH_NVM_PAGES_NAMESPACES_MAX 8
+
+#define BCH_NVM_PAGES_OWNER_LIST_HEAD_OFFSET (8<<10)
+#define BCH_NVM_PAGES_SYS_RECS_HEAD_OFFSET (16<<10)
+
+#define BCH_NVM_PAGES_SB_VERSION 0
+#define BCH_NVM_PAGES_SB_VERSION_MAX 0
+
+static const char bch_nvm_pages_magic[] = {
+ 0x17, 0xbd, 0x53, 0x7f, 0x1b, 0x23, 0xd6, 0x83,
+ 0x46, 0xa4, 0xf8, 0x28, 0x17, 0xda, 0xec, 0xa9 };
+static const char bch_nvm_pages_pgalloc_magic[] = {
+ 0x39, 0x25, 0x3f, 0xf7, 0x27, 0x17, 0xd0, 0xb9,
+ 0x10, 0xe6, 0xd2, 0xda, 0x38, 0x68, 0x26, 0xae };
+
+struct bch_pgalloc_rec {
+ __u32 pgoff;
+ __u32 nr;
+};
+
+struct bch_nvm_pgalloc_recs {
+union {
+ struct {
+ struct bch_nvm_pages_owner_head *owner;
+ struct bch_nvm_pgalloc_recs *next;
+ __u8 magic[16];
+ __u8 owner_uuid[16];
+ __u32 size;
+ __u32 used;
+ __u64 _pad[4];
+ struct bch_pgalloc_rec recs[];
+ };
+ __u8 pad[8192];
+};
+};
+#define BCH_MAX_RECS \
+ ((sizeof(struct bch_nvm_pgalloc_recs) - \
+ offsetof(struct bch_nvm_pgalloc_recs, recs)) / \
+ sizeof(struct bch_pgalloc_rec))
+
+struct bch_nvm_pages_owner_head {
+ __u8 uuid[16];
+ char label[BCH_NVM_PAGES_LABEL_SIZE];
+ /* Per-namespace own lists */
+ struct bch_nvm_pgalloc_recs *recs[BCH_NVM_PAGES_NAMESPACES_MAX];
+};
+
+/* heads[0] is always for nvm_pages internal usage */
+struct bch_owner_list_head {
+union {
+ struct {
+ __u32 size;
+ __u32 used;
+ __u64 _pad[4];
+ struct bch_nvm_pages_owner_head heads[];
+ };
+ __u8 pad[8192];
+};
+};
+#define BCH_MAX_OWNER_LIST \
+ ((sizeof(struct bch_owner_list_head) - \
+ offsetof(struct bch_owner_list_head, heads)) / \
+ sizeof(struct bch_nvm_pages_owner_head))
+
+/* The on-media bit order is local CPU order */
+struct bch_nvm_pages_sb {
+ __u64 csum;
+ __u64 ns_start;
+ __u64 sb_offset;
+ __u64 version;
+ __u8 magic[16];
+ __u8 uuid[16];
+ __u32 page_size;
+ __u32 total_namespaces_nr;
+ __u32 this_namespace_nr;
+ union {
+ __u8 set_uuid[16];
+ __u64 set_magic;
+ };
+
+ __u64 flags;
+ __u64 seq;
+
+ __u64 feature_compat;
+ __u64 feature_incompat;
+ __u64 feature_ro_compat;
+
+ /* For allocable nvm pages from buddy systems */
+ __u64 pages_offset;
+ __u64 pages_total;
+
+ __u64 pad[8];
+
+ /* Only on the first name space */
+ struct bch_owner_list_head *owner_list_head;
+
+ /* Just for csum_set() */
+ __u32 keys;
+ __u64 d[0];
+};
+
+#endif /* _UAPI_BCACHE_NVM_H */
--
2.26.2
^ permalink raw reply related [flat|nested] 26+ messages in thread
* Re: [PATCH 07/20] bcache: add initial data structures for nvm pages
2021-02-10 5:07 ` [PATCH 07/20] bcache: add initial data structures for nvm pages Coly Li
@ 2021-02-10 15:09 ` Jens Axboe
2021-02-11 3:58 ` Coly Li
0 siblings, 1 reply; 26+ messages in thread
From: Jens Axboe @ 2021-02-10 15:09 UTC (permalink / raw)
To: Coly Li; +Cc: linux-bcache, linux-block, Jianpeng Ma, Qiaowei Ren
On 2/9/21 10:07 PM, Coly Li wrote:
> +struct bch_nvm_pgalloc_recs {
> +union {
> + struct {
> + struct bch_nvm_pages_owner_head *owner;
> + struct bch_nvm_pgalloc_recs *next;
> + __u8 magic[16];
> + __u8 owner_uuid[16];
> + __u32 size;
> + __u32 used;
> + __u64 _pad[4];
> + struct bch_pgalloc_rec recs[];
> + };
> + __u8 pad[8192];
> +};
> +};
This doesn't look right in a user header; any user API should be 32-bit
and 64-bit agnostic.
> +struct bch_nvm_pages_owner_head {
> + __u8 uuid[16];
> + char label[BCH_NVM_PAGES_LABEL_SIZE];
> + /* Per-namespace own lists */
> + struct bch_nvm_pgalloc_recs *recs[BCH_NVM_PAGES_NAMESPACES_MAX];
> +};
Same here.
> +/* heads[0] is always for nvm_pages internal usage */
> +struct bch_owner_list_head {
> +union {
> + struct {
> + __u32 size;
> + __u32 used;
> + __u64 _pad[4];
> + struct bch_nvm_pages_owner_head heads[];
> + };
> + __u8 pad[8192];
> +};
> +};
And here.
> +#define BCH_MAX_OWNER_LIST \
> + ((sizeof(struct bch_owner_list_head) - \
> + offsetof(struct bch_owner_list_head, heads)) / \
> + sizeof(struct bch_nvm_pages_owner_head))
> +
> +/* The on-media bit order is local CPU order */
> +struct bch_nvm_pages_sb {
> + __u64 csum;
> + __u64 ns_start;
> + __u64 sb_offset;
> + __u64 version;
> + __u8 magic[16];
> + __u8 uuid[16];
> + __u32 page_size;
> + __u32 total_namespaces_nr;
> + __u32 this_namespace_nr;
> + union {
> + __u8 set_uuid[16];
> + __u64 set_magic;
> + };
This doesn't look like it packs right either.
> +
> + __u64 flags;
> + __u64 seq;
> +
> + __u64 feature_compat;
> + __u64 feature_incompat;
> + __u64 feature_ro_compat;
> +
> + /* For allocable nvm pages from buddy systems */
> + __u64 pages_offset;
> + __u64 pages_total;
> +
> + __u64 pad[8];
> +
> + /* Only on the first name space */
> + struct bch_owner_list_head *owner_list_head;
And here's another pointer...
--
Jens Axboe
* Re: [PATCH 07/20] bcache: add initial data structures for nvm pages
2021-02-10 15:09 ` Jens Axboe
@ 2021-02-11 3:58 ` Coly Li
0 siblings, 0 replies; 26+ messages in thread
From: Coly Li @ 2021-02-11 3:58 UTC (permalink / raw)
To: Jens Axboe; +Cc: linux-bcache, linux-block, Jianpeng Ma, Qiaowei Ren
On 2/10/21 11:09 PM, Jens Axboe wrote:
> On 2/9/21 10:07 PM, Coly Li wrote:
>> +struct bch_nvm_pgalloc_recs {
>> +union {
>> + struct {
>> + struct bch_nvm_pages_owner_head *owner;
>> + struct bch_nvm_pgalloc_recs *next;
>> + __u8 magic[16];
>> + __u8 owner_uuid[16];
>> + __u32 size;
>> + __u32 used;
>> + __u64 _pad[4];
>> + struct bch_pgalloc_rec recs[];
>> + };
>> + __u8 pad[8192];
>> +};
>> +};
>
Hi Jens,
> This doesn't look right in a user header, any user API should be 32-bit
> and 64-bit agnostic.
The above data structure is stored in NVDIMM as the allocator's metadata.
It is designed to be directly accessed (in a future update) as an
in-memory object, but stored on non-volatile memory like an on-disk
data structure. To me it is fine to use unsigned int/long/long long to
define the members, because the nvdimm driver only works on 64-bit
platforms. It is just unclear to me which form/style I should use to
define such a data structure: on one side it is stored on non-volatile
media, on the other side it is accessed directly as an in-memory object...
>
>> +struct bch_nvm_pages_owner_head {
>> + __u8 uuid[16];
>> + char label[BCH_NVM_PAGES_LABEL_SIZE];
>> + /* Per-namespace own lists */
>> + struct bch_nvm_pgalloc_recs *recs[BCH_NVM_PAGES_NAMESPACES_MAX];
>> +};
>
> Same here.
For the above pointer the reason is the same. In a later version, such
an object on NVDIMM will be referenced directly by an in-memory pointer,
as we normally do for an in-memory object.
Therefore I treat the data structure as an in-memory object once the
DAX mapping is accomplished. If it is not defined as a typed pointer, I
have to cast it from (void *) every time I use it.
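[A portable alternative, sketched here for illustration and not part of the posted patches, is to store fixed-width byte offsets relative to the namespace base instead of raw pointers, converting at runtime after the DAX mapping is set up. The helper names below are hypothetical:]

```c
#include <assert.h>
#include <stdint.h>
#include <stddef.h>

/* Hypothetical on-media reference: a fixed-width offset, not a pointer.
 * This keeps the layout identical on 32-bit and 64-bit kernels. */
struct recs_ref {
	uint64_t recs_offset;	/* byte offset from namespace base, 0 == NULL */
};

/* Convert a stored offset back to an in-memory pointer after DAX mapping */
static inline void *ref_to_ptr(void *ns_base, uint64_t offset)
{
	return offset ? (char *)ns_base + offset : NULL;
}

/* Convert an in-memory pointer back to an on-media offset before writing */
static inline uint64_t ptr_to_ref(void *ns_base, void *ptr)
{
	return ptr ? (uint64_t)((char *)ptr - (char *)ns_base) : 0;
}
```

[The cast happens once inside the two helpers instead of at every use site, which addresses the (void *) inconvenience mentioned above.]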
>
>> +/* heads[0] is always for nvm_pages internal usage */
>> +struct bch_owner_list_head {
>> +union {
>> + struct {
>> + __u32 size;
>> + __u32 used;
>> + __u64 _pad[4];
>> + struct bch_nvm_pages_owner_head heads[];
>> + };
>> + __u8 pad[8192];
>> +};
>> +};
>
> And here.
>
>> +#define BCH_MAX_OWNER_LIST \
>> + ((sizeof(struct bch_owner_list_head) - \
>> + offsetof(struct bch_owner_list_head, heads)) / \
>> + sizeof(struct bch_nvm_pages_owner_head))
>> +
>> +/* The on-media bit order is local CPU order */
>> +struct bch_nvm_pages_sb {
>> + __u64 csum;
>> + __u64 ns_start;
>> + __u64 sb_offset;
>> + __u64 version;
>> + __u8 magic[16];
>> + __u8 uuid[16];
>> + __u32 page_size;
>> + __u32 total_namespaces_nr;
>> + __u32 this_namespace_nr;
>> + union {
>> + __u8 set_uuid[16];
>> + __u64 set_magic;
>> + };
>
> This doesn't look like it packs right either.
This is mimicked from the existing bcache code, which uses the least
significant 8 bytes of the randomly generated UUID as a magic number.
It is solid and does not change during the whole life cycle of the nvm
pages set.
>
>> +
>> + __u64 flags;
>> + __u64 seq;
>> +
>> + __u64 feature_compat;
>> + __u64 feature_incompat;
>> + __u64 feature_ro_compat;
>> +
>> + /* For allocable nvm pages from buddy systems */
>> + __u64 pages_offset;
>> + __u64 pages_total;
>> +
>> + __u64 pad[8];
>> +
>> + /* Only on the first name space */
>> + struct bch_owner_list_head *owner_list_head;
>
> And here's another pointer...
>
Same reason: I use it as an in-memory pointer.
The above definition just treats all the structures as in-memory
objects; the only difference is that they are non-volatile across reboots.
Thanks.
Coly Li
* [PATCH 08/20] bcache: initialize the nvm pages allocator
2021-02-10 5:07 [PATCH 00/20] bcache patches for Linux v5.12 Coly Li
` (6 preceding siblings ...)
2021-02-10 5:07 ` [PATCH 07/20] bcache: add initial data structures for nvm pages Coly Li
@ 2021-02-10 5:07 ` Coly Li
2021-02-10 5:07 ` [PATCH 09/20] bcache: initialization of the buddy Coly Li
` (12 subsequent siblings)
20 siblings, 0 replies; 26+ messages in thread
From: Coly Li @ 2021-02-10 5:07 UTC (permalink / raw)
To: axboe; +Cc: linux-bcache, linux-block, Jianpeng Ma, Qiaowei Ren, Coly Li
From: Jianpeng Ma <jianpeng.ma@intel.com>
This patch defines the prototype data structures in memory and
initializes the nvm pages allocator.
The nvm address space managed by this allocator can consist of many nvm
namespaces, and several namespaces can compose one nvm set, like a cache
set. For this initial implementation, only one set is supported.
Users of this nvm pages allocator need to call bch_register_namespace()
to register an nvdimm device (like /dev/pmemX) with the allocator as an
instance of struct bch_nvm_namespace.
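[The registration path boils down to: read the superblock, check the magic, check the stored sb_offset, then verify the checksum before trusting any other field. A minimal userspace model of that validation order follows; the struct is simplified from the patch and mini_csum() is only a stand-in for the kernel's csum_set():]

```c
#include <assert.h>
#include <stdint.h>
#include <stddef.h>
#include <string.h>

#define SB_OFFSET 4096

/* Simplified superblock: csum first, as in bch_nvm_pages_sb */
struct mini_sb {
	uint64_t csum;
	uint64_t sb_offset;
	uint8_t magic[16];
};

static const uint8_t expected_magic[16] = {
	0x17, 0xbd, 0x53, 0x7f, 0x1b, 0x23, 0xd6, 0x83,
	0x46, 0xa4, 0xf8, 0x28, 0x17, 0xda, 0xec, 0xa9 };

/* Stand-in for csum_set(): sum every byte after the csum field */
static uint64_t mini_csum(const struct mini_sb *sb)
{
	const uint8_t *p = (const uint8_t *)sb + sizeof(sb->csum);
	uint64_t sum = 0;
	size_t i;

	for (i = 0; i < sizeof(*sb) - sizeof(sb->csum); i++)
		sum += p[i];
	return sum;
}

/* Same check order as bch_register_namespace(): magic, offset, csum */
static int mini_validate(const struct mini_sb *sb)
{
	if (memcmp(sb->magic, expected_magic, 16))
		return -1;
	if (sb->sb_offset != SB_OFFSET)
		return -1;
	if (sb->csum != mini_csum(sb))
		return -1;
	return 0;
}
```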
Signed-off-by: Jianpeng Ma <jianpeng.ma@intel.com>
Co-authored-by: Qiaowei Ren <qiaowei.ren@intel.com>
Signed-off-by: Coly Li <colyli@suse.de>
---
drivers/md/bcache/Kconfig | 6 +
drivers/md/bcache/Makefile | 2 +-
drivers/md/bcache/nvm-pages.c | 404 ++++++++++++++++++++++++++++++++
drivers/md/bcache/nvm-pages.h | 92 ++++++++
drivers/md/bcache/super.c | 3 +
include/uapi/linux/bcache-nvm.h | 7 -
6 files changed, 506 insertions(+), 8 deletions(-)
create mode 100644 drivers/md/bcache/nvm-pages.c
create mode 100644 drivers/md/bcache/nvm-pages.h
diff --git a/drivers/md/bcache/Kconfig b/drivers/md/bcache/Kconfig
index d1ca4d059c20..fdec9905ef40 100644
--- a/drivers/md/bcache/Kconfig
+++ b/drivers/md/bcache/Kconfig
@@ -35,3 +35,9 @@ config BCACHE_ASYNC_REGISTRATION
device path into this file will returns immediately and the real
registration work is handled in kernel work queue in asynchronous
way.
+
+config BCACHE_NVM_PAGES
+ bool "NVDIMM support for bcache (EXPERIMENTAL)"
+ depends on BCACHE
+ help
+ nvm pages allocator for bcache.
diff --git a/drivers/md/bcache/Makefile b/drivers/md/bcache/Makefile
index 5b87e59676b8..948e5ed2ca66 100644
--- a/drivers/md/bcache/Makefile
+++ b/drivers/md/bcache/Makefile
@@ -4,4 +4,4 @@ obj-$(CONFIG_BCACHE) += bcache.o
bcache-y := alloc.o bset.o btree.o closure.o debug.o extents.o\
io.o journal.o movinggc.o request.o stats.o super.o sysfs.o trace.o\
- util.o writeback.o features.o
+ util.o writeback.o features.o nvm-pages.o
diff --git a/drivers/md/bcache/nvm-pages.c b/drivers/md/bcache/nvm-pages.c
new file mode 100644
index 000000000000..4fa8e2764773
--- /dev/null
+++ b/drivers/md/bcache/nvm-pages.c
@@ -0,0 +1,404 @@
+// SPDX-License-Identifier: GPL-2.0-only
+/*
+ * Nvdimm page-buddy allocator
+ *
+ * Copyright (c) 2021, Intel Corporation.
+ * Copyright (c) 2021, Qiaowei Ren <qiaowei.ren@intel.com>.
+ * Copyright (c) 2021, Jianpeng Ma <jianpeng.ma@intel.com>.
+ */
+
+#include "bcache.h"
+#include "nvm-pages.h"
+
+#include <linux/slab.h>
+#include <linux/list.h>
+#include <linux/mutex.h>
+#include <linux/dax.h>
+#include <linux/pfn_t.h>
+#include <linux/libnvdimm.h>
+#include <linux/mm_types.h>
+#include <linux/err.h>
+#include <linux/pagemap.h>
+#include <linux/bitmap.h>
+#include <linux/blkdev.h>
+
+#ifdef CONFIG_BCACHE_NVM_PAGES
+
+static const char bch_nvm_pages_magic[] = {
+ 0x17, 0xbd, 0x53, 0x7f, 0x1b, 0x23, 0xd6, 0x83,
+ 0x46, 0xa4, 0xf8, 0x28, 0x17, 0xda, 0xec, 0xa9 };
+static const char bch_nvm_pages_pgalloc_magic[] = {
+ 0x39, 0x25, 0x3f, 0xf7, 0x27, 0x17, 0xd0, 0xb9,
+ 0x10, 0xe6, 0xd2, 0xda, 0x38, 0x68, 0x26, 0xae };
+
+struct bch_nvm_set *only_set;
+
+static struct bch_owner_list *alloc_owner_list(const char *owner_uuid,
+ const char *label, int total_namespaces)
+{
+ struct bch_owner_list *owner_list;
+
+ owner_list = kzalloc(sizeof(*owner_list), GFP_KERNEL);
+ if (!owner_list)
+ return NULL;
+
+ owner_list->alloced_recs = kcalloc(total_namespaces,
+ sizeof(struct bch_nvm_alloced_recs *), GFP_KERNEL);
+ if (!owner_list->alloced_recs) {
+ kfree(owner_list);
+ return NULL;
+ }
+
+ if (owner_uuid)
+ memcpy(owner_list->owner_uuid, owner_uuid, 16);
+ if (label)
+ memcpy(owner_list->label, label, BCH_NVM_PAGES_LABEL_SIZE);
+
+ return owner_list;
+}
+
+static void release_extents(struct bch_nvm_alloced_recs *extents)
+{
+ struct list_head *list = extents->extent_head.next;
+ struct bch_extent *extent;
+
+ while (list != &extents->extent_head) {
+ extent = container_of(list, struct bch_extent, list);
+ list_del(list);
+ kfree(extent);
+ list = extents->extent_head.next;
+ }
+ kfree(extents);
+}
+
+static void release_owner_info(struct bch_nvm_set *nvm_set)
+{
+ struct bch_owner_list *owner_list;
+ int i, j;
+
+ for (i = 0; i < nvm_set->owner_list_used; i++) {
+ owner_list = nvm_set->owner_lists[i];
+ for (j = 0; j < nvm_set->total_namespaces_nr; j++) {
+ if (owner_list->alloced_recs[j])
+ release_extents(owner_list->alloced_recs[j]);
+ }
+ kfree(owner_list->alloced_recs);
+ kfree(owner_list);
+ }
+ kfree(nvm_set->owner_lists);
+}
+
+static void release_nvm_namespaces(struct bch_nvm_set *nvm_set)
+{
+ int i;
+
+ for (i = 0; i < nvm_set->total_namespaces_nr; i++) {
+ blkdev_put(nvm_set->nss[i]->bdev, FMODE_READ|FMODE_WRITE|FMODE_EXEC);
+ kfree(nvm_set->nss[i]);
+ }
+
+ kfree(nvm_set->nss);
+}
+
+static void release_nvm_set(struct bch_nvm_set *nvm_set)
+{
+ release_nvm_namespaces(nvm_set);
+ release_owner_info(nvm_set);
+ kfree(nvm_set);
+}
+
+static void *nvm_pgoff_to_vaddr(struct bch_nvm_namespace *ns, pgoff_t pgoff)
+{
+ return ns->kaddr + (pgoff << PAGE_SHIFT);
+}
+
+static int init_owner_info(struct bch_nvm_namespace *ns)
+{
+ struct bch_owner_list_head *owner_list_head;
+ struct bch_nvm_pages_owner_head *owner_head;
+ struct bch_nvm_pgalloc_recs *nvm_pgalloc_recs;
+ struct bch_owner_list *owner_list;
+ struct bch_nvm_alloced_recs *extents;
+ struct bch_extent *extent;
+ u32 i, j, k;
+
+ owner_list_head = (struct bch_owner_list_head *)
+ (ns->kaddr + BCH_NVM_PAGES_OWNER_LIST_HEAD_OFFSET);
+
+ mutex_lock(&only_set->lock);
+ only_set->owner_list_size = owner_list_head->size;
+ only_set->owner_list_used = owner_list_head->used;
+
+ for (i = 0; i < owner_list_head->used; i++) {
+ owner_head = &owner_list_head->heads[i];
+ owner_list = alloc_owner_list(owner_head->uuid, owner_head->label,
+ only_set->total_namespaces_nr);
+ if (!owner_list) {
+ mutex_unlock(&only_set->lock);
+ return -ENOMEM;
+ }
+
+ for (j = 0; j < only_set->total_namespaces_nr; j++) {
+ if (!only_set->nss[j] || !owner_head->recs[j])
+ continue;
+
+ nvm_pgalloc_recs = (struct bch_nvm_pgalloc_recs *)
+ ((long)owner_head->recs[j] + ns->kaddr);
+ if (memcmp(nvm_pgalloc_recs->magic, bch_nvm_pages_pgalloc_magic, 16)) {
+ pr_info("invalid bch_nvm_pages_pgalloc_magic\n");
+ mutex_unlock(&only_set->lock);
+ return -EINVAL;
+ }
+
+ extents = kzalloc(sizeof(*extents), GFP_KERNEL);
+ if (!extents) {
+ mutex_unlock(&only_set->lock);
+ return -ENOMEM;
+ }
+
+ extents->ns = only_set->nss[j];
+ INIT_LIST_HEAD(&extents->extent_head);
+ owner_list->alloced_recs[j] = extents;
+
+ do {
+ struct bch_pgalloc_rec *rec;
+
+ for (k = 0; k < nvm_pgalloc_recs->used; k++) {
+ rec = &nvm_pgalloc_recs->recs[k];
+ extent = kzalloc(sizeof(*extent), GFP_KERNEL);
+ if (!extent) {
+ mutex_unlock(&only_set->lock);
+ return -ENOMEM;
+ }
+ extent->kaddr = nvm_pgoff_to_vaddr(extents->ns, rec->pgoff);
+ extent->nr = rec->nr;
+ list_add_tail(&extent->list, &extents->extent_head);
+ }
+ extents->nr += nvm_pgalloc_recs->used;
+
+ if (nvm_pgalloc_recs->next) {
+ nvm_pgalloc_recs = (struct bch_nvm_pgalloc_recs *)
+ ((long)nvm_pgalloc_recs->next + ns->kaddr);
+ if (memcmp(nvm_pgalloc_recs->magic,
+ bch_nvm_pages_pgalloc_magic, 16)) {
+ pr_info("invalid bch_nvm_pages_pgalloc_magic\n");
+ mutex_unlock(&only_set->lock);
+ return -EINVAL;
+ }
+ } else
+ nvm_pgalloc_recs = NULL;
+ } while (nvm_pgalloc_recs);
+ }
+ only_set->owner_lists[i] = owner_list;
+ owner_list->nvm_set = only_set;
+ }
+ mutex_unlock(&only_set->lock);
+
+ return 0;
+}
+
+static bool attach_nvm_set(struct bch_nvm_namespace *ns)
+{
+ bool rc = true;
+
+ mutex_lock(&only_set->lock);
+ if (only_set->nss) {
+ if (memcmp(ns->sb.set_uuid, only_set->set_uuid, 16)) {
+ pr_info("namespace id doesn't match nvm set\n");
+ rc = false;
+ goto unlock;
+ }
+
+ if (only_set->nss[ns->sb.this_namespace_nr]) {
+ pr_info("already has the same position(%d) nvm\n",
+ ns->sb.this_namespace_nr);
+ rc = false;
+ goto unlock;
+ }
+ } else {
+ memcpy(only_set->set_uuid, ns->sb.set_uuid, 16);
+ only_set->total_namespaces_nr = ns->sb.total_namespaces_nr;
+ only_set->nss = kcalloc(only_set->total_namespaces_nr,
+ sizeof(struct bch_nvm_namespace *), GFP_KERNEL);
+ only_set->owner_lists = kcalloc(BCH_MAX_OWNER_LIST,
+ sizeof(struct bch_owner_list *), GFP_KERNEL);
+ if (!only_set->nss || !only_set->owner_lists) {
+ pr_info("can't alloc nss or owner_list\n");
+ kfree(only_set->nss);
+ kfree(only_set->owner_lists);
+ rc = false;
+ goto unlock;
+ }
+ }
+
+ only_set->nss[ns->sb.this_namespace_nr] = ns;
+
+unlock:
+ mutex_unlock(&only_set->lock);
+ return rc;
+}
+
+static int read_nvdimm_meta_super(struct block_device *bdev,
+ struct bch_nvm_namespace *ns)
+{
+ struct page *page;
+ struct bch_nvm_pages_sb *sb;
+
+ page = read_cache_page_gfp(bdev->bd_inode->i_mapping,
+ BCH_NVM_PAGES_SB_OFFSET >> PAGE_SHIFT, GFP_KERNEL);
+
+ if (IS_ERR(page))
+ return -EIO;
+
+ sb = page_address(page) + offset_in_page(BCH_NVM_PAGES_SB_OFFSET);
+ memcpy(&ns->sb, sb, sizeof(struct bch_nvm_pages_sb));
+
+ put_page(page);
+
+ return 0;
+}
+
+struct bch_nvm_namespace *bch_register_namespace(const char *dev_path)
+{
+ struct bch_nvm_namespace *ns;
+ int err;
+ pgoff_t pgoff;
+ char buf[BDEVNAME_SIZE];
+ struct block_device *bdev;
+ uint64_t expected_csum;
+ int id;
+ char *path = NULL;
+
+ path = kstrndup(dev_path, 512, GFP_KERNEL);
+ if (!path) {
+ pr_err("kstrndup failed\n");
+ return ERR_PTR(-ENOMEM);
+ }
+
+ bdev = blkdev_get_by_path(strim(path),
+ FMODE_READ|FMODE_WRITE|FMODE_EXEC,
+ only_set);
+ if (IS_ERR(bdev)) {
+ pr_info("get %s error\n", dev_path);
+ kfree(path);
+ return ERR_PTR(PTR_ERR(bdev));
+ }
+
+ ns = kmalloc(sizeof(struct bch_nvm_namespace), GFP_KERNEL);
+ if (!ns)
+ goto bdput;
+
+ err = -EIO;
+ if (read_nvdimm_meta_super(bdev, ns)) {
+ pr_info("%s read nvdimm meta super block failed.\n",
+ bdevname(bdev, buf));
+ goto free_ns;
+ }
+
+ if (memcmp(ns->sb.magic, bch_nvm_pages_magic, 16)) {
+ pr_info("invalid bch_nvm_pages_magic\n");
+ goto free_ns;
+ }
+
+ if (ns->sb.sb_offset != BCH_NVM_PAGES_SB_OFFSET) {
+ pr_info("invalid superblock offset\n");
+ goto free_ns;
+ }
+
+ if (ns->sb.total_namespaces_nr != 1) {
+ pr_info("only one nvm device is supported\n");
+ goto free_ns;
+ }
+
+ expected_csum = csum_set(&ns->sb);
+ if (expected_csum != ns->sb.csum) {
+ pr_info("csum does not match the expected value\n");
+ goto free_ns;
+ }
+
+ err = -EOPNOTSUPP;
+ if (!bdev_dax_supported(bdev, ns->sb.page_size)) {
+ pr_info("%s doesn't support DAX\n", bdevname(bdev, buf));
+ goto free_ns;
+ }
+
+ err = -EINVAL;
+ if (bdev_dax_pgoff(bdev, 0, ns->sb.page_size, &pgoff)) {
+ pr_info("invalid offset of %s\n", bdevname(bdev, buf));
+ goto free_ns;
+ }
+
+ err = -ENOMEM;
+ ns->dax_dev = fs_dax_get_by_bdev(bdev);
+ if (!ns->dax_dev) {
+ pr_info("can't get dax device by %s\n", bdevname(bdev, buf));
+ goto free_ns;
+ }
+
+ err = -EINVAL;
+ id = dax_read_lock();
+ if (dax_direct_access(ns->dax_dev, pgoff, ns->sb.pages_total,
+ &ns->kaddr, &ns->start_pfn) <= 0) {
+ pr_info("dax_direct_access error\n");
+ dax_read_unlock(id);
+ goto free_ns;
+ }
+ dax_read_unlock(id);
+
+
+ err = -EEXIST;
+ if (!attach_nvm_set(ns))
+ goto free_ns;
+
+ ns->page_size = ns->sb.page_size;
+ ns->pages_offset = ns->sb.pages_offset;
+ ns->pages_total = ns->sb.pages_total;
+ ns->free = 0;
+ ns->bdev = bdev;
+ ns->nvm_set = only_set;
+
+ mutex_init(&ns->lock);
+
+ if (ns->sb.this_namespace_nr == 0) {
+ pr_info("only the first namespace contains owner info\n");
+ err = init_owner_info(ns);
+ if (err < 0) {
+ pr_info("init_owner_info met error %d\n", err);
+ goto free_ns;
+ }
+ }
+
+ kfree(path);
+ return ns;
+free_ns:
+ kfree(ns);
+bdput:
+ blkdev_put(bdev, FMODE_READ|FMODE_WRITE|FMODE_EXEC);
+ kfree(path);
+ return ERR_PTR(err);
+}
+EXPORT_SYMBOL_GPL(bch_register_namespace);
+
+int __init bch_nvm_init(void)
+{
+ only_set = kzalloc(sizeof(*only_set), GFP_KERNEL);
+ if (!only_set)
+ return -ENOMEM;
+
+ only_set->total_namespaces_nr = 0;
+ only_set->owner_lists = NULL;
+ only_set->nss = NULL;
+
+ mutex_init(&only_set->lock);
+
+ pr_info("bcache nvm init\n");
+ return 0;
+}
+
+void bch_nvm_exit(void)
+{
+ release_nvm_set(only_set);
+ pr_info("bcache nvm exit\n");
+}
+
+#endif
diff --git a/drivers/md/bcache/nvm-pages.h b/drivers/md/bcache/nvm-pages.h
new file mode 100644
index 000000000000..1b10b4b6db0f
--- /dev/null
+++ b/drivers/md/bcache/nvm-pages.h
@@ -0,0 +1,92 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+
+#ifndef _BCACHE_NVM_PAGES_H
+#define _BCACHE_NVM_PAGES_H
+
+#include <linux/bcache-nvm.h>
+
+/*
+ * Bcache NVDIMM in memory data structures
+ */
+
+/*
+ * The following three structures in memory records which page(s) allocated
+ * to which owner. After reboot from power failure, they will be initialized
+ * based on nvm pages superblock in NVDIMM device.
+ */
+struct bch_extent {
+ void *kaddr;
+ u32 nr;
+ struct list_head list;
+};
+
+struct bch_nvm_alloced_recs {
+ u32 nr;
+ struct bch_nvm_namespace *ns;
+ struct list_head extent_head;
+};
+
+struct bch_owner_list {
+ u8 owner_uuid[16];
+ char label[BCH_NVM_PAGES_LABEL_SIZE];
+
+ struct bch_nvm_set *nvm_set;
+ struct bch_nvm_alloced_recs **alloced_recs;
+};
+
+struct bch_nvm_namespace {
+ struct bch_nvm_pages_sb sb;
+ void *kaddr;
+
+ u8 uuid[16];
+ u64 free;
+ u32 page_size;
+ u64 pages_offset;
+ u64 pages_total;
+ pfn_t start_pfn;
+
+ struct dax_device *dax_dev;
+ struct block_device *bdev;
+ struct bch_nvm_set *nvm_set;
+
+ struct mutex lock;
+};
+
+/*
+ * A set of namespaces. Currently only one set can be supported.
+ */
+struct bch_nvm_set {
+ u8 set_uuid[16];
+ u32 total_namespaces_nr;
+
+ u32 owner_list_size;
+ u32 owner_list_used;
+ struct bch_owner_list **owner_lists;
+
+ struct bch_nvm_namespace **nss;
+
+ struct mutex lock;
+};
+extern struct bch_nvm_set *only_set;
+
+#ifdef CONFIG_BCACHE_NVM_PAGES
+
+struct bch_nvm_namespace *bch_register_namespace(const char *dev_path);
+int bch_nvm_init(void);
+void bch_nvm_exit(void);
+
+#else
+
+static inline struct bch_nvm_namespace *bch_register_namespace(const char *dev_path)
+{
+ return NULL;
+}
+static inline int bch_nvm_init(void)
+{
+ return 0;
+}
+static inline void bch_nvm_exit(void) { }
+
+#endif /* CONFIG_BCACHE_NVM_PAGES */
+
+#endif /* _BCACHE_NVM_PAGES_H */
diff --git a/drivers/md/bcache/super.c b/drivers/md/bcache/super.c
index 0228ccb293fc..915f1ea4dfd9 100644
--- a/drivers/md/bcache/super.c
+++ b/drivers/md/bcache/super.c
@@ -14,6 +14,7 @@
#include "request.h"
#include "writeback.h"
#include "features.h"
+#include "nvm-pages.h"
#include <linux/blkdev.h>
#include <linux/debugfs.h>
@@ -2816,6 +2817,7 @@ static void bcache_exit(void)
{
bch_debug_exit();
bch_request_exit();
+ bch_nvm_exit();
if (bcache_kobj)
kobject_put(bcache_kobj);
if (bcache_wq)
@@ -2914,6 +2916,7 @@ static int __init bcache_init(void)
bch_debug_init();
closure_debug_init();
+ bch_nvm_init();
bcache_is_reboot = false;
diff --git a/include/uapi/linux/bcache-nvm.h b/include/uapi/linux/bcache-nvm.h
index 61108bf2a63e..0a6dc4a6e470 100644
--- a/include/uapi/linux/bcache-nvm.h
+++ b/include/uapi/linux/bcache-nvm.h
@@ -99,13 +99,6 @@
#define BCH_NVM_PAGES_SB_VERSION 0
#define BCH_NVM_PAGES_SB_VERSION_MAX 0
-static const char bch_nvm_pages_magic[] = {
- 0x17, 0xbd, 0x53, 0x7f, 0x1b, 0x23, 0xd6, 0x83,
- 0x46, 0xa4, 0xf8, 0x28, 0x17, 0xda, 0xec, 0xa9 };
-static const char bch_nvm_pages_pgalloc_magic[] = {
- 0x39, 0x25, 0x3f, 0xf7, 0x27, 0x17, 0xd0, 0xb9,
- 0x10, 0xe6, 0xd2, 0xda, 0x38, 0x68, 0x26, 0xae };
-
struct bch_pgalloc_rec {
__u32 pgoff;
__u32 nr;
--
2.26.2
* [PATCH 09/20] bcache: initialization of the buddy
2021-02-10 5:07 [PATCH 00/20] bcache patches for Linux v5.12 Coly Li
` (7 preceding siblings ...)
2021-02-10 5:07 ` [PATCH 08/20] bcache: initialize the nvm pages allocator Coly Li
@ 2021-02-10 5:07 ` Coly Li
2021-02-10 5:07 ` [PATCH 10/20] bcache: bch_nvm_alloc_pages() " Coly Li
` (11 subsequent siblings)
20 siblings, 0 replies; 26+ messages in thread
From: Coly Li @ 2021-02-10 5:07 UTC (permalink / raw)
To: axboe; +Cc: linux-bcache, linux-block, Jianpeng Ma, Qiaowei Ren, Coly Li
From: Jianpeng Ma <jianpeng.ma@intel.com>
This nvm pages allocator implements a simple buddy to manage the nvm
address space. This patch initializes the buddy for a new namespace.
The unit of alloc/free in the buddy is a page. DAX devices have their
own struct page (in DRAM or PMEM).
struct { /* ZONE_DEVICE pages */
/** @pgmap: Points to the hosting device page map. */
struct dev_pagemap *pgmap;
void *zone_device_data;
/*
* ZONE_DEVICE private pages are counted as being
* mapped so the next 3 words hold the mapping, index,
* and private fields from the source anonymous or
* page cache page while the page is migrated to device
* private memory.
* ZONE_DEVICE MEMORY_DEVICE_FS_DAX pages also
* use the mapping, index, and private fields when
* pmem backed DAX files are mapped.
*/
};
ZONE_DEVICE pages only use pgmap; the other 4 words [16/32 bytes] are
unused. So the second/third words are used as a 'struct list_head' to
link the page into the buddy free lists. The fourth word (normally
struct page::index) stores pgoff, the page offset within the dax
device, and the fifth word (normally struct page::private) stores the
buddy order. page_type will be used to store buddy flags.
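[The free-space carving in init_nvm_free_space() walks each clear region of the bitmap and repeatedly peels off the largest power-of-two block that is both naturally aligned at the current page offset and small enough to fit. A standalone sketch of that decomposition; BCH_MAX_ORDER is taken from the patch, carve_region() is an illustrative name:]

```c
#include <assert.h>
#include <stddef.h>

#define BCH_MAX_ORDER 20

/*
 * Split a free region [pgoff_start, pgoff_start + pages) into buddy
 * blocks: each step picks the highest order i such that the block is
 * naturally aligned (pgoff_start % (1 << i) == 0) and fits
 * (pages >= 1 << i), mirroring the loop in init_nvm_free_space().
 * Returns the number of blocks and records their orders.
 */
static size_t carve_region(unsigned long pgoff_start, long long pages,
			   int *orders, size_t max_orders)
{
	size_t n = 0;
	int i;

	while (pages > 0 && n < max_orders) {
		for (i = BCH_MAX_ORDER - 1; i > 0; i--)
			if ((pgoff_start % (1UL << i)) == 0 &&
			    pages >= (1LL << i))
				break;
		/* falls through to order 0 when no larger block fits */
		orders[n++] = i;
		pgoff_start += 1UL << i;
		pages -= 1LL << i;
	}
	return n;
}
```

[For example, a 13-page region starting at pgoff 3 carves into blocks of order 0, 2 and 3 (1 + 4 + 8 pages), while a 16-page region at pgoff 0 becomes a single order-4 block.]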
Signed-off-by: Jianpeng Ma <jianpeng.ma@intel.com>
Co-authored-by: Qiaowei Ren <qiaowei.ren@intel.com>
Signed-off-by: Coly Li <colyli@suse.de>
---
drivers/md/bcache/nvm-pages.c | 75 ++++++++++++++++++++++++++++++++++-
drivers/md/bcache/nvm-pages.h | 5 +++
2 files changed, 78 insertions(+), 2 deletions(-)
diff --git a/drivers/md/bcache/nvm-pages.c b/drivers/md/bcache/nvm-pages.c
index 4fa8e2764773..7efb99c0fc07 100644
--- a/drivers/md/bcache/nvm-pages.c
+++ b/drivers/md/bcache/nvm-pages.c
@@ -93,6 +93,7 @@ static void release_nvm_namespaces(struct bch_nvm_set *nvm_set)
int i;
for (i = 0; i < nvm_set->total_namespaces_nr; i++) {
+ kvfree(nvm_set->nss[i]->pages_bitmap);
blkdev_put(nvm_set->nss[i]->bdev, FMODE_READ|FMODE_WRITE|FMODE_EXEC);
kfree(nvm_set->nss[i]);
}
@@ -112,6 +113,17 @@ static void *nvm_pgoff_to_vaddr(struct bch_nvm_namespace *ns, pgoff_t pgoff)
return ns->kaddr + (pgoff << PAGE_SHIFT);
}
+static struct page *nvm_vaddr_to_page(struct bch_nvm_namespace *ns, void *addr)
+{
+ return virt_to_page(addr);
+}
+
+static inline void remove_owner_space(struct bch_nvm_namespace *ns,
+ pgoff_t pgoff, u32 nr)
+{
+ bitmap_set(ns->pages_bitmap, pgoff, nr);
+}
+
static int init_owner_info(struct bch_nvm_namespace *ns)
{
struct bch_owner_list_head *owner_list_head;
@@ -129,6 +141,8 @@ static int init_owner_info(struct bch_nvm_namespace *ns)
only_set->owner_list_size = owner_list_head->size;
only_set->owner_list_used = owner_list_head->used;
+ remove_owner_space(ns, 0, ns->pages_offset/ns->page_size);
+
for (i = 0; i < owner_list_head->used; i++) {
owner_head = &owner_list_head->heads[i];
owner_list = alloc_owner_list(owner_head->uuid, owner_head->label,
@@ -162,6 +176,8 @@ static int init_owner_info(struct bch_nvm_namespace *ns)
do {
struct bch_pgalloc_rec *rec;
+ int order;
+ struct page *page;
for (k = 0; k < nvm_pgalloc_recs->used; k++) {
rec = &nvm_pgalloc_recs->recs[k];
@@ -172,7 +188,17 @@ static int init_owner_info(struct bch_nvm_namespace *ns)
}
extent->kaddr = nvm_pgoff_to_vaddr(extents->ns, rec->pgoff);
extent->nr = rec->nr;
+ WARN_ON(!is_power_of_2(extent->nr));
+
+ /* init struct page: index/private */
+ order = ilog2(extent->nr);
+ page = nvm_vaddr_to_page(ns, extent->kaddr);
+ set_page_private(page, order);
+ page->index = rec->pgoff;
+
list_add_tail(&extent->list, &extents->extent_head);
+ /* remove already allocated space */
+ remove_owner_space(extents->ns, rec->pgoff, rec->nr);
}
extents->nr += nvm_pgalloc_recs->used;
@@ -197,6 +223,36 @@ static int init_owner_info(struct bch_nvm_namespace *ns)
return 0;
}
+static void init_nvm_free_space(struct bch_nvm_namespace *ns)
+{
+ unsigned int start, end, i;
+ struct page *page;
+ long long pages;
+ pgoff_t pgoff_start;
+
+ bitmap_for_each_clear_region(ns->pages_bitmap, start, end, 0, ns->pages_total) {
+ pgoff_start = start;
+ pages = end - start;
+
+ while (pages) {
+ for (i = BCH_MAX_ORDER - 1; i >= 0 ; i--) {
+ if ((pgoff_start % (1 << i) == 0) && (pages >= (1 << i)))
+ break;
+ }
+
+ page = nvm_vaddr_to_page(ns, nvm_pgoff_to_vaddr(ns, pgoff_start));
+ page->index = pgoff_start;
+ set_page_private(page, i);
+ __SetPageBuddy(page);
+ list_add((struct list_head *)&page->zone_device_data, &ns->free_area[i]);
+
+ pgoff_start += 1 << i;
+ pages -= 1 << i;
+ }
+ }
+
+}
+
static bool attach_nvm_set(struct bch_nvm_namespace *ns)
{
bool rc = true;
@@ -261,7 +317,7 @@ static int read_nvdimm_meta_super(struct block_device *bdev,
struct bch_nvm_namespace *bch_register_namespace(const char *dev_path)
{
struct bch_nvm_namespace *ns;
- int err;
+ int i, err;
pgoff_t pgoff;
char buf[BDEVNAME_SIZE];
struct block_device *bdev;
@@ -357,6 +413,16 @@ struct bch_nvm_namespace *bch_register_namespace(const char *dev_path)
ns->bdev = bdev;
ns->nvm_set = only_set;
+ ns->pages_bitmap = kvcalloc(BITS_TO_LONGS(ns->pages_total),
+ sizeof(unsigned long), GFP_KERNEL);
+ if (!ns->pages_bitmap) {
+ err = -ENOMEM;
+ goto free_ns;
+ }
+
+ for (i = 0; i < BCH_MAX_ORDER; i++)
+ INIT_LIST_HEAD(&ns->free_area[i]);
+
mutex_init(&ns->lock);
if (ns->sb.this_namespace_nr == 0) {
@@ -364,12 +430,17 @@ struct bch_nvm_namespace *bch_register_namespace(const char *dev_path)
err = init_owner_info(ns);
if (err < 0) {
pr_info("init_owner_info met error %d\n", err);
- goto free_ns;
+ goto free_bitmap;
}
+ /* init buddy allocator */
+ init_nvm_free_space(ns);
}
kfree(path);
return ns;
+
+free_bitmap:
+ kvfree(ns->pages_bitmap);
free_ns:
kfree(ns);
bdput:
diff --git a/drivers/md/bcache/nvm-pages.h b/drivers/md/bcache/nvm-pages.h
index 1b10b4b6db0f..ed3431daae06 100644
--- a/drivers/md/bcache/nvm-pages.h
+++ b/drivers/md/bcache/nvm-pages.h
@@ -34,6 +34,7 @@ struct bch_owner_list {
struct bch_nvm_alloced_recs **alloced_recs;
};
+#define BCH_MAX_ORDER 20
struct bch_nvm_namespace {
struct bch_nvm_pages_sb sb;
void *kaddr;
@@ -45,6 +46,10 @@ struct bch_nvm_namespace {
u64 pages_total;
pfn_t start_pfn;
+ unsigned long *pages_bitmap;
+ struct list_head free_area[BCH_MAX_ORDER];
+
+
struct dax_device *dax_dev;
struct block_device *bdev;
struct bch_nvm_set *nvm_set;
--
2.26.2
* [PATCH 10/20] bcache: bch_nvm_alloc_pages() of the buddy
2021-02-10 5:07 [PATCH 00/20] bcache patches for Linux v5.12 Coly Li
` (8 preceding siblings ...)
2021-02-10 5:07 ` [PATCH 09/20] bcache: initialization of the buddy Coly Li
@ 2021-02-10 5:07 ` Coly Li
2021-02-10 5:07 ` [PATCH 11/20] bcache: bch_nvm_free_pages() " Coly Li
` (10 subsequent siblings)
20 siblings, 0 replies; 26+ messages in thread
From: Coly Li @ 2021-02-10 5:07 UTC (permalink / raw)
To: axboe; +Cc: linux-bcache, linux-block, Jianpeng Ma, Qiaowei Ren, Coly Li
From: Jianpeng Ma <jianpeng.ma@intel.com>
This patch implements bch_nvm_alloc_pages() of the buddy allocator.
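[The allocation side follows the classic buddy pattern: scan free_area[] from the requested order upward, take the first non-empty list, then return the surplus halves of the split block to the lower-order lists. A userspace sketch of that split logic; the in-kernel version keeps this metadata in struct page, here a plain struct and malloc stand in:]

```c
#include <assert.h>
#include <stdlib.h>

#define MAX_ORDER 20

struct free_block {
	unsigned long pgoff;
	struct free_block *next;
};

/* One singly linked free list per order, all empty initially */
static struct free_block *free_area[MAX_ORDER];

static void push_block(int order, unsigned long pgoff)
{
	struct free_block *b = malloc(sizeof(*b));

	b->pgoff = pgoff;
	b->next = free_area[order];
	free_area[order] = b;
}

/* Allocate 2^order pages; split a larger block if necessary.
 * Returns the starting pgoff, or -1 if nothing large enough is free. */
static long alloc_order(int order)
{
	int i;

	for (i = order; i < MAX_ORDER; i++) {
		if (free_area[i]) {
			struct free_block *b = free_area[i];
			unsigned long pgoff = b->pgoff;

			free_area[i] = b->next;
			free(b);
			/* return the surplus buddy halves to lower orders */
			while (i > order) {
				i--;
				push_block(i, pgoff + (1UL << i));
			}
			return (long)pgoff;
		}
	}
	return -1;
}
```

[Starting from a single order-4 block at pgoff 0, an order-2 request takes pages 0..3 and leaves an order-2 buddy at pgoff 4 and an order-3 buddy at pgoff 8 on the free lists.]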
Signed-off-by: Jianpeng Ma <jianpeng.ma@intel.com>
Co-authored-by: Qiaowei Ren <qiaowei.ren@intel.com>
Signed-off-by: Coly Li <colyli@suse.de>
---
drivers/md/bcache/nvm-pages.c | 121 ++++++++++++++++++++++++++++++++++
drivers/md/bcache/nvm-pages.h | 6 ++
2 files changed, 127 insertions(+)
diff --git a/drivers/md/bcache/nvm-pages.c b/drivers/md/bcache/nvm-pages.c
index 7efb99c0fc07..0b992c17ce47 100644
--- a/drivers/md/bcache/nvm-pages.c
+++ b/drivers/md/bcache/nvm-pages.c
@@ -124,6 +124,127 @@ static inline void remove_owner_space(struct bch_nvm_namespace *ns,
bitmap_set(ns->pages_bitmap, pgoff, nr);
}
+/* If not found, create a new one when create == true */
+static struct bch_owner_list *find_owner_list(const char *owner_uuid, bool create)
+{
+ struct bch_owner_list *owner_list;
+ int i;
+
+ for (i = 0; i < only_set->owner_list_used; i++) {
+ if (!memcmp(owner_uuid, only_set->owner_lists[i]->owner_uuid, 16))
+ return only_set->owner_lists[i];
+ }
+
+ if (create) {
+ owner_list = alloc_owner_list(owner_uuid, NULL, only_set->total_namespaces_nr);
+ only_set->owner_lists[only_set->owner_list_used++] = owner_list;
+ return owner_list;
+ } else
+ return NULL;
+}
+
+static struct bch_nvm_alloced_recs *find_nvm_alloced_recs(struct bch_owner_list *owner_list,
+ struct bch_nvm_namespace *ns, bool create)
+{
+ int position = ns->sb.this_namespace_nr;
+
+ if (create && !owner_list->alloced_recs[position]) {
+ struct bch_nvm_alloced_recs *alloced_recs =
+ kzalloc(sizeof(*alloced_recs), GFP_KERNEL|__GFP_NOFAIL);
+
+ alloced_recs->ns = ns;
+ INIT_LIST_HEAD(&alloced_recs->extent_head);
+ owner_list->alloced_recs[position] = alloced_recs;
+ return alloced_recs;
+ } else
+ return owner_list->alloced_recs[position];
+}
+
+static inline void *extent_end_addr(struct bch_extent *extent)
+{
+ return extent->kaddr + ((u64)(extent->nr) << PAGE_SHIFT);
+}
+
+static void add_extent(struct bch_nvm_alloced_recs *alloced_recs, void *addr, int order)
+{
+ struct list_head *list = alloced_recs->extent_head.next;
+ struct bch_extent *extent, *tmp;
+ void *end_addr = addr + (((u64)1 << order) << PAGE_SHIFT);
+
+ while (list != &alloced_recs->extent_head) {
+ extent = container_of(list, struct bch_extent, list);
+ if (addr > extent->kaddr) {
+ list = list->next;
+ continue;
+ }
+ break;
+ }
+
+ extent = kzalloc(sizeof(*extent), GFP_KERNEL);
+ extent->kaddr = addr;
+ extent->nr = 1 << order;
+ list_add_tail(&extent->list, list);
+ alloced_recs->nr++;
+}
+
+void *bch_nvm_alloc_pages(int order, const char *owner_uuid)
+{
+ void *kaddr = NULL;
+ struct bch_owner_list *owner_list;
+ struct bch_nvm_alloced_recs *alloced_recs;
+ int i, j;
+
+ mutex_lock(&only_set->lock);
+ owner_list = find_owner_list(owner_uuid, true);
+
+ for (j = 0; j < only_set->total_namespaces_nr; j++) {
+ struct bch_nvm_namespace *ns = only_set->nss[j];
+
+ if (!ns || (ns->free < (1 << order)))
+ continue;
+
+ for (i = order; i < BCH_MAX_ORDER; i++) {
+ struct list_head *list;
+ struct page *page, *buddy_page;
+
+ if (list_empty(&ns->free_area[i]))
+ continue;
+
+ list = ns->free_area[i].next;
+ page = container_of((void *)list, struct page, zone_device_data);
+
+ list_del(list);
+
+ while (i != order) {
+ buddy_page = nvm_vaddr_to_page(ns,
+ nvm_pgoff_to_vaddr(ns, page->index + (1 << (i - 1))));
+ set_page_private(buddy_page, i - 1);
+ buddy_page->index = page->index + (1 << (i - 1));
+ __SetPageBuddy(buddy_page);
+ list_add((struct list_head *)&buddy_page->zone_device_data,
+ &ns->free_area[i - 1]);
+ i--;
+ }
+
+ set_page_private(page, order);
+ __ClearPageBuddy(page);
+ ns->free -= 1 << order;
+ kaddr = nvm_pgoff_to_vaddr(ns, page->index);
+ break;
+ }
+
+ if (i != BCH_MAX_ORDER) {
+ alloced_recs = find_nvm_alloced_recs(owner_list, ns, true);
+ add_extent(alloced_recs, kaddr, order);
+ break;
+ }
+ }
+
+ mutex_unlock(&only_set->lock);
+ return kaddr;
+}
+EXPORT_SYMBOL_GPL(bch_nvm_alloc_pages);
+
static int init_owner_info(struct bch_nvm_namespace *ns)
{
struct bch_owner_list_head *owner_list_head;
diff --git a/drivers/md/bcache/nvm-pages.h b/drivers/md/bcache/nvm-pages.h
index ed3431daae06..10157d993126 100644
--- a/drivers/md/bcache/nvm-pages.h
+++ b/drivers/md/bcache/nvm-pages.h
@@ -79,6 +79,7 @@ extern struct bch_nvm_set *only_set;
struct bch_nvm_namespace *bch_register_namespace(const char *dev_path);
int bch_nvm_init(void);
void bch_nvm_exit(void);
+void *bch_nvm_alloc_pages(int order, const char *owner_uuid);
#else
@@ -92,6 +93,11 @@ static inline int bch_nvm_init(void)
}
static inline void bch_nvm_exit(void) { }
+static inline void *bch_nvm_alloc_pages(int order, const char *owner_uuid)
+{
+ return NULL;
+}
+
#endif /* CONFIG_BCACHE_NVM_PAGES */
#endif /* _BCACHE_NVM_PAGES_H */
--
2.26.2
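[Editor's note] The `while (i != order)` loop in bch_nvm_alloc_pages() above repeatedly halves a larger free block until it reaches the requested order, returning each upper half to the next-lower free list. A minimal user-space sketch of that split arithmetic (the helper name, `MAX_ORDER`, and the freed-offset array are illustrative, not from the patch):

```c
#include <assert.h>

#define MAX_ORDER 12  /* stand-in for BCH_MAX_ORDER */

/* Split a free block of order found_order down to want_order:
 * each step frees the upper half (the buddy) at page offset
 * base + (1 << (i - 1)), mirroring the split loop in
 * bch_nvm_alloc_pages(). The freed offsets are recorded instead
 * of being linked into per-order free lists. */
static unsigned long split_block(unsigned long base_pgoff, int found_order,
                                 int want_order,
                                 unsigned long *freed_pgoff, int *nfreed)
{
    int i = found_order;

    *nfreed = 0;
    while (i != want_order) {
        /* upper half becomes a free block of order i - 1 */
        freed_pgoff[(*nfreed)++] = base_pgoff + (1UL << (i - 1));
        i--;
    }
    return base_pgoff; /* lower half of order want_order goes to the caller */
}
```

For example, splitting an order-3 block at page offset 0 down to order 0 frees blocks at offsets 4, 2 and 1, and hands offset 0 to the caller.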
* [PATCH 11/20] bcache: bch_nvm_free_pages() of the buddy
2021-02-10 5:07 [PATCH 00/20] bcache patches for Linux v5.12 Coly Li
` (9 preceding siblings ...)
2021-02-10 5:07 ` [PATCH 10/20] bcache: bch_nvm_alloc_pages() " Coly Li
@ 2021-02-10 5:07 ` Coly Li
2021-02-10 5:07 ` [PATCH 12/20] bcache: get allocated pages from specific owner Coly Li
` (9 subsequent siblings)
20 siblings, 0 replies; 26+ messages in thread
From: Coly Li @ 2021-02-10 5:07 UTC (permalink / raw)
To: axboe; +Cc: linux-bcache, linux-block, Jianpeng Ma, Qiaowei Ren, Coly Li
From: Jianpeng Ma <jianpeng.ma@intel.com>
This patch implements bch_nvm_free_pages() of the buddy allocator.
Signed-off-by: Jianpeng Ma <jianpeng.ma@intel.com>
Co-authored-by: Qiaowei Ren <qiaowei.ren@intel.com>
Signed-off-by: Coly Li <colyli@suse.de>
---
drivers/md/bcache/nvm-pages.c | 143 ++++++++++++++++++++++++++++++++--
drivers/md/bcache/nvm-pages.h | 3 +
2 files changed, 138 insertions(+), 8 deletions(-)
diff --git a/drivers/md/bcache/nvm-pages.c b/drivers/md/bcache/nvm-pages.c
index 0b992c17ce47..b40bdbac873f 100644
--- a/drivers/md/bcache/nvm-pages.c
+++ b/drivers/md/bcache/nvm-pages.c
@@ -168,8 +168,7 @@ static inline void *extent_end_addr(struct bch_extent *extent)
static void add_extent(struct bch_nvm_alloced_recs *alloced_recs, void *addr, int order)
{
struct list_head *list = alloced_recs->extent_head.next;
- struct bch_extent *extent, *tmp;
- void *end_addr = addr + (((u64)1 << order) << PAGE_SHIFT);
+ struct bch_extent *extent;
while (list != &alloced_recs->extent_head) {
extent = container_of(list, struct bch_extent, list);
@@ -187,6 +186,136 @@ static void add_extent(struct bch_nvm_alloced_recs *alloced_recs, void *addr, in
alloced_recs->nr++;
}
+static inline void *nvm_end_addr(struct bch_nvm_namespace *ns)
+{
+ return ns->kaddr + (ns->pages_total << PAGE_SHIFT);
+}
+
+static inline bool in_nvm_range(struct bch_nvm_namespace *ns,
+ void *start_addr, void *end_addr)
+{
+ return (start_addr >= ns->kaddr) && (end_addr <= nvm_end_addr(ns));
+}
+
+static struct bch_nvm_namespace *find_nvm_by_addr(void *addr, int order)
+{
+ int i;
+ struct bch_nvm_namespace *ns;
+
+ for (i = 0; i < only_set->total_namespaces_nr; i++) {
+ ns = only_set->nss[i];
+ if (ns && in_nvm_range(ns, addr, addr + (1 << order)))
+ return ns;
+ }
+ return NULL;
+}
+
+static int remove_extent(struct bch_nvm_alloced_recs *alloced_recs, void *addr, int order)
+{
+ struct list_head *list = alloced_recs->extent_head.next;
+ struct bch_extent *extent;
+
+ while (list != &alloced_recs->extent_head) {
+ extent = container_of(list, struct bch_extent, list);
+
+ if (addr < extent->kaddr)
+ return -ENOENT;
+ if (addr > extent->kaddr) {
+ list = list->next;
+ continue;
+ }
+
+ WARN_ON(extent->nr != (1 << order));
+ list_del(list);
+ kfree(extent);
+ alloced_recs->nr--;
+ break;
+ }
+ return (list == &alloced_recs->extent_head) ? -ENOENT : 0;
+}
+
+static void __free_space(struct bch_nvm_namespace *ns, void *addr, int order)
+{
+ unsigned int add_pages = (1 << order);
+ pgoff_t pgoff;
+ struct page *page;
+
+ page = nvm_vaddr_to_page(ns, addr);
+ WARN_ON((!page) || (page->private != order));
+ pgoff = page->index;
+
+ while (order < BCH_MAX_ORDER - 1) {
+ struct page *buddy_page;
+
+ pgoff_t buddy_pgoff = pgoff ^ (1 << order);
+ pgoff_t parent_pgoff = pgoff & ~(1 << order);
+
+ if ((parent_pgoff + (1 << (order + 1)) > ns->pages_total))
+ break;
+
+ buddy_page = nvm_vaddr_to_page(ns, nvm_pgoff_to_vaddr(ns, buddy_pgoff));
+ WARN_ON(!buddy_page);
+
+ if (PageBuddy(buddy_page) && (buddy_page->private == order)) {
+ list_del((struct list_head *)&buddy_page->zone_device_data);
+ __ClearPageBuddy(buddy_page);
+ pgoff = parent_pgoff;
+ order++;
+ continue;
+ }
+ break;
+ }
+
+ page = nvm_vaddr_to_page(ns, nvm_pgoff_to_vaddr(ns, pgoff));
+ WARN_ON(!page);
+ list_add((struct list_head *)&page->zone_device_data, &ns->free_area[order]);
+ page->index = pgoff;
+ set_page_private(page, order);
+ __SetPageBuddy(page);
+ ns->free += add_pages;
+}
+
+void bch_nvm_free_pages(void *addr, int order, const char *owner_uuid)
+{
+ struct bch_nvm_namespace *ns;
+ struct bch_owner_list *owner_list;
+ struct bch_nvm_alloced_recs *alloced_recs;
+ int r;
+
+ mutex_lock(&only_set->lock);
+
+ ns = find_nvm_by_addr(addr, order);
+ if (!ns) {
+ pr_info("can't find nvm_dev by kaddr %p\n", addr);
+ goto unlock;
+ }
+
+ owner_list = find_owner_list(owner_uuid, false);
+ if (!owner_list) {
+ pr_info("can't find owner(uuid=%s)\n", owner_uuid);
+ goto unlock;
+ }
+
+ alloced_recs = find_nvm_alloced_recs(owner_list, ns, false);
+ if (!alloced_recs) {
+ pr_info("can't find alloced_recs(uuid=%s)\n", ns->uuid);
+ goto unlock;
+ }
+
+ r = remove_extent(alloced_recs, addr, order);
+ if (r < 0) {
+ pr_info("can't find extent\n");
+ goto unlock;
+ }
+
+ __free_space(ns, addr, order);
+
+unlock:
+ mutex_unlock(&only_set->lock);
+}
+EXPORT_SYMBOL_GPL(bch_nvm_free_pages);
+
+
void *bch_nvm_alloc_pages(int order, const char *owner_uuid)
{
void *kaddr = NULL;
@@ -276,7 +405,6 @@ static int init_owner_info(struct bch_nvm_namespace *ns)
for (j = 0; j < only_set->total_namespaces_nr; j++) {
if (!only_set->nss[j] || !owner_head->recs[j])
continue;
-
nvm_pgalloc_recs = (struct bch_nvm_pgalloc_recs *)
((long)owner_head->recs[j] + ns->kaddr);
if (memcmp(nvm_pgalloc_recs->magic, bch_nvm_pages_pgalloc_magic, 16)) {
@@ -348,7 +476,7 @@ static void init_nvm_free_space(struct bch_nvm_namespace *ns)
{
unsigned int start, end, i;
struct page *page;
- long long pages;
+ u64 pages;
pgoff_t pgoff_start;
bitmap_for_each_clear_region(ns->pages_bitmap, start, end, 0, ns->pages_total) {
@@ -364,9 +492,8 @@ static void init_nvm_free_space(struct bch_nvm_namespace *ns)
page = nvm_vaddr_to_page(ns, nvm_pgoff_to_vaddr(ns, pgoff_start));
page->index = pgoff_start;
set_page_private(page, i);
- __SetPageBuddy(page);
- list_add((struct list_head *)&page->zone_device_data, &ns->free_area[i]);
-
+ /* in order to update ns->free */
+ __free_space(ns, nvm_pgoff_to_vaddr(ns, pgoff_start), i);
pgoff_start += 1 << i;
pages -= 1 << i;
}
@@ -530,7 +657,7 @@ struct bch_nvm_namespace *bch_register_namespace(const char *dev_path)
ns->page_size = ns->sb.page_size;
ns->pages_offset = ns->sb.pages_offset;
ns->pages_total = ns->sb.pages_total;
- ns->free = 0;
+ ns->free = 0; /* increased by __free_space() */
ns->bdev = bdev;
ns->nvm_set = only_set;
diff --git a/drivers/md/bcache/nvm-pages.h b/drivers/md/bcache/nvm-pages.h
index 10157d993126..1bc3129f2482 100644
--- a/drivers/md/bcache/nvm-pages.h
+++ b/drivers/md/bcache/nvm-pages.h
@@ -80,6 +80,7 @@ struct bch_nvm_namespace *bch_register_namespace(const char *dev_path);
int bch_nvm_init(void);
void bch_nvm_exit(void);
void *bch_nvm_alloc_pages(int order, const char *owner_uuid);
+void bch_nvm_free_pages(void *addr, int order, const char *owner_uuid);
#else
@@ -98,6 +99,8 @@ static inline void *bch_nvm_alloc_pages(int order, const char *owner_uuid)
return NULL;
}
+static inline void bch_nvm_free_pages(void *addr, int order, const char *owner_uuid) { }
+
#endif /* CONFIG_BCACHE_NVM_PAGES */
#endif /* _BCACHE_NVM_PAGES_H */
--
2.26.2
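[Editor's note] The coalescing loop in __free_space() above locates a block's buddy with `pgoff ^ (1 << order)` and its parent with `pgoff & ~(1 << order)`, merging upward while the buddy is also free. A user-space sketch of just that merge walk, assuming a caller-supplied predicate in place of the PageBuddy()/page->private checks (all names here are illustrative):

```c
#include <assert.h>

#define MAX_ORDER 12  /* stand-in for BCH_MAX_ORDER */

/* demo stand-in for "buddy page is free at this order" */
static int always_free(unsigned long pgoff, int order)
{
    (void)pgoff; (void)order;
    return 1;
}

/* Merge a freed block upward while its buddy is free, mirroring the
 * while (order < BCH_MAX_ORDER - 1) loop in __free_space().
 * Returns the final order; *pgoff is updated to the merged block. */
static int merge_up(unsigned long *pgoff, int order,
                    unsigned long pages_total,
                    int (*is_free)(unsigned long pgoff, int order))
{
    while (order < MAX_ORDER - 1) {
        unsigned long buddy  = *pgoff ^ (1UL << order);
        unsigned long parent = *pgoff & ~(1UL << order);

        /* the merged parent block must fit inside the namespace */
        if (parent + (1UL << (order + 1)) > pages_total)
            break;
        if (!is_free(buddy, order))
            break;
        *pgoff = parent;
        order++;
    }
    return order;
}
```

With every buddy free and 16 pages total, freeing the order-0 page at offset 1 merges all the way up to a single order-4 block at offset 0.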
* [PATCH 12/20] bcache: get allocated pages from specific owner
2021-02-10 5:07 [PATCH 00/20] bcache patches for Linux v5.12 Coly Li
` (10 preceding siblings ...)
2021-02-10 5:07 ` [PATCH 11/20] bcache: bch_nvm_free_pages() " Coly Li
@ 2021-02-10 5:07 ` Coly Li
2021-02-10 5:07 ` [PATCH 13/20] bcache: persist owner info when alloc/free pages Coly Li
` (8 subsequent siblings)
20 siblings, 0 replies; 26+ messages in thread
From: Coly Li @ 2021-02-10 5:07 UTC (permalink / raw)
To: axboe; +Cc: linux-bcache, linux-block, Jianpeng Ma, Qiaowei Ren, Coly Li
From: Jianpeng Ma <jianpeng.ma@intel.com>
This patch implements bch_get_allocated_pages() of the buddy allocator,
which returns the list of allocated page extents belonging to a specific owner.
Signed-off-by: Jianpeng Ma <jianpeng.ma@intel.com>
Co-authored-by: Qiaowei Ren <qiaowei.ren@intel.com>
Signed-off-by: Coly Li <colyli@suse.de>
---
drivers/md/bcache/nvm-pages.c | 39 +++++++++++++++++++++++++++++++++++
drivers/md/bcache/nvm-pages.h | 6 ++++++
2 files changed, 45 insertions(+)
diff --git a/drivers/md/bcache/nvm-pages.c b/drivers/md/bcache/nvm-pages.c
index b40bdbac873f..2b079a277e88 100644
--- a/drivers/md/bcache/nvm-pages.c
+++ b/drivers/md/bcache/nvm-pages.c
@@ -374,6 +374,45 @@ void *bch_nvm_alloc_pages(int order, const char *owner_uuid)
}
EXPORT_SYMBOL_GPL(bch_nvm_alloc_pages);
+struct bch_extent *bch_get_allocated_pages(const char *owner_uuid)
+{
+ struct bch_owner_list *owner_list = find_owner_list(owner_uuid, false);
+ struct bch_nvm_alloced_recs *alloced_recs;
+ struct bch_extent *head = NULL, *e, *tmp;
+ int i;
+
+ if (!owner_list)
+ return NULL;
+
+ for (i = 0; i < only_set->total_namespaces_nr; i++) {
+ struct list_head *l;
+
+ alloced_recs = owner_list->alloced_recs[i];
+
+ if (!alloced_recs || alloced_recs->nr == 0)
+ continue;
+
+ l = alloced_recs->extent_head.next;
+ while (l != &alloced_recs->extent_head) {
+ e = container_of(l, struct bch_extent, list);
+ tmp = kzalloc(sizeof(*tmp), GFP_KERNEL|__GFP_NOFAIL);
+
+ INIT_LIST_HEAD(&tmp->list);
+ tmp->kaddr = e->kaddr;
+ tmp->nr = e->nr;
+
+ if (head)
+ list_add_tail(&tmp->list, &head->list);
+ else
+ head = tmp;
+
+ l = l->next;
+ }
+ }
+ return head;
+}
+EXPORT_SYMBOL_GPL(bch_get_allocated_pages);
+
static int init_owner_info(struct bch_nvm_namespace *ns)
{
struct bch_owner_list_head *owner_list_head;
diff --git a/drivers/md/bcache/nvm-pages.h b/drivers/md/bcache/nvm-pages.h
index 1bc3129f2482..8ffae11c7c61 100644
--- a/drivers/md/bcache/nvm-pages.h
+++ b/drivers/md/bcache/nvm-pages.h
@@ -81,6 +81,7 @@ int bch_nvm_init(void);
void bch_nvm_exit(void);
void *bch_nvm_alloc_pages(int order, const char *owner_uuid);
void bch_nvm_free_pages(void *addr, int order, const char *owner_uuid);
+struct bch_extent *bch_get_allocated_pages(const char *owner_uuid);
#else
@@ -101,6 +102,11 @@ static inline void *bch_nvm_alloc_pages(int order, const char *owner_uuid)
static inline void bch_nvm_free_pages(void *addr, int order, const char *owner_uuid) { }
+static inline struct bch_extent *bch_get_allocated_pages(const char *owner_uuid)
+{
+ return NULL;
+}
+
#endif /* CONFIG_BCACHE_NVM_PAGES */
#endif /* _BCACHE_NVM_PAGES_H */
--
2.26.2
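[Editor's note] bch_get_allocated_pages() above hands the caller a private copy of the per-owner extent list, so the caller can walk it after only_set->lock is dropped. A user-space sketch of that copy-out pattern with a simple singly-linked node (the `struct ext` type and helper are illustrative; the kernel code uses `struct bch_extent` with `list_head` and `__GFP_NOFAIL` allocation):

```c
#include <assert.h>
#include <stdlib.h>

struct ext {
    unsigned long kaddr;
    int nr;
    struct ext *next;
};

/* Duplicate an extent list so the caller owns an independent copy. */
static struct ext *copy_extents(const struct ext *src)
{
    struct ext *head = NULL, **tail = &head;

    for (; src; src = src->next) {
        struct ext *e = malloc(sizeof(*e));

        if (!e)
            break; /* the kernel code avoids this with __GFP_NOFAIL */
        e->kaddr = src->kaddr;
        e->nr = src->nr;
        e->next = NULL;
        *tail = e;
        tail = &e->next;
    }
    return head;
}

/* demo: copy a fixed two-node list and sum the nr fields of the copy */
static int demo_copy_sum(void)
{
    struct ext b = { 0x2000, 4, NULL };
    struct ext a = { 0x1000, 2, &b };
    struct ext *c = copy_extents(&a);
    int sum = 0;

    for (struct ext *e = c; e; e = e->next)
        sum += e->nr;
    return sum;
}
```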
* [PATCH 13/20] bcache: persist owner info when alloc/free pages.
2021-02-10 5:07 [PATCH 00/20] bcache patches for Linux v5.12 Coly Li
` (11 preceding siblings ...)
2021-02-10 5:07 ` [PATCH 12/20] bcache: get allocated pages from specific owner Coly Li
@ 2021-02-10 5:07 ` Coly Li
2021-02-10 5:07 ` [PATCH 14/20] bcache: use bucket index for SET_GC_MARK() in bch_btree_gc_finish() Coly Li
` (7 subsequent siblings)
20 siblings, 0 replies; 26+ messages in thread
From: Coly Li @ 2021-02-10 5:07 UTC (permalink / raw)
To: axboe; +Cc: linux-bcache, linux-block, Jianpeng Ma, Qiaowei Ren, Coly Li
From: Jianpeng Ma <jianpeng.ma@intel.com>
This patch persists the owner info to the NVDIMM device whenever pages
are allocated or freed.
Signed-off-by: Jianpeng Ma <jianpeng.ma@intel.com>
Co-authored-by: Qiaowei Ren <qiaowei.ren@intel.com>
Signed-off-by: Coly Li <colyli@suse.de>
---
drivers/md/bcache/nvm-pages.c | 93 ++++++++++++++++++++++++++++++++++-
1 file changed, 92 insertions(+), 1 deletion(-)
diff --git a/drivers/md/bcache/nvm-pages.c b/drivers/md/bcache/nvm-pages.c
index 2b079a277e88..c350dcd696dd 100644
--- a/drivers/md/bcache/nvm-pages.c
+++ b/drivers/md/bcache/nvm-pages.c
@@ -210,6 +210,19 @@ static struct bch_nvm_namespace *find_nvm_by_addr(void *addr, int order)
return NULL;
}
+static void init_pgalloc_recs(struct bch_nvm_pgalloc_recs *recs, const char *owner_uuid)
+{
+ memset(recs, 0, sizeof(struct bch_nvm_pgalloc_recs));
+ memcpy(recs->magic, bch_nvm_pages_pgalloc_magic, 16);
+ memcpy(recs->owner_uuid, owner_uuid, 16);
+ recs->size = BCH_MAX_RECS;
+}
+
+static pgoff_t vaddr_to_nvm_pgoff(struct bch_nvm_namespace *ns, void *kaddr)
+{
+ return (kaddr - ns->kaddr) / PAGE_SIZE;
+}
+
static int remove_extent(struct bch_nvm_alloced_recs *alloced_recs, void *addr, int order)
{
struct list_head *list = alloced_recs->extent_head.next;
@@ -234,6 +247,82 @@ static int remove_extent(struct bch_nvm_alloced_recs *alloced_recs, void *addr,
return (list == &alloced_recs->extent_head) ? -ENOENT : 0;
}
+#define BCH_RECS_LEN (sizeof(struct bch_nvm_pgalloc_recs))
+
+static void write_owner_info(void)
+{
+ struct bch_owner_list *owner_list;
+ struct bch_nvm_pgalloc_recs *recs;
+ struct bch_nvm_namespace *ns = only_set->nss[0];
+ struct bch_owner_list_head *owner_list_head;
+ struct bch_nvm_pages_owner_head *owner_head;
+ u64 recs_pos = BCH_NVM_PAGES_SYS_RECS_HEAD_OFFSET;
+ struct list_head *list;
+ int i, j;
+
+ owner_list_head = kzalloc(sizeof(*owner_list_head), GFP_KERNEL);
+ recs = kmalloc(sizeof(*recs), GFP_KERNEL);
+ if (!owner_list_head || !recs) {
+ pr_info("can't alloc memory\n");
+ goto free_resouce;
+ }
+
+ owner_list_head->size = BCH_MAX_OWNER_LIST;
+ WARN_ON(only_set->owner_list_used > owner_list_head->size);
+
+ /* in-memory owner lists may not contain allocated pages */
+ for (i = 0; i < only_set->owner_list_used; i++) {
+ owner_head = &owner_list_head->heads[i];
+ owner_list = only_set->owner_lists[i];
+
+ memcpy(owner_head->uuid, owner_list->owner_uuid, 16);
+
+ for (j = 0; j < only_set->total_namespaces_nr; j++) {
+ struct bch_nvm_alloced_recs *extents = owner_list->alloced_recs[j];
+
+ if (!extents || !extents->nr)
+ continue;
+
+ init_pgalloc_recs(recs, owner_list->owner_uuid);
+
+ BUG_ON(recs_pos >= BCH_NVM_PAGES_OFFSET);
+ owner_head->recs[j] = (struct bch_nvm_pgalloc_recs *)(uintptr_t)recs_pos;
+
+ for (list = extents->extent_head.next;
+ list != &extents->extent_head;
+ list = list->next) {
+ struct bch_extent *extent;
+
+ extent = container_of(list, struct bch_extent, list);
+
+ if (recs->used == recs->size) {
+ BUG_ON(recs_pos >= BCH_NVM_PAGES_OFFSET);
+ recs->next = (struct bch_nvm_pgalloc_recs *)
+ (uintptr_t)(recs_pos + BCH_RECS_LEN);
+ memcpy_flushcache(ns->kaddr + recs_pos, recs, BCH_RECS_LEN);
+ init_pgalloc_recs(recs, owner_list->owner_uuid);
+ recs_pos += BCH_RECS_LEN;
+ }
+
+ recs->recs[recs->used].pgoff =
+ vaddr_to_nvm_pgoff(only_set->nss[j], extent->kaddr);
+ recs->recs[recs->used].nr = extent->nr;
+ recs->used++;
+ }
+
+ memcpy_flushcache(ns->kaddr + recs_pos, recs, BCH_RECS_LEN);
+ recs_pos += sizeof(struct bch_nvm_pgalloc_recs);
+ }
+ }
+
+ owner_list_head->used = only_set->owner_list_used;
+ memcpy_flushcache(ns->kaddr + BCH_NVM_PAGES_OWNER_LIST_HEAD_OFFSET,
+ (void *)owner_list_head, sizeof(struct bch_owner_list_head));
+free_resouce:
+ kfree(owner_list_head);
+ kfree(recs);
+}
+
static void __free_space(struct bch_nvm_namespace *ns, void *addr, int order)
{
unsigned int add_pages = (1 << order);
@@ -309,6 +398,7 @@ void bch_nvm_free_pages(void *addr, int order, const char *owner_uuid)
}
__free_space(ns, addr, order);
+ write_owner_info();
unlock:
mutex_unlock(&only_set->lock);
@@ -368,7 +458,8 @@ void *bch_nvm_alloc_pages(int order, const char *owner_uuid)
break;
}
}
-
+ if (kaddr)
+ write_owner_info();
mutex_unlock(&only_set->lock);
return kaddr;
}
--
2.26.2
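[Editor's note] write_owner_info() above streams each owner's extents into fixed-size bch_nvm_pgalloc_recs blocks: when a block fills (`used == recs->size`) it is flushed with memcpy_flushcache() and a fresh block is started, and one final flush writes the last partial block. The resulting block count per namespace is a simple ceiling division, sketched here (`MAX_RECS` is an assumed stand-in for BCH_MAX_RECS):

```c
#include <assert.h>

#define MAX_RECS 8  /* assumed stand-in for BCH_MAX_RECS */

/* Number of fixed-size record blocks needed to persist nr_extents
 * extents, MAX_RECS records per block; zero extents are skipped
 * entirely (the kernel loop skips empty alloced_recs). */
static int recs_blocks_needed(int nr_extents)
{
    if (nr_extents == 0)
        return 0;
    return (nr_extents + MAX_RECS - 1) / MAX_RECS;
}
```

So with a block capacity of 8, eight extents fit in one block and a ninth extent forces a second block, matching the mid-loop flush plus final flush in the patch.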
* [PATCH 14/20] bcache: use bucket index for SET_GC_MARK() in bch_btree_gc_finish()
2021-02-10 5:07 [PATCH 00/20] bcache patches for Linux v5.12 Coly Li
` (12 preceding siblings ...)
2021-02-10 5:07 ` [PATCH 13/20] bcache: persist owner info when alloc/free pages Coly Li
@ 2021-02-10 5:07 ` Coly Li
2021-02-10 5:07 ` [PATCH 15/20] bcache: add BCH_FEATURE_INCOMPAT_NVDIMM_META into incompat feature set Coly Li
` (6 subsequent siblings)
20 siblings, 0 replies; 26+ messages in thread
From: Coly Li @ 2021-02-10 5:07 UTC (permalink / raw)
To: axboe; +Cc: linux-bcache, linux-block, Coly Li, Jianpeng Ma, Qiaowei Ren
Currently the meta data bucket locations on the cache device are still
reserved even when the meta data is stored on NVDIMM pages, to keep the
meta data layout consistent for now. So these buckets are still marked
as meta data by SET_GC_MARK() in bch_btree_gc_finish().
When BCH_FEATURE_INCOMPAT_NVDIMM_META is set, sb.d[] stores linear
addresses of NVDIMM pages and not bucket indexes anymore. Therefore we
should avoid deriving bucket indexes from sb.d[], and directly use the
bucket indexes from ca->sb.first_bucket to (ca->sb.first_bucket +
ca->sb.njournal_buckets) when setting the gc mark of the journal buckets.
Signed-off-by: Coly Li <colyli@suse.de>
Cc: Jianpeng Ma <jianpeng.ma@intel.com>
Cc: Qiaowei Ren <qiaowei.ren@intel.com>
---
drivers/md/bcache/btree.c | 6 ++++--
1 file changed, 4 insertions(+), 2 deletions(-)
diff --git a/drivers/md/bcache/btree.c b/drivers/md/bcache/btree.c
index fe6dce125aba..28edd884bd5d 100644
--- a/drivers/md/bcache/btree.c
+++ b/drivers/md/bcache/btree.c
@@ -1761,8 +1761,10 @@ static void bch_btree_gc_finish(struct cache_set *c)
ca = c->cache;
ca->invalidate_needs_gc = 0;
- for (k = ca->sb.d; k < ca->sb.d + ca->sb.keys; k++)
- SET_GC_MARK(ca->buckets + *k, GC_MARK_METADATA);
+ /* Range [first_bucket, first_bucket + keys) is for journal buckets */
+ for (i = ca->sb.first_bucket;
+ i < ca->sb.first_bucket + ca->sb.njournal_buckets; i++)
+ SET_GC_MARK(ca->buckets + i, GC_MARK_METADATA);
for (k = ca->prio_buckets;
k < ca->prio_buckets + prio_buckets(ca) * 2; k++)
--
2.26.2
* [PATCH 15/20] bcache: add BCH_FEATURE_INCOMPAT_NVDIMM_META into incompat feature set
2021-02-10 5:07 [PATCH 00/20] bcache patches for Linux v5.12 Coly Li
` (13 preceding siblings ...)
2021-02-10 5:07 ` [PATCH 14/20] bcache: use bucket index for SET_GC_MARK() in bch_btree_gc_finish() Coly Li
@ 2021-02-10 5:07 ` Coly Li
2021-02-10 5:07 ` [PATCH 16/20] bcache: initialize bcache journal for NVDIMM meta device Coly Li
` (5 subsequent siblings)
20 siblings, 0 replies; 26+ messages in thread
From: Coly Li @ 2021-02-10 5:07 UTC (permalink / raw)
To: axboe; +Cc: linux-bcache, linux-block, Coly Li, Jianpeng Ma, Qiaowei Ren
This patch adds BCH_FEATURE_INCOMPAT_NVDIMM_META (value 0x0004) into the
incompat feature set. When this bit is set by bcache-tools, it indicates
that the bcache meta data should be stored on a specific NVDIMM meta device.
The bcache meta data mainly includes the journal and btree nodes; when
this bit is set in the incompat feature set, bcache will ask the
nvm-pages allocator for NVDIMM space to store the meta data.
Signed-off-by: Coly Li <colyli@suse.de>
Cc: Jianpeng Ma <jianpeng.ma@intel.com>
Cc: Qiaowei Ren <qiaowei.ren@intel.com>
---
drivers/md/bcache/features.h | 9 +++++++++
1 file changed, 9 insertions(+)
diff --git a/drivers/md/bcache/features.h b/drivers/md/bcache/features.h
index d1c8fd3977fc..333fb5efb6bd 100644
--- a/drivers/md/bcache/features.h
+++ b/drivers/md/bcache/features.h
@@ -17,11 +17,19 @@
#define BCH_FEATURE_INCOMPAT_OBSO_LARGE_BUCKET 0x0001
/* real bucket size is (1 << bucket_size) */
#define BCH_FEATURE_INCOMPAT_LOG_LARGE_BUCKET_SIZE 0x0002
+/* store bcache meta data on nvdimm */
+#define BCH_FEATURE_INCOMPAT_NVDIMM_META 0x0004
#define BCH_FEATURE_COMPAT_SUPP 0
#define BCH_FEATURE_RO_COMPAT_SUPP 0
+#ifdef CONFIG_BCACHE_NVM_PAGES
+#define BCH_FEATURE_INCOMPAT_SUPP (BCH_FEATURE_INCOMPAT_OBSO_LARGE_BUCKET| \
+ BCH_FEATURE_INCOMPAT_LOG_LARGE_BUCKET_SIZE| \
+ BCH_FEATURE_INCOMPAT_NVDIMM_META)
+#else
#define BCH_FEATURE_INCOMPAT_SUPP (BCH_FEATURE_INCOMPAT_OBSO_LARGE_BUCKET| \
BCH_FEATURE_INCOMPAT_LOG_LARGE_BUCKET_SIZE)
+#endif
#define BCH_HAS_COMPAT_FEATURE(sb, mask) \
((sb)->feature_compat & (mask))
@@ -89,6 +97,7 @@ static inline void bch_clear_feature_##name(struct cache_sb *sb) \
BCH_FEATURE_INCOMPAT_FUNCS(obso_large_bucket, OBSO_LARGE_BUCKET);
BCH_FEATURE_INCOMPAT_FUNCS(large_bucket, LOG_LARGE_BUCKET_SIZE);
+BCH_FEATURE_INCOMPAT_FUNCS(nvdimm_meta, NVDIMM_META);
static inline bool bch_has_unknown_compat_features(struct cache_sb *sb)
{
--
2.26.2
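[Editor's note] The point of an *incompat* feature bit like BCH_FEATURE_INCOMPAT_NVDIMM_META is that a kernel must refuse to use the device if any incompat bit outside its supported mask is set. A sketch of that check in the style of the features.h helpers (the function name here is illustrative; the real helper is bch_has_unknown_incompat_features()):

```c
#include <assert.h>
#include <stdint.h>

#define BCH_FEATURE_INCOMPAT_OBSO_LARGE_BUCKET     0x0001
#define BCH_FEATURE_INCOMPAT_LOG_LARGE_BUCKET_SIZE 0x0002
#define BCH_FEATURE_INCOMPAT_NVDIMM_META           0x0004

/* Any incompat bit set outside the supported mask is unknown, and
 * the kernel must refuse to run with such a super block. */
static int has_unknown_incompat(uint64_t feature_incompat, uint64_t supp)
{
    return (feature_incompat & ~supp) != 0;
}
```

This is why the patch gates BCH_FEATURE_INCOMPAT_SUPP on CONFIG_BCACHE_NVM_PAGES: a kernel built without nvm-pages support leaves 0x0004 out of its mask, so a super block with NVDIMM meta data is rejected rather than misread.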
* [PATCH 16/20] bcache: initialize bcache journal for NVDIMM meta device
2021-02-10 5:07 [PATCH 00/20] bcache patches for Linux v5.12 Coly Li
` (14 preceding siblings ...)
2021-02-10 5:07 ` [PATCH 15/20] bcache: add BCH_FEATURE_INCOMPAT_NVDIMM_META into incompat feature set Coly Li
@ 2021-02-10 5:07 ` Coly Li
2021-02-10 5:07 ` [PATCH 17/20] bcache: support storing bcache journal into " Coly Li
` (4 subsequent siblings)
20 siblings, 0 replies; 26+ messages in thread
From: Coly Li @ 2021-02-10 5:07 UTC (permalink / raw)
To: axboe; +Cc: linux-bcache, linux-block, Coly Li, Jianpeng Ma, Qiaowei Ren
The nvm-pages allocator may store and index the NVDIMM pages allocated
for bcache journal. This patch adds the initialization to store bcache
journal space on NVDIMM pages if BCH_FEATURE_INCOMPAT_NVDIMM_META bit is
set by bcache-tools.
If BCH_FEATURE_INCOMPAT_NVDIMM_META is set, get_nvdimm_journal_space()
will return the linear address of NVDIMM pages for bcache journal,
- If there is previously allocated space, find it from nvm-pages owner
list and return to bch_journal_init().
- If there is no previously allocated space, request a new NVDIMM range
from the nvm-pages allocator, and return it to bch_journal_init().
Then in bch_journal_init(), the corresponding NVDIMM linear address is
stored into each journal key sb.d[i], where 'i' is the bucket index
iterating over all journal buckets.
Later when bcache journaling code stores the journaling jset, the target
NVDIMM linear address stored (and updated) in sb.d[i].ptr[0] can be used
directly in memory copy from DRAM pages into NVDIMM pages.
Signed-off-by: Coly Li <colyli@suse.de>
Cc: Jianpeng Ma <jianpeng.ma@intel.com>
Cc: Qiaowei Ren <qiaowei.ren@intel.com>
---
drivers/md/bcache/journal.c | 97 +++++++++++++++++++++++++++++++++++++
drivers/md/bcache/journal.h | 2 +-
drivers/md/bcache/super.c | 16 +++---
3 files changed, 107 insertions(+), 8 deletions(-)
diff --git a/drivers/md/bcache/journal.c b/drivers/md/bcache/journal.c
index c6613e817333..1f16d8e497cf 100644
--- a/drivers/md/bcache/journal.c
+++ b/drivers/md/bcache/journal.c
@@ -9,6 +9,8 @@
#include "btree.h"
#include "debug.h"
#include "extents.h"
+#include "nvm-pages.h"
+#include "features.h"
#include <trace/events/bcache.h>
@@ -982,3 +984,98 @@ int bch_journal_alloc(struct cache_set *c)
return 0;
}
+
+static void *find_journal_nvm_base(struct bch_extent *list, struct cache *ca)
+{
+ void *ret = NULL;
+ struct bch_extent *cur, *next;
+
+ next = list;
+ do {
+ cur = next;
+ /* Match journal area's nvdimm address */
+ if (cur->kaddr == (void *)ca->sb.d[0]) {
+ ret = cur->kaddr;
+ break;
+ }
+ next = list_entry(cur->list.next, struct bch_extent, list);
+ } while (next != list);
+
+ return ret;
+}
+
+static void bch_release_nvm_extent_list(struct bch_extent *list)
+{
+ struct bch_extent *ext;
+ struct list_head *cur, *next;
+
+ list_for_each_safe(cur, next, &list->list) {
+ ext = list_entry(cur, struct bch_extent, list);
+ kfree(ext);
+ }
+}
+
+static void *get_nvdimm_journal_space(struct cache *ca)
+{
+ struct bch_extent *allocated_list = NULL;
+ void *ret = NULL;
+
+ allocated_list = bch_get_allocated_pages(ca->sb.set_uuid);
+ if (allocated_list) {
+ ret = find_journal_nvm_base(allocated_list, ca);
+ bch_release_nvm_extent_list(allocated_list);
+ }
+
+ if (!ret) {
+ int order = ilog2(ca->sb.bucket_size * ca->sb.njournal_buckets /
+ PAGE_SECTORS);
+
+ ret = bch_nvm_alloc_pages(order, ca->sb.set_uuid);
+ if (ret)
+ memset(ret, 0, (1 << order) * PAGE_SIZE);
+ }
+
+ return ret;
+}
+
+static int __bch_journal_nvdimm_init(struct cache *ca)
+{
+ int i, ret = 0;
+ void *journal_nvm_base = NULL;
+
+ journal_nvm_base = get_nvdimm_journal_space(ca);
+ if (!journal_nvm_base) {
+ pr_err("Failed to get journal space from nvdimm\n");
+ ret = -1;
+ goto out;
+ }
+
+ /* Initialized and reloaded from on-disk super block already */
+ if (ca->sb.d[0] != 0)
+ goto out;
+
+ for (i = 0; i < ca->sb.keys; i++)
+ ca->sb.d[i] =
+ (u64)(journal_nvm_base + (ca->sb.bucket_size * i));
+
+out:
+ return ret;
+}
+
+int bch_journal_init(struct cache_set *c)
+{
+ int i, ret = 0;
+ struct cache *ca = c->cache;
+
+ ca->sb.keys = clamp_t(int, ca->sb.nbuckets >> 7,
+ 2, SB_JOURNAL_BUCKETS);
+
+ if (!bch_has_feature_nvdimm_meta(&ca->sb)) {
+ for (i = 0; i < ca->sb.keys; i++)
+ ca->sb.d[i] = ca->sb.first_bucket + i;
+ } else {
+ ret = __bch_journal_nvdimm_init(ca);
+ }
+
+ return ret;
+}
diff --git a/drivers/md/bcache/journal.h b/drivers/md/bcache/journal.h
index f2ea34d5f431..e3a7fa5a8fda 100644
--- a/drivers/md/bcache/journal.h
+++ b/drivers/md/bcache/journal.h
@@ -179,7 +179,7 @@ void bch_journal_mark(struct cache_set *c, struct list_head *list);
void bch_journal_meta(struct cache_set *c, struct closure *cl);
int bch_journal_read(struct cache_set *c, struct list_head *list);
int bch_journal_replay(struct cache_set *c, struct list_head *list);
-
+int bch_journal_init(struct cache_set *c);
void bch_journal_free(struct cache_set *c);
int bch_journal_alloc(struct cache_set *c);
diff --git a/drivers/md/bcache/super.c b/drivers/md/bcache/super.c
index 915f1ea4dfd9..57c96c16ee16 100644
--- a/drivers/md/bcache/super.c
+++ b/drivers/md/bcache/super.c
@@ -146,10 +146,15 @@ static const char *read_super_common(struct cache_sb *sb, struct block_device *
goto err;
err = "Journal buckets not sequential";
+#ifdef CONFIG_BCACHE_NVM_PAGES
+ if (!bch_has_feature_nvdimm_meta(sb)) {
+#endif
for (i = 0; i < sb->keys; i++)
if (sb->d[i] != sb->first_bucket + i)
goto err;
-
+#ifdef CONFIG_BCACHE_NVM_PAGES
+ } /* bch_has_feature_nvdimm_meta */
+#endif
err = "Too many journal buckets";
if (sb->first_bucket + sb->keys > sb->nbuckets)
goto err;
@@ -2072,14 +2077,11 @@ static int run_cache_set(struct cache_set *c)
if (bch_journal_replay(c, &journal))
goto err;
} else {
- unsigned int j;
-
pr_notice("invalidating existing data\n");
- ca->sb.keys = clamp_t(int, ca->sb.nbuckets >> 7,
- 2, SB_JOURNAL_BUCKETS);
- for (j = 0; j < ca->sb.keys; j++)
- ca->sb.d[j] = ca->sb.first_bucket + j;
+ err = "error initializing journal";
+ if (bch_journal_init(c))
+ goto err;
bch_initial_gc_finish(c);
--
2.26.2
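[Editor's note] bch_journal_init() above fills sb.d[] one of two ways: sequential bucket indexes starting at first_bucket (legacy), or NVDIMM linear addresses spaced one bucket apart from a base returned by get_nvdimm_journal_space(). A user-space sketch of that fill loop (the helper name is illustrative, and bucket_size is treated as a byte count here purely for the arithmetic):

```c
#include <assert.h>
#include <stdint.h>

/* Fill the journal key array d[0..keys): either sequential bucket
 * indexes (legacy block-device layout) or NVDIMM linear addresses
 * spaced one bucket apart, as in __bch_journal_nvdimm_init(). */
static void fill_journal_keys(uint64_t *d, int keys,
                              uint64_t first_bucket,
                              uint64_t nvm_base, uint64_t bucket_size,
                              int nvdimm_meta)
{
    for (int i = 0; i < keys; i++)
        d[i] = nvdimm_meta ? nvm_base + bucket_size * i
                           : first_bucket + i;
}
```

In the legacy case this reproduces the `sb->d[i] != sb->first_bucket + i` invariant that read_super_common() checks; in the NVDIMM case that check is skipped, since the entries are addresses, not sequential bucket indexes.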
* [PATCH 17/20] bcache: support storing bcache journal into NVDIMM meta device
2021-02-10 5:07 [PATCH 00/20] bcache patches for Linux v5.12 Coly Li
` (15 preceding siblings ...)
2021-02-10 5:07 ` [PATCH 16/20] bcache: initialize bcache journal for NVDIMM meta device Coly Li
@ 2021-02-10 5:07 ` Coly Li
2021-02-18 21:21 ` Nix
2021-02-10 5:07 ` [PATCH 18/20] bcache: read jset from NVDIMM pages for journal replay Coly Li
` (3 subsequent siblings)
20 siblings, 1 reply; 26+ messages in thread
From: Coly Li @ 2021-02-10 5:07 UTC (permalink / raw)
To: axboe; +Cc: linux-bcache, linux-block, Coly Li, Jianpeng Ma, Qiaowei Ren
This patch implements two methods to store the bcache journal:
1) __journal_write_unlocked() for block interface devices
The legacy method composes a bio and issues the jset bio to the cache
device (e.g. SSD). c->journal.key.ptr[0] indicates the LBA on the cache
device where the journal jset is stored.
2) __journal_nvdimm_write_unlocked() for memory interface NVDIMM
Uses the memory interface to access NVDIMM pages and stores the jset by
memcpy_flushcache(). c->journal.key.ptr[0] indicates the linear
address in the NVDIMM pages where the journal jset is stored.
For legacy configurations without an NVDIMM meta device, journal I/O is
handled by __journal_write_unlocked() with the existing code logic. If an
NVDIMM meta device is used (set up by bcache-tools), journal I/O is
handled by __journal_nvdimm_write_unlocked() and goes to the NVDIMM pages.
When the NVDIMM meta device is used, sb.d[] stores linear addresses
of NVDIMM pages (no longer bucket indexes), so in journal_reclaim() the
journaling location in c->journal.key.ptr[0] should also be updated with a
linear address from the NVDIMM pages (no longer an LBA combined from
sector offset and bucket index).
Signed-off-by: Coly Li <colyli@suse.de>
Cc: Jianpeng Ma <jianpeng.ma@intel.com>
Cc: Qiaowei Ren <qiaowei.ren@intel.com>
---
drivers/md/bcache/journal.c | 111 ++++++++++++++++++++++++------------
1 file changed, 75 insertions(+), 36 deletions(-)
diff --git a/drivers/md/bcache/journal.c b/drivers/md/bcache/journal.c
index 1f16d8e497cf..b242fcb47ce2 100644
--- a/drivers/md/bcache/journal.c
+++ b/drivers/md/bcache/journal.c
@@ -596,6 +596,8 @@ static void do_journal_discard(struct cache *ca)
return;
}
+ BUG_ON(bch_has_feature_nvdimm_meta(&ca->sb));
+
switch (atomic_read(&ja->discard_in_flight)) {
case DISCARD_IN_FLIGHT:
return;
@@ -661,9 +663,13 @@ static void journal_reclaim(struct cache_set *c)
goto out;
ja->cur_idx = next;
- k->ptr[0] = MAKE_PTR(0,
- bucket_to_sector(c, ca->sb.d[ja->cur_idx]),
- ca->sb.nr_this_dev);
+ if (!bch_has_feature_nvdimm_meta(&ca->sb))
+ k->ptr[0] = MAKE_PTR(0,
+ bucket_to_sector(c, ca->sb.d[ja->cur_idx]),
+ ca->sb.nr_this_dev);
+ else
+ k->ptr[0] = ca->sb.d[ja->cur_idx];
+
atomic_long_inc(&c->reclaimed_journal_buckets);
bkey_init(k);
@@ -729,46 +735,21 @@ static void journal_write_unlock(struct closure *cl)
spin_unlock(&c->journal.lock);
}
-static void journal_write_unlocked(struct closure *cl)
+
+static void __journal_write_unlocked(struct cache_set *c)
__releases(c->journal.lock)
{
- struct cache_set *c = container_of(cl, struct cache_set, journal.io);
- struct cache *ca = c->cache;
- struct journal_write *w = c->journal.cur;
struct bkey *k = &c->journal.key;
- unsigned int i, sectors = set_blocks(w->data, block_bytes(ca)) *
- ca->sb.block_size;
-
+ struct journal_write *w = c->journal.cur;
+ struct closure *cl = &c->journal.io;
+ struct cache *ca = c->cache;
struct bio *bio;
struct bio_list list;
+ unsigned int i, sectors = set_blocks(w->data, block_bytes(ca)) *
+ ca->sb.block_size;
bio_list_init(&list);
- if (!w->need_write) {
- closure_return_with_destructor(cl, journal_write_unlock);
- return;
- } else if (journal_full(&c->journal)) {
- journal_reclaim(c);
- spin_unlock(&c->journal.lock);
-
- btree_flush_write(c);
- continue_at(cl, journal_write, bch_journal_wq);
- return;
- }
-
- c->journal.blocks_free -= set_blocks(w->data, block_bytes(ca));
-
- w->data->btree_level = c->root->level;
-
- bkey_copy(&w->data->btree_root, &c->root->key);
- bkey_copy(&w->data->uuid_bucket, &c->uuid_bucket);
-
- w->data->prio_bucket[ca->sb.nr_this_dev] = ca->prio_buckets[0];
- w->data->magic = jset_magic(&ca->sb);
- w->data->version = BCACHE_JSET_VERSION;
- w->data->last_seq = last_seq(&c->journal);
- w->data->csum = csum_set(w->data);
-
for (i = 0; i < KEY_PTRS(k); i++) {
ca = PTR_CACHE(c, k, i);
bio = &ca->journal.bio;
@@ -793,7 +774,6 @@ static void journal_write_unlocked(struct closure *cl)
ca->journal.seq[ca->journal.cur_idx] = w->data->seq;
}
-
/* If KEY_PTRS(k) == 0, this jset gets lost in air */
BUG_ON(i == 0);
@@ -805,6 +785,65 @@ static void journal_write_unlocked(struct closure *cl)
while ((bio = bio_list_pop(&list)))
closure_bio_submit(c, bio, cl);
+}
+
+static void __journal_nvdimm_write_unlocked(struct cache_set *c)
+ __releases(c->journal.lock)
+{
+ struct journal_write *w = c->journal.cur;
+ struct cache *ca = c->cache;
+ unsigned int sectors;
+
+ sectors = set_blocks(w->data, block_bytes(ca)) * ca->sb.block_size;
+ atomic_long_add(sectors, &ca->meta_sectors_written);
+
+ memcpy_flushcache((void *)c->journal.key.ptr[0], w->data, sectors << 9);
+
+ c->journal.key.ptr[0] += sectors << 9;
+ ca->journal.seq[ca->journal.cur_idx] = w->data->seq;
+
+ atomic_dec_bug(&fifo_back(&c->journal.pin));
+ bch_journal_next(&c->journal);
+ journal_reclaim(c);
+
+ spin_unlock(&c->journal.lock);
+}
+
+static void journal_write_unlocked(struct closure *cl)
+{
+ struct cache_set *c = container_of(cl, struct cache_set, journal.io);
+ struct cache *ca = c->cache;
+ struct journal_write *w = c->journal.cur;
+
+ if (!w->need_write) {
+ closure_return_with_destructor(cl, journal_write_unlock);
+ return;
+ } else if (journal_full(&c->journal)) {
+ journal_reclaim(c);
+ spin_unlock(&c->journal.lock);
+
+ btree_flush_write(c);
+ continue_at(cl, journal_write, bch_journal_wq);
+ return;
+ }
+
+ c->journal.blocks_free -= set_blocks(w->data, block_bytes(ca));
+
+ w->data->btree_level = c->root->level;
+
+ bkey_copy(&w->data->btree_root, &c->root->key);
+ bkey_copy(&w->data->uuid_bucket, &c->uuid_bucket);
+
+ w->data->prio_bucket[ca->sb.nr_this_dev] = ca->prio_buckets[0];
+ w->data->magic = jset_magic(&ca->sb);
+ w->data->version = BCACHE_JSET_VERSION;
+ w->data->last_seq = last_seq(&c->journal);
+ w->data->csum = csum_set(w->data);
+
+ if (!bch_has_feature_nvdimm_meta(&ca->sb))
+ __journal_write_unlocked(c);
+ else
+ __journal_nvdimm_write_unlocked(c);
continue_at(cl, journal_write_done, NULL);
}
--
2.26.2
^ permalink raw reply related [flat|nested] 26+ messages in thread
* Re: [PATCH 17/20] bcache: support storing bcache journal into NVDIMM meta device
2021-02-10 5:07 ` [PATCH 17/20] bcache: support storing bcache journal into " Coly Li
@ 2021-02-18 21:21 ` Nix
0 siblings, 0 replies; 26+ messages in thread
From: Nix @ 2021-02-18 21:21 UTC (permalink / raw)
To: Coly Li; +Cc: axboe, linux-bcache, linux-block, Jianpeng Ma, Qiaowei Ren
On 10 Feb 2021, Coly Li uttered the following:
> This patch implements two methods to store bcache journal to,
> 1) __journal_write_unlocked() for block interface device
> The latency method to compose bio and issue the jset bio to cache
Is this really 'latency'? I suspect from other patches it should be
'legacy', which is surely not true unless the expectation is that soon
all bcache users will have NVDIMMs and can use the other path (surely
not).
> For lagency
This non-word should possibly be 'legacy' too?
^ permalink raw reply [flat|nested] 26+ messages in thread
* [PATCH 18/20] bcache: read jset from NVDIMM pages for journal replay
2021-02-10 5:07 [PATCH 00/20] bcache patches for Linux v5.12 Coly Li
` (16 preceding siblings ...)
2021-02-10 5:07 ` [PATCH 17/20] bcache: support storing bcache journal into " Coly Li
@ 2021-02-10 5:07 ` Coly Li
2021-02-10 5:07 ` [PATCH 19/20] bcache: add sysfs interface register_nvdimm_meta to register NVDIMM meta device Coly Li
` (2 subsequent siblings)
20 siblings, 0 replies; 26+ messages in thread
From: Coly Li @ 2021-02-10 5:07 UTC (permalink / raw)
To: axboe; +Cc: linux-bcache, linux-block, Coly Li, Jianpeng Ma, Qiaowei Ren
This patch implements two methods to read jset from media for journal
replay,
- __jnl_rd_bkt() for block device
This is the legacy method to read jset via block device interface.
- __jnl_rd_nvm_bkt() for NVDIMM
This is the method to read the jset via the NVDIMM memory interface,
i.e. memcpy() from NVDIMM pages to DRAM pages.
If BCH_FEATURE_INCOMPAT_NVDIMM_META is set in the incompat feature set,
journal_read_bucket() will read the journal content from NVDIMM by
__jnl_rd_nvm_bkt() while the cache set is running. The linear addresses
of the NVDIMM pages to read the jset from are stored in
sb.d[SB_JOURNAL_BUCKETS], which were initialized and maintained in
previous runs of the cache set.
One thing to note is that when bch_journal_read() is called, the linear
addresses of the NVDIMM pages are not loaded and initialized yet, so it
is necessary to call __bch_journal_nvdimm_init() before reading the jset
from the NVDIMM pages.
Signed-off-by: Coly Li <colyli@suse.de>
Cc: Jianpeng Ma <jianpeng.ma@intel.com>
Cc: Qiaowei Ren <qiaowei.ren@intel.com>
---
drivers/md/bcache/journal.c | 81 ++++++++++++++++++++++++++-----------
1 file changed, 57 insertions(+), 24 deletions(-)
diff --git a/drivers/md/bcache/journal.c b/drivers/md/bcache/journal.c
index b242fcb47ce2..8d08627f5a89 100644
--- a/drivers/md/bcache/journal.c
+++ b/drivers/md/bcache/journal.c
@@ -34,60 +34,84 @@ static void journal_read_endio(struct bio *bio)
closure_put(cl);
}
+static struct jset *__jnl_rd_bkt(struct cache *ca, unsigned int bkt_idx,
+ unsigned int len, unsigned int offset,
+ struct closure *cl)
+{
+ sector_t bucket = bucket_to_sector(ca->set, ca->sb.d[bkt_idx]);
+ struct bio *bio = &ca->journal.bio;
+ struct jset *data = ca->set->journal.w[0].data;
+
+ bio_reset(bio);
+ bio->bi_iter.bi_sector = bucket + offset;
+ bio_set_dev(bio, ca->bdev);
+ bio->bi_iter.bi_size = len << 9;
+ bio->bi_end_io = journal_read_endio;
+ bio->bi_private = cl;
+ bio_set_op_attrs(bio, REQ_OP_READ, 0);
+ bch_bio_map(bio, data);
+
+ closure_bio_submit(ca->set, bio, cl);
+ closure_sync(cl);
+
+ /* Indeed journal.w[0].data */
+ return data;
+}
+
+static struct jset *__jnl_rd_nvm_bkt(struct cache *ca, unsigned int bkt_idx,
+ unsigned int len, unsigned int offset)
+{
+ void *jset_addr = (void *)ca->sb.d[bkt_idx] + (offset << 9);
+ struct jset *data = ca->set->journal.w[0].data;
+
+ memcpy(data, jset_addr, len << 9);
+
+ /* Indeed journal.w[0].data */
+ return data;
+}
+
static int journal_read_bucket(struct cache *ca, struct list_head *list,
- unsigned int bucket_index)
+ unsigned int bucket_idx)
{
struct journal_device *ja = &ca->journal;
- struct bio *bio = &ja->bio;
struct journal_replay *i;
- struct jset *j, *data = ca->set->journal.w[0].data;
+ struct jset *j;
struct closure cl;
unsigned int len, left, offset = 0;
int ret = 0;
- sector_t bucket = bucket_to_sector(ca->set, ca->sb.d[bucket_index]);
closure_init_stack(&cl);
- pr_debug("reading %u\n", bucket_index);
+ pr_debug("reading %u\n", bucket_idx);
while (offset < ca->sb.bucket_size) {
reread: left = ca->sb.bucket_size - offset;
len = min_t(unsigned int, left, PAGE_SECTORS << JSET_BITS);
- bio_reset(bio);
- bio->bi_iter.bi_sector = bucket + offset;
- bio_set_dev(bio, ca->bdev);
- bio->bi_iter.bi_size = len << 9;
-
- bio->bi_end_io = journal_read_endio;
- bio->bi_private = &cl;
- bio_set_op_attrs(bio, REQ_OP_READ, 0);
- bch_bio_map(bio, data);
-
- closure_bio_submit(ca->set, bio, &cl);
- closure_sync(&cl);
+ if (!bch_has_feature_nvdimm_meta(&ca->sb))
+ j = __jnl_rd_bkt(ca, bucket_idx, len, offset, &cl);
+ else
+ j = __jnl_rd_nvm_bkt(ca, bucket_idx, len, offset);
/* This function could be simpler now since we no longer write
* journal entries that overlap bucket boundaries; this means
* the start of a bucket will always have a valid journal entry
* if it has any journal entries at all.
*/
-
- j = data;
while (len) {
struct list_head *where;
size_t blocks, bytes = set_bytes(j);
if (j->magic != jset_magic(&ca->sb)) {
- pr_debug("%u: bad magic\n", bucket_index);
+ pr_debug("%u: bad magic\n", bucket_idx);
return ret;
}
if (bytes > left << 9 ||
bytes > PAGE_SIZE << JSET_BITS) {
pr_info("%u: too big, %zu bytes, offset %u\n",
- bucket_index, bytes, offset);
+ bucket_idx, bytes, offset);
return ret;
}
@@ -96,7 +120,7 @@ reread: left = ca->sb.bucket_size - offset;
if (j->csum != csum_set(j)) {
pr_info("%u: bad csum, %zu bytes, offset %u\n",
- bucket_index, bytes, offset);
+ bucket_idx, bytes, offset);
return ret;
}
@@ -158,8 +182,8 @@ reread: left = ca->sb.bucket_size - offset;
list_add(&i->list, where);
ret = 1;
- if (j->seq > ja->seq[bucket_index])
- ja->seq[bucket_index] = j->seq;
+ if (j->seq > ja->seq[bucket_idx])
+ ja->seq[bucket_idx] = j->seq;
next_set:
offset += blocks * ca->sb.block_size;
len -= blocks * ca->sb.block_size;
@@ -170,6 +194,8 @@ reread: left = ca->sb.bucket_size - offset;
return ret;
}
+static int __bch_journal_nvdimm_init(struct cache *ca);
+
int bch_journal_read(struct cache_set *c, struct list_head *list)
{
#define read_bucket(b) \
@@ -188,6 +214,13 @@ int bch_journal_read(struct cache_set *c, struct list_head *list)
unsigned int i, l, r, m;
uint64_t seq;
+ /*
+ * Linear addresses of NVDIMM pages for journaling are not
+ * initialized yet, do it before reading jsets from NVDIMM pages.
+ */
+ if (bch_has_feature_nvdimm_meta(&ca->sb))
+ __bch_journal_nvdimm_init(ca);
+
bitmap_zero(bitmap, SB_JOURNAL_BUCKETS);
pr_debug("%u journal buckets\n", ca->sb.njournal_buckets);
--
2.26.2
^ permalink raw reply related [flat|nested] 26+ messages in thread
* [PATCH 19/20] bcache: add sysfs interface register_nvdimm_meta to register NVDIMM meta device
2021-02-10 5:07 [PATCH 00/20] bcache patches for Linux v5.12 Coly Li
` (17 preceding siblings ...)
2021-02-10 5:07 ` [PATCH 18/20] bcache: read jset from NVDIMM pages for journal replay Coly Li
@ 2021-02-10 5:07 ` Coly Li
2021-02-10 5:07 ` [PATCH 20/20] bcache: only initialize nvm-pages allocator when CONFIG_BCACHE_NVM_PAGES configured Coly Li
2021-02-10 15:11 ` [PATCH 00/20] bcache patches for Linux v5.12 Jens Axboe
20 siblings, 0 replies; 26+ messages in thread
From: Coly Li @ 2021-02-10 5:07 UTC (permalink / raw)
To: axboe; +Cc: linux-bcache, linux-block, Coly Li, Jianpeng Ma, Qiaowei Ren
This patch adds a sysfs interface register_nvdimm_meta to register
NVDIMM meta device. The sysfs interface file only shows up when
CONFIG_BCACHE_NVM_PAGES=y. Then an NVDIMM namespace formatted by
bcache-tools can be registered into bcache, e.g.,
echo /dev/pmem0 > /sys/fs/bcache/register_nvdimm_meta
Signed-off-by: Coly Li <colyli@suse.de>
Cc: Jianpeng Ma <jianpeng.ma@intel.com>
Cc: Qiaowei Ren <qiaowei.ren@intel.com>
---
drivers/md/bcache/super.c | 29 +++++++++++++++++++++++++++++
1 file changed, 29 insertions(+)
diff --git a/drivers/md/bcache/super.c b/drivers/md/bcache/super.c
index 57c96c16ee16..61fd5802a627 100644
--- a/drivers/md/bcache/super.c
+++ b/drivers/md/bcache/super.c
@@ -2415,10 +2415,18 @@ static ssize_t register_bcache(struct kobject *k, struct kobj_attribute *attr,
static ssize_t bch_pending_bdevs_cleanup(struct kobject *k,
struct kobj_attribute *attr,
const char *buffer, size_t size);
+#ifdef CONFIG_BCACHE_NVM_PAGES
+static ssize_t register_nvdimm_meta(struct kobject *k,
+ struct kobj_attribute *attr,
+ const char *buffer, size_t size);
+#endif
kobj_attribute_write(register, register_bcache);
kobj_attribute_write(register_quiet, register_bcache);
kobj_attribute_write(pendings_cleanup, bch_pending_bdevs_cleanup);
+#ifdef CONFIG_BCACHE_NVM_PAGES
+kobj_attribute_write(register_nvdimm_meta, register_nvdimm_meta);
+#endif
static bool bch_is_open_backing(dev_t dev)
{
@@ -2532,6 +2540,24 @@ static void register_device_async(struct async_reg_args *args)
queue_delayed_work(system_wq, &args->reg_work, 10);
}
+#ifdef CONFIG_BCACHE_NVM_PAGES
+static ssize_t register_nvdimm_meta(struct kobject *k, struct kobj_attribute *attr,
+ const char *buffer, size_t size)
+{
+ ssize_t ret = size;
+
+ struct bch_nvm_namespace *ns = bch_register_namespace(buffer);
+
+ if (IS_ERR(ns)) {
+ pr_err("register nvdimm namespace %s for meta device failed.\n",
+ buffer);
+ ret = -EINVAL;
+ }
+
+ return ret;
+}
+#endif
+
static ssize_t register_bcache(struct kobject *k, struct kobj_attribute *attr,
const char *buffer, size_t size)
{
@@ -2867,6 +2893,9 @@ static int __init bcache_init(void)
static const struct attribute *files[] = {
&ksysfs_register.attr,
&ksysfs_register_quiet.attr,
+#ifdef CONFIG_BCACHE_NVM_PAGES
+ &ksysfs_register_nvdimm_meta.attr,
+#endif
&ksysfs_pendings_cleanup.attr,
NULL
};
--
2.26.2
^ permalink raw reply related [flat|nested] 26+ messages in thread
* [PATCH 20/20] bcache: only initialize nvm-pages allocator when CONFIG_BCACHE_NVM_PAGES configured
2021-02-10 5:07 [PATCH 00/20] bcache patches for Linux v5.12 Coly Li
` (18 preceding siblings ...)
2021-02-10 5:07 ` [PATCH 19/20] bcache: add sysfs interface register_nvdimm_meta to register NVDIMM meta device Coly Li
@ 2021-02-10 5:07 ` Coly Li
2021-02-10 15:11 ` [PATCH 00/20] bcache patches for Linux v5.12 Jens Axboe
20 siblings, 0 replies; 26+ messages in thread
From: Coly Li @ 2021-02-10 5:07 UTC (permalink / raw)
To: axboe; +Cc: linux-bcache, linux-block, Coly Li, Jianpeng Ma, Qiaowei Ren
It is unnecessary to initialize the EXPERIMENTAL nvm-pages allocator
when CONFIG_BCACHE_NVM_PAGES is not configured. This patch uses
"#ifdef CONFIG_BCACHE_NVM_PAGES" to wrap bch_nvm_init() and
bch_nvm_exit(), and only calls them when CONFIG_BCACHE_NVM_PAGES is configured.
Signed-off-by: Coly Li <colyli@suse.de>
Cc: Jianpeng Ma <jianpeng.ma@intel.com>
Cc: Qiaowei Ren <qiaowei.ren@intel.com>
---
drivers/md/bcache/super.c | 4 ++++
1 file changed, 4 insertions(+)
diff --git a/drivers/md/bcache/super.c b/drivers/md/bcache/super.c
index 61fd5802a627..c273eeef0d38 100644
--- a/drivers/md/bcache/super.c
+++ b/drivers/md/bcache/super.c
@@ -2845,7 +2845,9 @@ static void bcache_exit(void)
{
bch_debug_exit();
bch_request_exit();
+#ifdef CONFIG_BCACHE_NVM_PAGES
bch_nvm_exit();
+#endif
if (bcache_kobj)
kobject_put(bcache_kobj);
if (bcache_wq)
@@ -2947,7 +2949,9 @@ static int __init bcache_init(void)
bch_debug_init();
closure_debug_init();
+#ifdef CONFIG_BCACHE_NVM_PAGES
bch_nvm_init();
+#endif
bcache_is_reboot = false;
--
2.26.2
^ permalink raw reply related [flat|nested] 26+ messages in thread
* Re: [PATCH 00/20] bcache patches for Linux v5.12
2021-02-10 5:07 [PATCH 00/20] bcache patches for Linux v5.12 Coly Li
` (19 preceding siblings ...)
2021-02-10 5:07 ` [PATCH 20/20] bcache: only initialize nvm-pages allocator when CONFIG_BCACHE_NVM_PAGES configured Coly Li
@ 2021-02-10 15:11 ` Jens Axboe
2021-02-12 16:09 ` Coly Li
20 siblings, 1 reply; 26+ messages in thread
From: Jens Axboe @ 2021-02-10 15:11 UTC (permalink / raw)
To: Coly Li; +Cc: linux-bcache, linux-block, Jianpeng Ma, Qiaowei Ren, Kai Krakow
On 2/9/21 10:07 PM, Coly Li wrote:
> Hi Jens,
>
> This is the first wave of bcache patches for Linux v5.12.
>
> It is nice to see in this round we have 3 new patch contributors:
> Jianpeng Ma, Qiaowei Ren and Kai Krakow.
>
> In this series, the EXPERIMENTAL patches from Jianpeng Ma, Qiaowei Ren
> and me are an initial effort to store bcache meta-data on an NVDIMM
> namespace. The NVDIMM space is managed and mapped via the DAX interface,
> and accessed by linear address. In this submission we store the bcache
> journal on NVDIMM; in the future bcache btree nodes and other meta-data
> will be added too, before we remove the EXPERIMENTAL status.
>
> Dongdong Tao contributes a performance optimization for the case when
> bcache cache buckets are highly fragmented. Dongdong's patch makes the
> dirty data writeback faster, and his benchmark reports show recognizable
> improvement in random write I/O throughput and latency for highly
> fragmented buckets, with no regression observed for regular I/O.
>
> Kai Krakow contributes 4 patches to offload system_wq usage to separate
> btree_io_wq and bch_flush_wq. In his environment the daily backup job
> throughput increases from 60.2MB/s to 419MB/s, and the completion time
> is reduced from 14h29m to 2h13m.
>
> Joe Perches also contributes a fine code style fix which I picked for
> this submission.
>
> Please take them for Linux v5.12 merge window.
Applied 1-6 for now, that weird situation with the user visible header
needs to get resolved before it can go any further.
--
Jens Axboe
^ permalink raw reply [flat|nested] 26+ messages in thread
* Re: [PATCH 00/20] bcache patches for Linux v5.12
2021-02-10 15:11 ` [PATCH 00/20] bcache patches for Linux v5.12 Jens Axboe
@ 2021-02-12 16:09 ` Coly Li
0 siblings, 0 replies; 26+ messages in thread
From: Coly Li @ 2021-02-12 16:09 UTC (permalink / raw)
To: Jens Axboe
Cc: linux-bcache, linux-block, Jianpeng Ma, Qiaowei Ren, Kai Krakow
On 2/10/21 11:11 PM, Jens Axboe wrote:
> On 2/9/21 10:07 PM, Coly Li wrote:
>> Hi Jens,
>>
>> This is the first wave of bcache patches for Linux v5.12.
>>
>> It is nice to see in this round we have 3 new patch contributors:
>> Jianpeng Ma, Qiaowei Ren and Kai Krakow.
>>
>> In this series, the EXPERIMENTAL patches from Jianpeng Ma, Qiaowei Ren
>> and me are an initial effort to store bcache meta-data on an NVDIMM
>> namespace. The NVDIMM space is managed and mapped via the DAX interface,
>> and accessed by linear address. In this submission we store the bcache
>> journal on NVDIMM; in the future bcache btree nodes and other meta-data
>> will be added too, before we remove the EXPERIMENTAL status.
>>
>> Dongdong Tao contributes a performance optimization for the case when
>> bcache cache buckets are highly fragmented. Dongdong's patch makes the
>> dirty data writeback faster, and his benchmark reports show recognizable
>> improvement in random write I/O throughput and latency for highly
>> fragmented buckets, with no regression observed for regular I/O.
>>
>> Kai Krakow contributes 4 patches to offload system_wq usage to separate
>> btree_io_wq and bch_flush_wq. In his environment the daily backup job
>> throughput increases from 60.2MB/s to 419MB/s, and the completion time
>> is reduced from 14h29m to 2h13m.
>>
>> Joe Perches also contributes a fine code style fix which I picked for
>> this submission.
>>
>> Please take them for Linux v5.12 merge window.
>
> Applied 1-6 for now, that weird situation with the user visible header
> needs to get resolved before it can go any further.
>
Thanks for taking care of the patches and offering your opinion. I will
ask you and other developers for suggestions on a proper form for the
data structure definition.
Coly Li
^ permalink raw reply [flat|nested] 26+ messages in thread