linux-block.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* [PATCH 00/18] bcache patches for Linux v5.2
@ 2019-04-24 16:48 Coly Li
  2019-04-24 16:48 ` [PATCH 01/18] bcache: fix crashes stopping bcache device before read miss done Coly Li
                   ` (18 more replies)
  0 siblings, 19 replies; 21+ messages in thread
From: Coly Li @ 2019-04-24 16:48 UTC (permalink / raw)
  To: axboe; +Cc: linux-bcache, linux-block, Coly Li

Hi Jens,

Here are the bcache patches for Linux v5.2.

Most of the patches are bug fixes. We don't introduce new feature in
this run, people continue to focus on code quality improvement for now.

Juhui Tang and I make some fixes related to bcache journal, the changes
include,
- Fix KEY_PTRS set to 0 error under very heavy small I/O pressure, to
  avoid potential missing for journal and uuid update to cache device.
- Add code comments to make code more understandable.
- Minor code cleanup.

Liang Chen, Guoju Fang and Shenghui Wang contribute useful and important
fixes for,
- Avoid a crash in cached_dev_read_done() by adding reference before
  calling bio_complete()
- Improve unused buckets statistic
- Fix a race between cache register and unregister
- Potential use-after-free and memory leak bugs

Arnd Bergmann and George Spelvin also contriubte useful code clean up
to fix clang warning and make bch_get_congested() more clear.

Please pick them for the 5.2 merge window.

Thanks you in advance.

Coly Li
---

Arnd Bergmann (1):
  bcache: avoid clang -Wunintialized warning

Coly Li (9):
  bcache: move definition of 'int ret' out of macro read_bucket()
  bcache: never set KEY_PTRS of jouranl key to 0 in journal_reclaim()
  bcache: add failure check to run_cache_set() for journal replay
  bcache: add comments for kobj release callback routine
  bcache: return error immediately in bch_journal_replay()
  bcache: add error check for calling register_bdev()
  bcache: Add comments for blkdev_put() in registration code path
  bcache: add comments for closure_fn to be called in closure_queue()
  bcache: improve bcache_reboot()

Geliang Tang (1):
  bcache: use kmemdup_nul for CACHED_LABEL buffer

George Spelvin (1):
  bcache: Clean up bch_get_congested()

Guoju Fang (2):
  bcache: fix crashes stopping bcache device before read miss done
  bcache: fix inaccurate result of unused buckets

Liang Chen (1):
  bcache: fix a race between cache register and cacheset unregister

Shenghui Wang (2):
  bcache: fix wrong usage use-after-freed on keylist in out_nocoalesce
    branch of btree_gc_coalesce
  bcache: avoid potential memleak of list of journal_replay(s) in the
    CACHE_SYNC branch of run_cache_set

Tang Junhui (1):
  bcache: fix failure in journal relplay

 drivers/md/bcache/alloc.c   |  5 +--
 drivers/md/bcache/btree.c   |  2 +-
 drivers/md/bcache/journal.c | 42 ++++++++++++++++++-----
 drivers/md/bcache/request.c | 41 +++++++++++++++-------
 drivers/md/bcache/request.h |  2 +-
 drivers/md/bcache/super.c   | 83 ++++++++++++++++++++++++++++++++++-----------
 drivers/md/bcache/sysfs.c   |  2 --
 drivers/md/bcache/util.h    | 26 ++++++++++----
 8 files changed, 149 insertions(+), 54 deletions(-)

-- 
2.16.4


^ permalink raw reply	[flat|nested] 21+ messages in thread

* [PATCH 01/18] bcache: fix crashes stopping bcache device before read miss done
  2019-04-24 16:48 [PATCH 00/18] bcache patches for Linux v5.2 Coly Li
@ 2019-04-24 16:48 ` Coly Li
  2019-04-24 16:48 ` [PATCH 02/18] bcache: fix inaccurate result of unused buckets Coly Li
                   ` (17 subsequent siblings)
  18 siblings, 0 replies; 21+ messages in thread
From: Coly Li @ 2019-04-24 16:48 UTC (permalink / raw)
  To: axboe; +Cc: linux-bcache, linux-block, Guoju Fang, Coly Li

From: Guoju Fang <fangguoju@gmail.com>

The bio from upper layer is considered completed when bio_complete()
returns. In most scenarios bio_complete() is called in search_free(),
but when read miss happens, the bio_compete() is called when backing
device reading completed, while the struct search is still in use until
cache inserting finished.

If someone stops the bcache device just then, the device may be closed
and released, but after cache inserting finished the struct search will
access a freed struct cached_dev.

This patch add the reference of bcache device before bio_complete() when
read miss happens, and put it after the search is not used.

Signed-off-by: Guoju Fang <fangguoju@gmail.com>
Signed-off-by: Coly Li <colyli@suse.de>
---
 drivers/md/bcache/request.c | 26 +++++++++++++++++++++-----
 1 file changed, 21 insertions(+), 5 deletions(-)

diff --git a/drivers/md/bcache/request.c b/drivers/md/bcache/request.c
index f101bfe8657a..f11123079fe0 100644
--- a/drivers/md/bcache/request.c
+++ b/drivers/md/bcache/request.c
@@ -706,14 +706,14 @@ static void search_free(struct closure *cl)
 {
 	struct search *s = container_of(cl, struct search, cl);
 
-	atomic_dec(&s->d->c->search_inflight);
+	atomic_dec(&s->iop.c->search_inflight);
 
 	if (s->iop.bio)
 		bio_put(s->iop.bio);
 
 	bio_complete(s);
 	closure_debug_destroy(cl);
-	mempool_free(s, &s->d->c->search);
+	mempool_free(s, &s->iop.c->search);
 }
 
 static inline struct search *search_alloc(struct bio *bio,
@@ -756,13 +756,13 @@ static void cached_dev_bio_complete(struct closure *cl)
 	struct search *s = container_of(cl, struct search, cl);
 	struct cached_dev *dc = container_of(s->d, struct cached_dev, disk);
 
-	search_free(cl);
 	cached_dev_put(dc);
+	search_free(cl);
 }
 
 /* Process reads */
 
-static void cached_dev_cache_miss_done(struct closure *cl)
+static void cached_dev_read_error_done(struct closure *cl)
 {
 	struct search *s = container_of(cl, struct search, cl);
 
@@ -800,7 +800,22 @@ static void cached_dev_read_error(struct closure *cl)
 		closure_bio_submit(s->iop.c, bio, cl);
 	}
 
-	continue_at(cl, cached_dev_cache_miss_done, NULL);
+	continue_at(cl, cached_dev_read_error_done, NULL);
+}
+
+static void cached_dev_cache_miss_done(struct closure *cl)
+{
+	struct search *s = container_of(cl, struct search, cl);
+	struct bcache_device *d = s->d;
+
+	if (s->iop.replace_collision)
+		bch_mark_cache_miss_collision(s->iop.c, s->d);
+
+	if (s->iop.bio)
+		bio_free_pages(s->iop.bio);
+
+	cached_dev_bio_complete(cl);
+	closure_put(&d->cl);
 }
 
 static void cached_dev_read_done(struct closure *cl)
@@ -833,6 +848,7 @@ static void cached_dev_read_done(struct closure *cl)
 	if (verify(dc) && s->recoverable && !s->read_dirty_data)
 		bch_data_verify(dc, s->orig_bio);
 
+	closure_get(&dc->disk.cl);
 	bio_complete(s);
 
 	if (s->iop.bio &&
-- 
2.16.4


^ permalink raw reply related	[flat|nested] 21+ messages in thread

* [PATCH 02/18] bcache: fix inaccurate result of unused buckets
  2019-04-24 16:48 [PATCH 00/18] bcache patches for Linux v5.2 Coly Li
  2019-04-24 16:48 ` [PATCH 01/18] bcache: fix crashes stopping bcache device before read miss done Coly Li
@ 2019-04-24 16:48 ` Coly Li
  2019-04-24 16:48 ` [PATCH 03/18] bcache: avoid clang -Wunintialized warning Coly Li
                   ` (16 subsequent siblings)
  18 siblings, 0 replies; 21+ messages in thread
From: Coly Li @ 2019-04-24 16:48 UTC (permalink / raw)
  To: axboe; +Cc: linux-bcache, linux-block, Guoju Fang, Coly Li

From: Guoju Fang <fangguoju@gmail.com>

To get the amount of unused buckets in sysfs_priority_stats, the code
count the buckets which GC_SECTORS_USED is zero. It's correct and should
not be overwritten by the count of buckets which prio is zero.

Signed-off-by: Guoju Fang <fangguoju@gmail.com>
Signed-off-by: Coly Li <colyli@suse.de>
---
 drivers/md/bcache/sysfs.c | 2 --
 1 file changed, 2 deletions(-)

diff --git a/drivers/md/bcache/sysfs.c b/drivers/md/bcache/sysfs.c
index 17bae9c14ca0..6cd44d3cf906 100644
--- a/drivers/md/bcache/sysfs.c
+++ b/drivers/md/bcache/sysfs.c
@@ -996,8 +996,6 @@ SHOW(__bch_cache)
 		       !cached[n - 1])
 			--n;
 
-		unused = ca->sb.nbuckets - n;
-
 		while (cached < p + n &&
 		       *cached == BTREE_PRIO)
 			cached++, n--;
-- 
2.16.4


^ permalink raw reply related	[flat|nested] 21+ messages in thread

* [PATCH 03/18] bcache: avoid clang -Wunintialized warning
  2019-04-24 16:48 [PATCH 00/18] bcache patches for Linux v5.2 Coly Li
  2019-04-24 16:48 ` [PATCH 01/18] bcache: fix crashes stopping bcache device before read miss done Coly Li
  2019-04-24 16:48 ` [PATCH 02/18] bcache: fix inaccurate result of unused buckets Coly Li
@ 2019-04-24 16:48 ` Coly Li
  2019-04-24 16:48 ` [PATCH 04/18] bcache: use kmemdup_nul for CACHED_LABEL buffer Coly Li
                   ` (15 subsequent siblings)
  18 siblings, 0 replies; 21+ messages in thread
From: Coly Li @ 2019-04-24 16:48 UTC (permalink / raw)
  To: axboe; +Cc: linux-bcache, linux-block, Arnd Bergmann, Coly Li

From: Arnd Bergmann <arnd@arndb.de>

clang has identified a code path in which it thinks a
variable may be unused:

drivers/md/bcache/alloc.c:333:4: error: variable 'bucket' is used uninitialized whenever 'if' condition is false
      [-Werror,-Wsometimes-uninitialized]
                        fifo_pop(&ca->free_inc, bucket);
                        ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
drivers/md/bcache/util.h:219:27: note: expanded from macro 'fifo_pop'
 #define fifo_pop(fifo, i)       fifo_pop_front(fifo, (i))
                                ^~~~~~~~~~~~~~~~~~~~~~~~~
drivers/md/bcache/util.h:189:6: note: expanded from macro 'fifo_pop_front'
        if (_r) {                                                       \
            ^~
drivers/md/bcache/alloc.c:343:46: note: uninitialized use occurs here
                        allocator_wait(ca, bch_allocator_push(ca, bucket));
                                                                  ^~~~~~
drivers/md/bcache/alloc.c:287:7: note: expanded from macro 'allocator_wait'
                if (cond)                                               \
                    ^~~~
drivers/md/bcache/alloc.c:333:4: note: remove the 'if' if its condition is always true
                        fifo_pop(&ca->free_inc, bucket);
                        ^
drivers/md/bcache/util.h:219:27: note: expanded from macro 'fifo_pop'
 #define fifo_pop(fifo, i)       fifo_pop_front(fifo, (i))
                                ^
drivers/md/bcache/util.h:189:2: note: expanded from macro 'fifo_pop_front'
        if (_r) {                                                       \
        ^
drivers/md/bcache/alloc.c:331:15: note: initialize the variable 'bucket' to silence this warning
                        long bucket;
                                   ^

This cannot happen in practice because we only enter the loop
if there is at least one element in the list.

Slightly rearranging the code makes this clearer to both the
reader and the compiler, which avoids the warning.

Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Reviewed-by: Nathan Chancellor <natechancellor@gmail.com>
Signed-off-by: Coly Li <colyli@suse.de>
---
 drivers/md/bcache/alloc.c | 5 +++--
 1 file changed, 3 insertions(+), 2 deletions(-)

diff --git a/drivers/md/bcache/alloc.c b/drivers/md/bcache/alloc.c
index 5002838ea476..f8986effcb50 100644
--- a/drivers/md/bcache/alloc.c
+++ b/drivers/md/bcache/alloc.c
@@ -327,10 +327,11 @@ static int bch_allocator_thread(void *arg)
 		 * possibly issue discards to them, then we add the bucket to
 		 * the free list:
 		 */
-		while (!fifo_empty(&ca->free_inc)) {
+		while (1) {
 			long bucket;
 
-			fifo_pop(&ca->free_inc, bucket);
+			if (!fifo_pop(&ca->free_inc, bucket))
+				break;
 
 			if (ca->discard) {
 				mutex_unlock(&ca->set->bucket_lock);
-- 
2.16.4


^ permalink raw reply related	[flat|nested] 21+ messages in thread

* [PATCH 04/18] bcache: use kmemdup_nul for CACHED_LABEL buffer
  2019-04-24 16:48 [PATCH 00/18] bcache patches for Linux v5.2 Coly Li
                   ` (2 preceding siblings ...)
  2019-04-24 16:48 ` [PATCH 03/18] bcache: avoid clang -Wunintialized warning Coly Li
@ 2019-04-24 16:48 ` Coly Li
  2019-04-24 16:48 ` [PATCH 05/18] bcache: Clean up bch_get_congested() Coly Li
                   ` (14 subsequent siblings)
  18 siblings, 0 replies; 21+ messages in thread
From: Coly Li @ 2019-04-24 16:48 UTC (permalink / raw)
  To: axboe; +Cc: linux-bcache, linux-block, Geliang Tang, Coly Li

From: Geliang Tang <geliangtang@gmail.com>

This patch uses kmemdup_nul to create a NUL-terminated string from
dc->sb.label. This is better than open coding it.

With this, we can move env[2] initialization into env[] array to make
code more elegant.

Signed-off-by: Geliang Tang <geliangtang@gmail.com>
Signed-off-by: Coly Li <colyli@suse.de>
---
 drivers/md/bcache/super.c | 10 ++++------
 1 file changed, 4 insertions(+), 6 deletions(-)

diff --git a/drivers/md/bcache/super.c b/drivers/md/bcache/super.c
index a697a3a923cd..6e618cb6126c 100644
--- a/drivers/md/bcache/super.c
+++ b/drivers/md/bcache/super.c
@@ -906,21 +906,18 @@ static int cached_dev_status_update(void *arg)
 void bch_cached_dev_run(struct cached_dev *dc)
 {
 	struct bcache_device *d = &dc->disk;
-	char buf[SB_LABEL_SIZE + 1];
+	char *buf = kmemdup_nul(dc->sb.label, SB_LABEL_SIZE, GFP_KERNEL);
 	char *env[] = {
 		"DRIVER=bcache",
 		kasprintf(GFP_KERNEL, "CACHED_UUID=%pU", dc->sb.uuid),
-		NULL,
+		kasprintf(GFP_KERNEL, "CACHED_LABEL=%s", buf ? : ""),
 		NULL,
 	};
 
-	memcpy(buf, dc->sb.label, SB_LABEL_SIZE);
-	buf[SB_LABEL_SIZE] = '\0';
-	env[2] = kasprintf(GFP_KERNEL, "CACHED_LABEL=%s", buf);
-
 	if (atomic_xchg(&dc->running, 1)) {
 		kfree(env[1]);
 		kfree(env[2]);
+		kfree(buf);
 		return;
 	}
 
@@ -944,6 +941,7 @@ void bch_cached_dev_run(struct cached_dev *dc)
 	kobject_uevent_env(&disk_to_dev(d->disk)->kobj, KOBJ_CHANGE, env);
 	kfree(env[1]);
 	kfree(env[2]);
+	kfree(buf);
 
 	if (sysfs_create_link(&d->kobj, &disk_to_dev(d->disk)->kobj, "dev") ||
 	    sysfs_create_link(&disk_to_dev(d->disk)->kobj, &d->kobj, "bcache"))
-- 
2.16.4


^ permalink raw reply related	[flat|nested] 21+ messages in thread

* [PATCH 05/18] bcache: Clean up bch_get_congested()
  2019-04-24 16:48 [PATCH 00/18] bcache patches for Linux v5.2 Coly Li
                   ` (3 preceding siblings ...)
  2019-04-24 16:48 ` [PATCH 04/18] bcache: use kmemdup_nul for CACHED_LABEL buffer Coly Li
@ 2019-04-24 16:48 ` Coly Li
  2019-04-24 16:48 ` [PATCH 06/18] bcache: fix a race between cache register and cacheset unregister Coly Li
                   ` (13 subsequent siblings)
  18 siblings, 0 replies; 21+ messages in thread
From: Coly Li @ 2019-04-24 16:48 UTC (permalink / raw)
  To: axboe; +Cc: linux-bcache, linux-block, George Spelvin, Coly Li

From: George Spelvin <lkml@sdf.org>

There are a few nits in this function.  They could in theory all
be separate patches, but that's probably taking small commits
too far.

1) I added a brief comment saying what it does.

2) I like to declare pointer parameters "const" where possible
   for documentation reasons.

3) It uses bitmap_weight(&rand, BITS_PER_LONG) to compute the Hamming
weight of a 32-bit random number (giving a random integer with
mean 16 and variance 8).  Passing by reference in a 64-bit variable
is silly; just use hweight32().

4) Its helper function fract_exp_two is unnecessarily tangled.
Gcc can optimize the multiply by (1 << x) to a shift, but it can
be written in a much more straightforward way at the cost of one
more bit of internal precision.  Some analysis reveals that this
bit is always available.

This shrinks the object code for fract_exp_two(x, 6) from 23 bytes:

0000000000000000 <foo1>:
   0:   89 f9                   mov    %edi,%ecx
   2:   c1 e9 06                shr    $0x6,%ecx
   5:   b8 01 00 00 00          mov    $0x1,%eax
   a:   d3 e0                   shl    %cl,%eax
   c:   83 e7 3f                and    $0x3f,%edi
   f:   d3 e7                   shl    %cl,%edi
  11:   c1 ef 06                shr    $0x6,%edi
  14:   01 f8                   add    %edi,%eax
  16:   c3                      retq

To 19:

0000000000000017 <foo2>:
  17:   89 f8                   mov    %edi,%eax
  19:   83 e0 3f                and    $0x3f,%eax
  1c:   83 c0 40                add    $0x40,%eax
  1f:   89 f9                   mov    %edi,%ecx
  21:   c1 e9 06                shr    $0x6,%ecx
  24:   d3 e0                   shl    %cl,%eax
  26:   c1 e8 06                shr    $0x6,%eax
  29:   c3                      retq

(Verified with 0 <= frac_bits <= 8, 0 <= x < 16<<frac_bits;
both versions produce the same output.)

5) And finally, the call to bch_get_congested() in check_should_bypass()
is separated from the use of the value by multiple tests which
could moot the need to compute it.  Move the computation down to
where it's needed.  This also saves a local register to hold the
computed value.

Signed-off-by: George Spelvin <lkml@sdf.org>
Signed-off-by: Coly Li <colyli@suse.de>
---
 drivers/md/bcache/request.c | 15 ++++++++-------
 drivers/md/bcache/request.h |  2 +-
 drivers/md/bcache/util.h    | 26 +++++++++++++++++++-------
 3 files changed, 28 insertions(+), 15 deletions(-)

diff --git a/drivers/md/bcache/request.c b/drivers/md/bcache/request.c
index f11123079fe0..41adcd1546f1 100644
--- a/drivers/md/bcache/request.c
+++ b/drivers/md/bcache/request.c
@@ -329,12 +329,13 @@ void bch_data_insert(struct closure *cl)
 	bch_data_insert_start(cl);
 }
 
-/* Congested? */
-
-unsigned int bch_get_congested(struct cache_set *c)
+/*
+ * Congested?  Return 0 (not congested) or the limit (in sectors)
+ * beyond which we should bypass the cache due to congestion.
+ */
+unsigned int bch_get_congested(const struct cache_set *c)
 {
 	int i;
-	long rand;
 
 	if (!c->congested_read_threshold_us &&
 	    !c->congested_write_threshold_us)
@@ -353,8 +354,7 @@ unsigned int bch_get_congested(struct cache_set *c)
 	if (i > 0)
 		i = fract_exp_two(i, 6);
 
-	rand = get_random_int();
-	i -= bitmap_weight(&rand, BITS_PER_LONG);
+	i -= hweight32(get_random_u32());
 
 	return i > 0 ? i : 1;
 }
@@ -376,7 +376,7 @@ static bool check_should_bypass(struct cached_dev *dc, struct bio *bio)
 {
 	struct cache_set *c = dc->disk.c;
 	unsigned int mode = cache_mode(dc);
-	unsigned int sectors, congested = bch_get_congested(c);
+	unsigned int sectors, congested;
 	struct task_struct *task = current;
 	struct io *i;
 
@@ -412,6 +412,7 @@ static bool check_should_bypass(struct cached_dev *dc, struct bio *bio)
 			goto rescale;
 	}
 
+	congested = bch_get_congested(c);
 	if (!congested && !dc->sequential_cutoff)
 		goto rescale;
 
diff --git a/drivers/md/bcache/request.h b/drivers/md/bcache/request.h
index 721bf336ed1a..c64dbd7a91aa 100644
--- a/drivers/md/bcache/request.h
+++ b/drivers/md/bcache/request.h
@@ -33,7 +33,7 @@ struct data_insert_op {
 	BKEY_PADDED(replace_key);
 };
 
-unsigned int bch_get_congested(struct cache_set *c);
+unsigned int bch_get_congested(const struct cache_set *c);
 void bch_data_insert(struct closure *cl);
 
 void bch_cached_dev_request_init(struct cached_dev *dc);
diff --git a/drivers/md/bcache/util.h b/drivers/md/bcache/util.h
index 00aab6abcfe4..1fbced94e4cc 100644
--- a/drivers/md/bcache/util.h
+++ b/drivers/md/bcache/util.h
@@ -560,17 +560,29 @@ static inline uint64_t bch_crc64_update(uint64_t crc,
 	return crc;
 }
 
-/* Does linear interpolation between powers of two */
+/*
+ * A stepwise-linear pseudo-exponential.  This returns 1 << (x >>
+ * frac_bits), with the less-significant bits filled in by linear
+ * interpolation.
+ *
+ * This can also be interpreted as a floating-point number format,
+ * where the low frac_bits are the mantissa (with implicit leading
+ * 1 bit), and the more significant bits are the exponent.
+ * The return value is 1.mantissa * 2^exponent.
+ *
+ * The way this is used, fract_bits is 6 and the largest possible
+ * input is CONGESTED_MAX-1 = 1023 (exponent 16, mantissa 0x1.fc),
+ * so the maximum output is 0x1fc00.
+ */
 static inline unsigned int fract_exp_two(unsigned int x,
 					 unsigned int fract_bits)
 {
-	unsigned int fract = x & ~(~0 << fract_bits);
-
-	x >>= fract_bits;
-	x   = 1 << x;
-	x  += (x * fract) >> fract_bits;
+	unsigned int mantissa = 1 << fract_bits;	/* Implicit bit */
 
-	return x;
+	mantissa += x & (mantissa - 1);
+	x >>= fract_bits;	/* The exponent */
+	/* Largest intermediate value 0x7f0000 */
+	return mantissa << x >> fract_bits;
 }
 
 void bch_bio_map(struct bio *bio, void *base);
-- 
2.16.4


^ permalink raw reply related	[flat|nested] 21+ messages in thread

* [PATCH 06/18] bcache: fix a race between cache register and cacheset unregister
  2019-04-24 16:48 [PATCH 00/18] bcache patches for Linux v5.2 Coly Li
                   ` (4 preceding siblings ...)
  2019-04-24 16:48 ` [PATCH 05/18] bcache: Clean up bch_get_congested() Coly Li
@ 2019-04-24 16:48 ` Coly Li
  2019-04-24 16:48 ` [PATCH 07/18] bcache: move definition of 'int ret' out of macro read_bucket() Coly Li
                   ` (12 subsequent siblings)
  18 siblings, 0 replies; 21+ messages in thread
From: Coly Li @ 2019-04-24 16:48 UTC (permalink / raw)
  To: axboe; +Cc: linux-bcache, linux-block, Liang Chen, stable, Coly Li

From: Liang Chen <liangchen.linux@gmail.com>

There is a race between cache device register and cache set unregister.
For an already registered cache device, register_bcache will call
bch_is_open to iterate through all cachesets and check every cache
there. The race occurs if cache_set_free executes at the same time and
clears the caches right before ca is dereferenced in bch_is_open_cache.
To close the race, let's make sure the clean up work is protected by
the bch_register_lock as well.

This issue can be reproduced as follows,
while true; do echo /dev/XXX> /sys/fs/bcache/register ; done&
while true; do echo 1> /sys/block/XXX/bcache/set/unregister ; done &

and results in the following oops,

[  +0.000053] BUG: unable to handle kernel NULL pointer dereference at 0000000000000998
[  +0.000457] #PF error: [normal kernel read fault]
[  +0.000464] PGD 800000003ca9d067 P4D 800000003ca9d067 PUD 3ca9c067 PMD 0
[  +0.000388] Oops: 0000 [#1] SMP PTI
[  +0.000269] CPU: 1 PID: 3266 Comm: bash Not tainted 5.0.0+ #6
[  +0.000346] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.11.0-2.fc28 04/01/2014
[  +0.000472] RIP: 0010:register_bcache+0x1829/0x1990 [bcache]
[  +0.000344] Code: b0 48 83 e8 50 48 81 fa e0 e1 10 c0 0f 84 a9 00 00 00 48 89 c6 48 89 ca 0f b7 ba 54 04 00 00 4c 8b 82 60 0c 00 00 85 ff 74 2f <49> 3b a8 98 09 00 00 74 4e 44 8d 47 ff 31 ff 49 c1 e0 03 eb 0d
[  +0.000839] RSP: 0018:ffff92ee804cbd88 EFLAGS: 00010202
[  +0.000328] RAX: ffffffffc010e190 RBX: ffff918b5c6b5000 RCX: ffff918b7d8e0000
[  +0.000399] RDX: ffff918b7d8e0000 RSI: ffffffffc010e190 RDI: 0000000000000001
[  +0.000398] RBP: ffff918b7d318340 R08: 0000000000000000 R09: ffffffffb9bd2d7a
[  +0.000385] R10: ffff918b7eb253c0 R11: ffffb95980f51200 R12: ffffffffc010e1a0
[  +0.000411] R13: fffffffffffffff2 R14: 000000000000000b R15: ffff918b7e232620
[  +0.000384] FS:  00007f955bec2740(0000) GS:ffff918b7eb00000(0000) knlGS:0000000000000000
[  +0.000420] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  +0.000801] CR2: 0000000000000998 CR3: 000000003cad6000 CR4: 00000000001406e0
[  +0.000837] Call Trace:
[  +0.000682]  ? _cond_resched+0x10/0x20
[  +0.000691]  ? __kmalloc+0x131/0x1b0
[  +0.000710]  kernfs_fop_write+0xfa/0x170
[  +0.000733]  __vfs_write+0x2e/0x190
[  +0.000688]  ? inode_security+0x10/0x30
[  +0.000698]  ? selinux_file_permission+0xd2/0x120
[  +0.000752]  ? security_file_permission+0x2b/0x100
[  +0.000753]  vfs_write+0xa8/0x1a0
[  +0.000676]  ksys_write+0x4d/0xb0
[  +0.000699]  do_syscall_64+0x3a/0xf0
[  +0.000692]  entry_SYSCALL_64_after_hwframe+0x44/0xa9

Signed-off-by: Liang Chen <liangchen.linux@gmail.com>
Cc: stable@vger.kernel.org
Signed-off-by: Coly Li <colyli@suse.de>
---
 drivers/md/bcache/super.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/md/bcache/super.c b/drivers/md/bcache/super.c
index 6e618cb6126c..53c5e3e0ac22 100644
--- a/drivers/md/bcache/super.c
+++ b/drivers/md/bcache/super.c
@@ -1514,6 +1514,7 @@ static void cache_set_free(struct closure *cl)
 	bch_btree_cache_free(c);
 	bch_journal_free(c);
 
+	mutex_lock(&bch_register_lock);
 	for_each_cache(ca, c, i)
 		if (ca) {
 			ca->set = NULL;
@@ -1532,7 +1533,6 @@ static void cache_set_free(struct closure *cl)
 	mempool_exit(&c->search);
 	kfree(c->devices);
 
-	mutex_lock(&bch_register_lock);
 	list_del(&c->list);
 	mutex_unlock(&bch_register_lock);
 
-- 
2.16.4


^ permalink raw reply related	[flat|nested] 21+ messages in thread

* [PATCH 07/18] bcache: move definition of 'int ret' out of macro read_bucket()
  2019-04-24 16:48 [PATCH 00/18] bcache patches for Linux v5.2 Coly Li
                   ` (5 preceding siblings ...)
  2019-04-24 16:48 ` [PATCH 06/18] bcache: fix a race between cache register and cacheset unregister Coly Li
@ 2019-04-24 16:48 ` Coly Li
  2019-04-24 16:48 ` [PATCH 08/18] bcache: never set KEY_PTRS of jouranl key to 0 in journal_reclaim() Coly Li
                   ` (11 subsequent siblings)
  18 siblings, 0 replies; 21+ messages in thread
From: Coly Li @ 2019-04-24 16:48 UTC (permalink / raw)
  To: axboe; +Cc: linux-bcache, linux-block, Coly Li

'int ret' is defined as a local variable inside macro read_bucket().
Since this macro is called multiple times, and following patches will
use a 'int ret' variable in bch_journal_read(), this patch moves
definition of 'int ret' from macro read_bucket() to range of function
bch_journal_read().

Signed-off-by: Coly Li <colyli@suse.de>
Reviewed-by: Hannes Reinecke <hare@suse.com>
---
 drivers/md/bcache/journal.c | 5 +++--
 1 file changed, 3 insertions(+), 2 deletions(-)

diff --git a/drivers/md/bcache/journal.c b/drivers/md/bcache/journal.c
index b2fd412715b1..6e18057d1d82 100644
--- a/drivers/md/bcache/journal.c
+++ b/drivers/md/bcache/journal.c
@@ -147,7 +147,7 @@ int bch_journal_read(struct cache_set *c, struct list_head *list)
 {
 #define read_bucket(b)							\
 	({								\
-		int ret = journal_read_bucket(ca, list, b);		\
+		ret = journal_read_bucket(ca, list, b);			\
 		__set_bit(b, bitmap);					\
 		if (ret < 0)						\
 			return ret;					\
@@ -156,6 +156,7 @@ int bch_journal_read(struct cache_set *c, struct list_head *list)
 
 	struct cache *ca;
 	unsigned int iter;
+	int ret = 0;
 
 	for_each_cache(ca, c, iter) {
 		struct journal_device *ja = &ca->journal;
@@ -267,7 +268,7 @@ int bch_journal_read(struct cache_set *c, struct list_head *list)
 					    struct journal_replay,
 					    list)->j.seq;
 
-	return 0;
+	return ret;
 #undef read_bucket
 }
 
-- 
2.16.4


^ permalink raw reply related	[flat|nested] 21+ messages in thread

* [PATCH 08/18] bcache: never set KEY_PTRS of jouranl key to 0 in journal_reclaim()
  2019-04-24 16:48 [PATCH 00/18] bcache patches for Linux v5.2 Coly Li
                   ` (6 preceding siblings ...)
  2019-04-24 16:48 ` [PATCH 07/18] bcache: move definition of 'int ret' out of macro read_bucket() Coly Li
@ 2019-04-24 16:48 ` Coly Li
  2019-04-24 16:48 ` [PATCH 09/18] bcache: add failure check to run_cache_set() for journal replay Coly Li
                   ` (10 subsequent siblings)
  18 siblings, 0 replies; 21+ messages in thread
From: Coly Li @ 2019-04-24 16:48 UTC (permalink / raw)
  To: axboe; +Cc: linux-bcache, linux-block, Coly Li, stable

In journal_reclaim() ja->cur_idx of each cache will be update to
reclaim available journal buckets. Variable 'int n' is used to count how
many cache is successfully reclaimed, then n is set to c->journal.key
by SET_KEY_PTRS(). Later in journal_write_unlocked(), a for_each_cache()
loop will write the jset data onto each cache.

The problem is, if all jouranl buckets on each cache is full, the
following code in journal_reclaim(),

529 for_each_cache(ca, c, iter) {
530       struct journal_device *ja = &ca->journal;
531       unsigned int next = (ja->cur_idx + 1) % ca->sb.njournal_buckets;
532
533       /* No space available on this device */
534       if (next == ja->discard_idx)
535               continue;
536
537       ja->cur_idx = next;
538       k->ptr[n++] = MAKE_PTR(0,
539                         bucket_to_sector(c, ca->sb.d[ja->cur_idx]),
540                         ca->sb.nr_this_dev);
541 }
542
543 bkey_init(k);
544 SET_KEY_PTRS(k, n);

If there is no available bucket to reclaim, the if() condition at line
534 will always true, and n remains 0. Then at line 544, SET_KEY_PTRS()
will set KEY_PTRS field of c->journal.key to 0.

Setting KEY_PTRS field of c->journal.key to 0 is wrong. Because in
journal_write_unlocked() the journal data is written in following loop,

649	for (i = 0; i < KEY_PTRS(k); i++) {
650-671		submit journal data to cache device
672	}

If KEY_PTRS field is set to 0 in jouranl_reclaim(), the journal data
won't be written to cache device here. If system crahed or rebooted
before bkeys of the lost journal entries written into btree nodes, data
corruption will be reported during bcache reload after rebooting the
system.

Indeed there is only one cache in a cache set, there is no need to set
KEY_PTRS field in journal_reclaim() at all. But in order to keep the
for_each_cache() logic consistent for now, this patch fixes the above
problem by not setting 0 KEY_PTRS of journal key, if there is no bucket
available to reclaim.

Signed-off-by: Coly Li <colyli@suse.de>
Reviewed-by: Hannes Reinecke <hare@suse.com>
Cc: stable@vger.kernel.org
---
 drivers/md/bcache/journal.c | 11 +++++++----
 1 file changed, 7 insertions(+), 4 deletions(-)

diff --git a/drivers/md/bcache/journal.c b/drivers/md/bcache/journal.c
index 6e18057d1d82..5180bed911ef 100644
--- a/drivers/md/bcache/journal.c
+++ b/drivers/md/bcache/journal.c
@@ -541,11 +541,11 @@ static void journal_reclaim(struct cache_set *c)
 				  ca->sb.nr_this_dev);
 	}
 
-	bkey_init(k);
-	SET_KEY_PTRS(k, n);
-
-	if (n)
+	if (n) {
+		bkey_init(k);
+		SET_KEY_PTRS(k, n);
 		c->journal.blocks_free = c->sb.bucket_size >> c->block_bits;
+	}
 out:
 	if (!journal_full(&c->journal))
 		__closure_wake_up(&c->journal.wait);
@@ -672,6 +672,9 @@ static void journal_write_unlocked(struct closure *cl)
 		ca->journal.seq[ca->journal.cur_idx] = w->data->seq;
 	}
 
+	/* If KEY_PTRS(k) == 0, this jset gets lost in air */
+	BUG_ON(i == 0);
+
 	atomic_dec_bug(&fifo_back(&c->journal.pin));
 	bch_journal_next(&c->journal);
 	journal_reclaim(c);
-- 
2.16.4


^ permalink raw reply related	[flat|nested] 21+ messages in thread

* [PATCH 09/18] bcache: add failure check to run_cache_set() for journal replay
  2019-04-24 16:48 [PATCH 00/18] bcache patches for Linux v5.2 Coly Li
                   ` (7 preceding siblings ...)
  2019-04-24 16:48 ` [PATCH 08/18] bcache: never set KEY_PTRS of jouranl key to 0 in journal_reclaim() Coly Li
@ 2019-04-24 16:48 ` Coly Li
  2019-04-25  6:01   ` Hannes Reinecke
  2019-04-24 16:48 ` [PATCH 10/18] bcache: add comments for kobj release callback routine Coly Li
                   ` (9 subsequent siblings)
  18 siblings, 1 reply; 21+ messages in thread
From: Coly Li @ 2019-04-24 16:48 UTC (permalink / raw)
  To: axboe; +Cc: linux-bcache, linux-block, Coly Li

Currently run_cache_set() has no return value, if there is failure in
bch_journal_replay(), the caller of run_cache_set() has no idea about
such failure and just continue to execute following code after
run_cache_set().  The internal failure is triggered inside
bch_journal_replay() and being handled in async way. This behavior is
inefficient, while failure handling inside bch_journal_replay(), cache
register code is still running to start the cache set. Registering and
unregistering code running as same time may introduce some rare race
condition, and make the code to be more hard to be understood.

This patch adds return value to run_cache_set(), and returns -EIO if
bch_journal_rreplay() fails. Then caller of run_cache_set() may detect
such failure and stop registering code flow immedidately inside
register_cache_set().

If journal replay fails, run_cache_set() can report error immediately
to register_cache_set(). This patch makes the failure handling for
bch_journal_replay() be in synchronized way, easier to understand and
debug, and avoid poetential race condition for register-and-unregister
in same time.

Signed-off-by: Coly Li <colyli@suse.de>
---
 drivers/md/bcache/super.c | 17 ++++++++++++-----
 1 file changed, 12 insertions(+), 5 deletions(-)

diff --git a/drivers/md/bcache/super.c b/drivers/md/bcache/super.c
index 53c5e3e0ac22..8c7fdada0acf 100644
--- a/drivers/md/bcache/super.c
+++ b/drivers/md/bcache/super.c
@@ -1773,7 +1773,7 @@ struct cache_set *bch_cache_set_alloc(struct cache_sb *sb)
 	return NULL;
 }
 
-static void run_cache_set(struct cache_set *c)
+static int run_cache_set(struct cache_set *c)
 {
 	const char *err = "cannot allocate memory";
 	struct cached_dev *dc, *t;
@@ -1867,7 +1867,9 @@ static void run_cache_set(struct cache_set *c)
 		if (j->version < BCACHE_JSET_VERSION_UUID)
 			__uuid_write(c);
 
-		bch_journal_replay(c, &journal);
+		err = "bcache: replay journal failed";
+		if (bch_journal_replay(c, &journal))
+			goto err;
 	} else {
 		pr_notice("invalidating existing data");
 
@@ -1935,11 +1937,13 @@ static void run_cache_set(struct cache_set *c)
 	flash_devs_run(c);
 
 	set_bit(CACHE_SET_RUNNING, &c->flags);
-	return;
+	return 0;
 err:
 	closure_sync(&cl);
 	/* XXX: test this, it's broken */
 	bch_cache_set_error(c, "%s", err);
+
+	return -EIO;
 }
 
 static bool can_attach_cache(struct cache *ca, struct cache_set *c)
@@ -2003,8 +2007,11 @@ static const char *register_cache_set(struct cache *ca)
 	ca->set->cache[ca->sb.nr_this_dev] = ca;
 	c->cache_by_alloc[c->caches_loaded++] = ca;
 
-	if (c->caches_loaded == c->sb.nr_in_set)
-		run_cache_set(c);
+	if (c->caches_loaded == c->sb.nr_in_set) {
+		err = "failed to run cache set";
+		if (run_cache_set(c) < 0)
+			goto err;
+	}
 
 	return NULL;
 err:
-- 
2.16.4


^ permalink raw reply related	[flat|nested] 21+ messages in thread

* [PATCH 10/18] bcache: add comments for kobj release callback routine
  2019-04-24 16:48 [PATCH 00/18] bcache patches for Linux v5.2 Coly Li
                   ` (8 preceding siblings ...)
  2019-04-24 16:48 ` [PATCH 09/18] bcache: add failure check to run_cache_set() for journal replay Coly Li
@ 2019-04-24 16:48 ` Coly Li
  2019-04-24 16:48 ` [PATCH 11/18] bcache: return error immediately in bch_journal_replay() Coly Li
                   ` (8 subsequent siblings)
  18 siblings, 0 replies; 21+ messages in thread
From: Coly Li @ 2019-04-24 16:48 UTC (permalink / raw)
  To: axboe; +Cc: linux-bcache, linux-block, Coly Li

Bcache has several routines to release resources in implicit way, they
are called when the associated kobj released. This patch adds code
comments to notice when and which release callback will be called,
- When dc->disk.kobj released:
  void bch_cached_dev_release(struct kobject *kobj)
- When d->kobj released:
  void bch_flash_dev_release(struct kobject *kobj)
- When c->kobj released:
  void bch_cache_set_release(struct kobject *kobj)
- When ca->kobj released
  void bch_cache_release(struct kobject *kobj)

Signed-off-by: Coly Li <colyli@suse.de>
Reviewed-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com>
Reviewed-by: Hannes Reinecke <hare@suse.com>
---
 drivers/md/bcache/super.c | 4 ++++
 1 file changed, 4 insertions(+)

diff --git a/drivers/md/bcache/super.c b/drivers/md/bcache/super.c
index 8c7fdada0acf..f8d80adcafec 100644
--- a/drivers/md/bcache/super.c
+++ b/drivers/md/bcache/super.c
@@ -1172,6 +1172,7 @@ int bch_cached_dev_attach(struct cached_dev *dc, struct cache_set *c,
 	return 0;
 }
 
+/* when dc->disk.kobj released */
 void bch_cached_dev_release(struct kobject *kobj)
 {
 	struct cached_dev *dc = container_of(kobj, struct cached_dev,
@@ -1324,6 +1325,7 @@ static void register_bdev(struct cache_sb *sb, struct page *sb_page,
 
 /* Flash only volumes */
 
+/* When d->kobj released */
 void bch_flash_dev_release(struct kobject *kobj)
 {
 	struct bcache_device *d = container_of(kobj, struct bcache_device,
@@ -1494,6 +1496,7 @@ bool bch_cache_set_error(struct cache_set *c, const char *fmt, ...)
 	return true;
 }
 
+/* When c->kobj released */
 void bch_cache_set_release(struct kobject *kobj)
 {
 	struct cache_set *c = container_of(kobj, struct cache_set, kobj);
@@ -2021,6 +2024,7 @@ static const char *register_cache_set(struct cache *ca)
 
 /* Cache device */
 
+/* When ca->kobj released */
 void bch_cache_release(struct kobject *kobj)
 {
 	struct cache *ca = container_of(kobj, struct cache, kobj);
-- 
2.16.4


^ permalink raw reply related	[flat|nested] 21+ messages in thread

* [PATCH 11/18] bcache: return error immediately in bch_journal_replay()
  2019-04-24 16:48 [PATCH 00/18] bcache patches for Linux v5.2 Coly Li
                   ` (9 preceding siblings ...)
  2019-04-24 16:48 ` [PATCH 10/18] bcache: add comments for kobj release callback routine Coly Li
@ 2019-04-24 16:48 ` Coly Li
  2019-04-24 16:48 ` [PATCH 12/18] bcache: add error check for calling register_bdev() Coly Li
                   ` (7 subsequent siblings)
  18 siblings, 0 replies; 21+ messages in thread
From: Coly Li @ 2019-04-24 16:48 UTC (permalink / raw)
  To: axboe; +Cc: linux-bcache, linux-block, Coly Li

When failure happens inside bch_journal_replay(), calling
cache_set_err_on() and handling the failure in async way is not a good
idea. Because after bch_journal_replay() returns, registering code will
continue to execute following steps, and unregistering code triggered
by cache_set_err_on() is running in same time. First it is unnecessary
to handle failure and unregister cache set in an async way, second there
might be potential race condition to run register and unregister code
for same cache set.

So in this patch, if failure happens in bch_journal_replay(), we don't
call cache_set_err_on(), and just print out the same error message to
kernel message buffer, then return -EIO immediately caller. Then caller
can detect such failure and handle it in synchrnozied way.

Signed-off-by: Coly Li <colyli@suse.de>
Reviewed-by: Hannes Reinecke <hare@suse.com>
---
 drivers/md/bcache/journal.c | 9 ++++++---
 1 file changed, 6 insertions(+), 3 deletions(-)

diff --git a/drivers/md/bcache/journal.c b/drivers/md/bcache/journal.c
index 5180bed911ef..828ab474696a 100644
--- a/drivers/md/bcache/journal.c
+++ b/drivers/md/bcache/journal.c
@@ -331,9 +331,12 @@ int bch_journal_replay(struct cache_set *s, struct list_head *list)
 	list_for_each_entry(i, list, list) {
 		BUG_ON(i->pin && atomic_read(i->pin) != 1);
 
-		cache_set_err_on(n != i->j.seq, s,
-"bcache: journal entries %llu-%llu missing! (replaying %llu-%llu)",
-				 n, i->j.seq - 1, start, end);
+		if (n != i->j.seq) {
+			pr_err("bcache: journal entries %llu-%llu missing! (replaying %llu-%llu)",
+			n, i->j.seq - 1, start, end);
+			ret = -EIO;
+			goto err;
+		}
 
 		for (k = i->j.start;
 		     k < bset_bkey_last(&i->j);
-- 
2.16.4


^ permalink raw reply related	[flat|nested] 21+ messages in thread

* [PATCH 12/18] bcache: add error check for calling register_bdev()
  2019-04-24 16:48 [PATCH 00/18] bcache patches for Linux v5.2 Coly Li
                   ` (10 preceding siblings ...)
  2019-04-24 16:48 ` [PATCH 11/18] bcache: return error immediately in bch_journal_replay() Coly Li
@ 2019-04-24 16:48 ` Coly Li
  2019-04-24 16:48 ` [PATCH 13/18] bcache: Add comments for blkdev_put() in registration code path Coly Li
                   ` (6 subsequent siblings)
  18 siblings, 0 replies; 21+ messages in thread
From: Coly Li @ 2019-04-24 16:48 UTC (permalink / raw)
  To: axboe; +Cc: linux-bcache, linux-block, Coly Li

This patch adds return value to register_bdev(). Then if failure happens
inside register_bdev(), its caller register_bcache() may detect and
handle the failure more properly.

Signed-off-by: Coly Li <colyli@suse.de>
Reviewed-by: Hannes Reinecke <hare@suse.com>
---
 drivers/md/bcache/super.c | 16 ++++++++++------
 1 file changed, 10 insertions(+), 6 deletions(-)

diff --git a/drivers/md/bcache/super.c b/drivers/md/bcache/super.c
index f8d80adcafec..fde334939545 100644
--- a/drivers/md/bcache/super.c
+++ b/drivers/md/bcache/super.c
@@ -1279,7 +1279,7 @@ static int cached_dev_init(struct cached_dev *dc, unsigned int block_size)
 
 /* Cached device - bcache superblock */
 
-static void register_bdev(struct cache_sb *sb, struct page *sb_page,
+static int register_bdev(struct cache_sb *sb, struct page *sb_page,
 				 struct block_device *bdev,
 				 struct cached_dev *dc)
 {
@@ -1317,10 +1317,11 @@ static void register_bdev(struct cache_sb *sb, struct page *sb_page,
 	    BDEV_STATE(&dc->sb) == BDEV_STATE_STALE)
 		bch_cached_dev_run(dc);
 
-	return;
+	return 0;
 err:
 	pr_notice("error %s: %s", dc->backing_dev_name, err);
 	bcache_device_stop(&dc->disk);
+	return -EIO;
 }
 
 /* Flash only volumes */
@@ -2271,7 +2272,7 @@ static bool bch_is_open(struct block_device *bdev)
 static ssize_t register_bcache(struct kobject *k, struct kobj_attribute *attr,
 			       const char *buffer, size_t size)
 {
-	ssize_t ret = size;
+	ssize_t ret = -EINVAL;
 	const char *err = "cannot allocate memory";
 	char *path = NULL;
 	struct cache_sb *sb = NULL;
@@ -2305,7 +2306,7 @@ static ssize_t register_bcache(struct kobject *k, struct kobj_attribute *attr,
 			if (!IS_ERR(bdev))
 				bdput(bdev);
 			if (attr == &ksysfs_register_quiet)
-				goto out;
+				goto quiet_out;
 		}
 		goto err;
 	}
@@ -2326,8 +2327,10 @@ static ssize_t register_bcache(struct kobject *k, struct kobj_attribute *attr,
 			goto err_close;
 
 		mutex_lock(&bch_register_lock);
-		register_bdev(sb, sb_page, bdev, dc);
+		ret = register_bdev(sb, sb_page, bdev, dc);
 		mutex_unlock(&bch_register_lock);
+		if (ret < 0)
+			goto err;
 	} else {
 		struct cache *ca = kzalloc(sizeof(*ca), GFP_KERNEL);
 
@@ -2337,6 +2340,8 @@ static ssize_t register_bcache(struct kobject *k, struct kobj_attribute *attr,
 		if (register_cache(sb, sb_page, bdev, ca) != 0)
 			goto err;
 	}
+quiet_out:
+	ret = size;
 out:
 	if (sb_page)
 		put_page(sb_page);
@@ -2349,7 +2354,6 @@ static ssize_t register_bcache(struct kobject *k, struct kobj_attribute *attr,
 	blkdev_put(bdev, FMODE_READ|FMODE_WRITE|FMODE_EXCL);
 err:
 	pr_info("error %s: %s", path, err);
-	ret = -EINVAL;
 	goto out;
 }
 
-- 
2.16.4


^ permalink raw reply related	[flat|nested] 21+ messages in thread

* [PATCH 13/18] bcache: Add comments for blkdev_put() in registration code path
  2019-04-24 16:48 [PATCH 00/18] bcache patches for Linux v5.2 Coly Li
                   ` (11 preceding siblings ...)
  2019-04-24 16:48 ` [PATCH 12/18] bcache: add error check for calling register_bdev() Coly Li
@ 2019-04-24 16:48 ` Coly Li
  2019-04-24 16:48 ` [PATCH 14/18] bcache: add comments for closure_fn to be called in closure_queue() Coly Li
                   ` (5 subsequent siblings)
  18 siblings, 0 replies; 21+ messages in thread
From: Coly Li @ 2019-04-24 16:48 UTC (permalink / raw)
  To: axboe; +Cc: linux-bcache, linux-block, Coly Li

Add comments to explain why in register_bcache() blkdev_put() won't
be called in two location. Add comments to explain why blkdev_put()
must be called in register_cache() when cache_alloc() failed.

Signed-off-by: Coly Li <colyli@suse.de>
Reviewed-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com>
Reviewed-by: Hannes Reinecke <hare@suse.com>
---
 drivers/md/bcache/super.c | 8 ++++++++
 1 file changed, 8 insertions(+)

diff --git a/drivers/md/bcache/super.c b/drivers/md/bcache/super.c
index fde334939545..fa856b2ca7af 100644
--- a/drivers/md/bcache/super.c
+++ b/drivers/md/bcache/super.c
@@ -2189,6 +2189,12 @@ static int register_cache(struct cache_sb *sb, struct page *sb_page,
 
 	ret = cache_alloc(ca);
 	if (ret != 0) {
+		/*
+		 * If we failed here, it means ca->kobj is not initialized yet,
+		 * kobject_put() won't be called and there is no chance to
+		 * call blkdev_put() to bdev in bch_cache_release(). So we
+		 * explicitly call blkdev_put() here.
+		 */
 		blkdev_put(bdev, FMODE_READ|FMODE_WRITE|FMODE_EXCL);
 		if (ret == -ENOMEM)
 			err = "cache_alloc(): -ENOMEM";
@@ -2329,6 +2335,7 @@ static ssize_t register_bcache(struct kobject *k, struct kobj_attribute *attr,
 		mutex_lock(&bch_register_lock);
 		ret = register_bdev(sb, sb_page, bdev, dc);
 		mutex_unlock(&bch_register_lock);
+		/* blkdev_put() will be called in cached_dev_free() */
 		if (ret < 0)
 			goto err;
 	} else {
@@ -2337,6 +2344,7 @@ static ssize_t register_bcache(struct kobject *k, struct kobj_attribute *attr,
 		if (!ca)
 			goto err_close;
 
+		/* blkdev_put() will be called in bch_cache_release() */
 		if (register_cache(sb, sb_page, bdev, ca) != 0)
 			goto err;
 	}
-- 
2.16.4


^ permalink raw reply related	[flat|nested] 21+ messages in thread

* [PATCH 14/18] bcache: add comments for closure_fn to be called in closure_queue()
  2019-04-24 16:48 [PATCH 00/18] bcache patches for Linux v5.2 Coly Li
                   ` (12 preceding siblings ...)
  2019-04-24 16:48 ` [PATCH 13/18] bcache: Add comments for blkdev_put() in registration code path Coly Li
@ 2019-04-24 16:48 ` Coly Li
  2019-04-24 16:48 ` [PATCH 15/18] bcache: improve bcache_reboot() Coly Li
                   ` (4 subsequent siblings)
  18 siblings, 0 replies; 21+ messages in thread
From: Coly Li @ 2019-04-24 16:48 UTC (permalink / raw)
  To: axboe; +Cc: linux-bcache, linux-block, Coly Li

Add code comments to explain which call back function might be called
for the closure_queue(). This is an effort to make code to be more
understandable for readers.

Signed-off-by: Coly Li <colyli@suse.de>
Reviewed-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com>
Reviewed-by: Hannes Reinecke <hare@suse.com>
---
 drivers/md/bcache/super.c | 6 ++++++
 1 file changed, 6 insertions(+)

diff --git a/drivers/md/bcache/super.c b/drivers/md/bcache/super.c
index fa856b2ca7af..0363ab534c8e 100644
--- a/drivers/md/bcache/super.c
+++ b/drivers/md/bcache/super.c
@@ -662,6 +662,11 @@ static const struct block_device_operations bcache_ops = {
 void bcache_device_stop(struct bcache_device *d)
 {
 	if (!test_and_set_bit(BCACHE_DEV_CLOSING, &d->flags))
+		/*
+		 * closure_fn set to
+		 * - cached device: cached_dev_flush()
+		 * - flash dev: flash_dev_flush()
+		 */
 		closure_queue(&d->cl);
 }
 
@@ -1675,6 +1680,7 @@ static void __cache_set_unregister(struct closure *cl)
 void bch_cache_set_stop(struct cache_set *c)
 {
 	if (!test_and_set_bit(CACHE_SET_STOPPING, &c->flags))
+		/* closure_fn set to __cache_set_unregister() */
 		closure_queue(&c->caching);
 }
 
-- 
2.16.4


^ permalink raw reply related	[flat|nested] 21+ messages in thread

* [PATCH 15/18] bcache: improve bcache_reboot()
  2019-04-24 16:48 [PATCH 00/18] bcache patches for Linux v5.2 Coly Li
                   ` (13 preceding siblings ...)
  2019-04-24 16:48 ` [PATCH 14/18] bcache: add comments for closure_fn to be called in closure_queue() Coly Li
@ 2019-04-24 16:48 ` Coly Li
  2019-04-24 16:48 ` [PATCH 16/18] bcache: fix failure in journal relplay Coly Li
                   ` (3 subsequent siblings)
  18 siblings, 0 replies; 21+ messages in thread
From: Coly Li @ 2019-04-24 16:48 UTC (permalink / raw)
  To: axboe; +Cc: linux-bcache, linux-block, Coly Li

This patch tries to release mutex bch_register_lock early, to give
chance to stop cache set and bcache device early.

This patch also expends time out of stopping all bcache device from
2 seconds to 10 seconds, because stopping writeback rate update worker
may delay for 5 seconds, 2 seconds is not enough.

After this patch applied, stopping bcache devices during system reboot
or shutdown is very hard to be observed any more.

Signed-off-by: Coly Li <colyli@suse.de>
Reviewed-by: Hannes Reinecke <hare@suse.com>
---
 drivers/md/bcache/super.c | 12 ++++++++++--
 1 file changed, 10 insertions(+), 2 deletions(-)

diff --git a/drivers/md/bcache/super.c b/drivers/md/bcache/super.c
index 0363ab534c8e..3f34b96ebbc3 100644
--- a/drivers/md/bcache/super.c
+++ b/drivers/md/bcache/super.c
@@ -2397,10 +2397,19 @@ static int bcache_reboot(struct notifier_block *n, unsigned long code, void *x)
 		list_for_each_entry_safe(dc, tdc, &uncached_devices, list)
 			bcache_device_stop(&dc->disk);
 
+		mutex_unlock(&bch_register_lock);
+
+		/*
+		 * Give an early chance for other kthreads and
+		 * kworkers to stop themselves
+		 */
+		schedule();
+
 		/* What's a condition variable? */
 		while (1) {
-			long timeout = start + 2 * HZ - jiffies;
+			long timeout = start + 10 * HZ - jiffies;
 
+			mutex_lock(&bch_register_lock);
 			stopped = list_empty(&bch_cache_sets) &&
 				list_empty(&uncached_devices);
 
@@ -2412,7 +2421,6 @@ static int bcache_reboot(struct notifier_block *n, unsigned long code, void *x)
 
 			mutex_unlock(&bch_register_lock);
 			schedule_timeout(timeout);
-			mutex_lock(&bch_register_lock);
 		}
 
 		finish_wait(&unregister_wait, &wait);
-- 
2.16.4


^ permalink raw reply related	[flat|nested] 21+ messages in thread

* [PATCH 16/18] bcache: fix failure in journal relplay
  2019-04-24 16:48 [PATCH 00/18] bcache patches for Linux v5.2 Coly Li
                   ` (14 preceding siblings ...)
  2019-04-24 16:48 ` [PATCH 15/18] bcache: improve bcache_reboot() Coly Li
@ 2019-04-24 16:48 ` Coly Li
  2019-04-24 16:48 ` [PATCH 17/18] bcache: fix wrong usage use-after-freed on keylist in out_nocoalesce branch of btree_gc_coalesce Coly Li
                   ` (2 subsequent siblings)
  18 siblings, 0 replies; 21+ messages in thread
From: Coly Li @ 2019-04-24 16:48 UTC (permalink / raw)
  To: axboe; +Cc: linux-bcache, linux-block, Tang Junhui, Coly Li

From: Tang Junhui <tang.junhui.linux@gmail.com>

journal replay failed with messages:
Sep 10 19:10:43 ceph kernel: bcache: error on
bb379a64-e44e-4812-b91d-a5599871a3b1: bcache: journal entries
2057493-2057567 missing! (replaying 2057493-2076601), disabling
caching

The reason is in journal_reclaim(), when discard is enabled, we send
discard command and reclaim those journal buckets whose seq is old
than the last_seq_now, but before we write a journal with last_seq_now,
the machine is restarted, so the journal with the last_seq_now is not
written to the journal bucket, and the last_seq_wrote in the newest
journal is old than last_seq_now which we expect to be, so when we doing
replay, journals from last_seq_wrote to last_seq_now are missing.

It's hard to write a journal immediately after journal_reclaim(),
and it harmless if those missed journal are caused by discarding
since those journals are already wrote to btree node. So, if miss
seqs are started from the beginning journal, we treat it as normal,
and only print a message to show the miss journal, and point out
it maybe caused by discarding.

Patch v2 add a judgement condition to ignore the missed journal
only when discard enabled as Coly suggested.

(Coly Li: rebase the patch with other changes in bch_journal_replay())

Signed-off-by: Tang Junhui <tang.junhui.linux@gmail.com>
Tested-by: Dennis Schridde <devurandom@gmx.net>
Signed-off-by: Coly Li <colyli@suse.de>
---
 drivers/md/bcache/journal.c | 25 +++++++++++++++++++++----
 1 file changed, 21 insertions(+), 4 deletions(-)

diff --git a/drivers/md/bcache/journal.c b/drivers/md/bcache/journal.c
index 828ab474696a..f9afb164b887 100644
--- a/drivers/md/bcache/journal.c
+++ b/drivers/md/bcache/journal.c
@@ -318,6 +318,18 @@ void bch_journal_mark(struct cache_set *c, struct list_head *list)
 	}
 }
 
+bool is_discard_enabled(struct cache_set *s)
+{
+	struct cache *ca;
+	unsigned int i;
+
+	for_each_cache(ca, s, i)
+		if (ca->discard)
+			return true;
+
+	return false;
+}
+
 int bch_journal_replay(struct cache_set *s, struct list_head *list)
 {
 	int ret = 0, keys = 0, entries = 0;
@@ -332,10 +344,15 @@ int bch_journal_replay(struct cache_set *s, struct list_head *list)
 		BUG_ON(i->pin && atomic_read(i->pin) != 1);
 
 		if (n != i->j.seq) {
-			pr_err("bcache: journal entries %llu-%llu missing! (replaying %llu-%llu)",
-			n, i->j.seq - 1, start, end);
-			ret = -EIO;
-			goto err;
+			if (n == start && is_discard_enabled(s))
+				pr_info("bcache: journal entries %llu-%llu may be discarded! (replaying %llu-%llu)",
+					n, i->j.seq - 1, start, end);
+			else {
+				pr_err("bcache: journal entries %llu-%llu missing! (replaying %llu-%llu)",
+					n, i->j.seq - 1, start, end);
+				ret = -EIO;
+				goto err;
+			}
 		}
 
 		for (k = i->j.start;
-- 
2.16.4


^ permalink raw reply related	[flat|nested] 21+ messages in thread

* [PATCH 17/18] bcache: fix wrong usage use-after-freed on keylist in out_nocoalesce branch of btree_gc_coalesce
  2019-04-24 16:48 [PATCH 00/18] bcache patches for Linux v5.2 Coly Li
                   ` (15 preceding siblings ...)
  2019-04-24 16:48 ` [PATCH 16/18] bcache: fix failure in journal relplay Coly Li
@ 2019-04-24 16:48 ` Coly Li
  2019-04-24 16:48 ` [PATCH 18/18] bcache: avoid potential memleak of list of journal_replay(s) in the CACHE_SYNC branch of run_cache_set Coly Li
  2019-04-24 16:57 ` [PATCH 00/18] bcache patches for Linux v5.2 Jens Axboe
  18 siblings, 0 replies; 21+ messages in thread
From: Coly Li @ 2019-04-24 16:48 UTC (permalink / raw)
  To: axboe; +Cc: linux-bcache, linux-block, Shenghui Wang, Coly Li

From: Shenghui Wang <shhuiw@foxmail.com>

Elements of keylist should be accessed before the list is freed.
Move bch_keylist_free() calling after the while loop to avoid wrong
content accessed.

Signed-off-by: Shenghui Wang <shhuiw@foxmail.com>
Signed-off-by: Coly Li <colyli@suse.de>
---
 drivers/md/bcache/btree.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/md/bcache/btree.c b/drivers/md/bcache/btree.c
index 64def336f053..b139858b0802 100644
--- a/drivers/md/bcache/btree.c
+++ b/drivers/md/bcache/btree.c
@@ -1476,11 +1476,11 @@ static int btree_gc_coalesce(struct btree *b, struct btree_op *op,
 
 out_nocoalesce:
 	closure_sync(&cl);
-	bch_keylist_free(&keylist);
 
 	while ((k = bch_keylist_pop(&keylist)))
 		if (!bkey_cmp(k, &ZERO_KEY))
 			atomic_dec(&b->c->prio_blocked);
+	bch_keylist_free(&keylist);
 
 	for (i = 0; i < nodes; i++)
 		if (!IS_ERR_OR_NULL(new_nodes[i])) {
-- 
2.16.4


^ permalink raw reply related	[flat|nested] 21+ messages in thread

* [PATCH 18/18] bcache: avoid potential memleak of list of journal_replay(s) in the CACHE_SYNC branch of run_cache_set
  2019-04-24 16:48 [PATCH 00/18] bcache patches for Linux v5.2 Coly Li
                   ` (16 preceding siblings ...)
  2019-04-24 16:48 ` [PATCH 17/18] bcache: fix wrong usage use-after-freed on keylist in out_nocoalesce branch of btree_gc_coalesce Coly Li
@ 2019-04-24 16:48 ` Coly Li
  2019-04-24 16:57 ` [PATCH 00/18] bcache patches for Linux v5.2 Jens Axboe
  18 siblings, 0 replies; 21+ messages in thread
From: Coly Li @ 2019-04-24 16:48 UTC (permalink / raw)
  To: axboe; +Cc: linux-bcache, linux-block, Shenghui Wang, Coly Li

From: Shenghui Wang <shhuiw@foxmail.com>

In the CACHE_SYNC branch of run_cache_set(), LIST_HEAD(journal) is used
to collect journal_replay(s) and filled by bch_journal_read().

If all goes well, bch_journal_replay() will release the list of
jounal_replay(s) at the end of the branch.

If something goes wrong, code flow will jump to the label "err:" and leave
the list unreleased.

This patch will release the list of journal_replay(s) in the case of
error detected.

v1 -> v2:
* Move the release code to the location after label 'err:' to
  simply the change.

Signed-off-by: Shenghui Wang <shhuiw@foxmail.com>
Signed-off-by: Coly Li <colyli@suse.de>
---
 drivers/md/bcache/super.c | 8 ++++++++
 1 file changed, 8 insertions(+)

diff --git a/drivers/md/bcache/super.c b/drivers/md/bcache/super.c
index 3f34b96ebbc3..0ffe9acee9d8 100644
--- a/drivers/md/bcache/super.c
+++ b/drivers/md/bcache/super.c
@@ -1790,6 +1790,8 @@ static int run_cache_set(struct cache_set *c)
 	struct cache *ca;
 	struct closure cl;
 	unsigned int i;
+	LIST_HEAD(journal);
+	struct journal_replay *l;
 
 	closure_init_stack(&cl);
 
@@ -1949,6 +1951,12 @@ static int run_cache_set(struct cache_set *c)
 	set_bit(CACHE_SET_RUNNING, &c->flags);
 	return 0;
 err:
+	while (!list_empty(&journal)) {
+		l = list_first_entry(&journal, struct journal_replay, list);
+		list_del(&l->list);
+		kfree(l);
+	}
+
 	closure_sync(&cl);
 	/* XXX: test this, it's broken */
 	bch_cache_set_error(c, "%s", err);
-- 
2.16.4


^ permalink raw reply related	[flat|nested] 21+ messages in thread

* Re: [PATCH 00/18] bcache patches for Linux v5.2
  2019-04-24 16:48 [PATCH 00/18] bcache patches for Linux v5.2 Coly Li
                   ` (17 preceding siblings ...)
  2019-04-24 16:48 ` [PATCH 18/18] bcache: avoid potential memleak of list of journal_replay(s) in the CACHE_SYNC branch of run_cache_set Coly Li
@ 2019-04-24 16:57 ` Jens Axboe
  18 siblings, 0 replies; 21+ messages in thread
From: Jens Axboe @ 2019-04-24 16:57 UTC (permalink / raw)
  To: Coly Li; +Cc: linux-bcache, linux-block

On 4/24/19 10:48 AM, Coly Li wrote:
> Hi Jens,
> 
> Here are the bcache patches for Linux v5.2.
> 
> Most of the patches are bug fixes. We don't introduce new feature in
> this run, people continue to focus on code quality improvement for now.
> 
> Juhui Tang and I make some fixes related to bcache journal, the changes
> include,
> - Fix KEY_PTRS set to 0 error under very heavy small I/O pressure, to
>   avoid potential missing for journal and uuid update to cache device.
> - Add code comments to make code more understandable.
> - Minor code cleanup.
> 
> Liang Chen, Guoju Fang and Shenghui Wang contribute useful and important
> fixes for,
> - Avoid a crash in cached_dev_read_done() by adding reference before
>   calling bio_complete()
> - Improve unused buckets statistic
> - Fix a race between cache register and unregister
> - Potential use-after-free and memory leak bugs
> 
> Arnd Bergmann and George Spelvin also contriubte useful code clean up
> to fix clang warning and make bch_get_congested() more clear.
> 
> Please pick them for the 5.2 merge window.

Applied, thanks Coly.

-- 
Jens Axboe


^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH 09/18] bcache: add failure check to run_cache_set() for journal replay
  2019-04-24 16:48 ` [PATCH 09/18] bcache: add failure check to run_cache_set() for journal replay Coly Li
@ 2019-04-25  6:01   ` Hannes Reinecke
  0 siblings, 0 replies; 21+ messages in thread
From: Hannes Reinecke @ 2019-04-25  6:01 UTC (permalink / raw)
  To: Coly Li, axboe; +Cc: linux-bcache, linux-block

On 4/24/19 6:48 PM, Coly Li wrote:
> Currently run_cache_set() has no return value, if there is failure in
> bch_journal_replay(), the caller of run_cache_set() has no idea about
> such failure and just continue to execute following code after
> run_cache_set().  The internal failure is triggered inside
> bch_journal_replay() and being handled in async way. This behavior is
> inefficient, while failure handling inside bch_journal_replay(), cache
> register code is still running to start the cache set. Registering and
> unregistering code running as same time may introduce some rare race
> condition, and make the code to be more hard to be understood.
> 
> This patch adds return value to run_cache_set(), and returns -EIO if
> bch_journal_rreplay() fails. Then caller of run_cache_set() may detect
> such failure and stop registering code flow immedidately inside
> register_cache_set().
> 
> If journal replay fails, run_cache_set() can report error immediately
> to register_cache_set(). This patch makes the failure handling for
> bch_journal_replay() be in synchronized way, easier to understand and
> debug, and avoid poetential race condition for register-and-unregister
> in same time.
> 
> Signed-off-by: Coly Li <colyli@suse.de>
> ---
>   drivers/md/bcache/super.c | 17 ++++++++++++-----
>   1 file changed, 12 insertions(+), 5 deletions(-)
> 
Reviewed-by: Hannes Reinecke <hare@suse.com>

Cheers,

Hannes
-- 
Dr. Hannes Reinecke		   Teamlead Storage & Networking
hare@suse.de			               +49 911 74053 688
SUSE LINUX GmbH, Maxfeldstr. 5, 90409 Nürnberg
GF: Felix Imendörffer, Mary Higgins, Sri Rasiah
HRB 21284 (AG Nürnberg)

^ permalink raw reply	[flat|nested] 21+ messages in thread

end of thread, other threads:[~2019-04-25  6:01 UTC | newest]

Thread overview: 21+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2019-04-24 16:48 [PATCH 00/18] bcache patches for Linux v5.2 Coly Li
2019-04-24 16:48 ` [PATCH 01/18] bcache: fix crashes stopping bcache device before read miss done Coly Li
2019-04-24 16:48 ` [PATCH 02/18] bcache: fix inaccurate result of unused buckets Coly Li
2019-04-24 16:48 ` [PATCH 03/18] bcache: avoid clang -Wunintialized warning Coly Li
2019-04-24 16:48 ` [PATCH 04/18] bcache: use kmemdup_nul for CACHED_LABEL buffer Coly Li
2019-04-24 16:48 ` [PATCH 05/18] bcache: Clean up bch_get_congested() Coly Li
2019-04-24 16:48 ` [PATCH 06/18] bcache: fix a race between cache register and cacheset unregister Coly Li
2019-04-24 16:48 ` [PATCH 07/18] bcache: move definition of 'int ret' out of macro read_bucket() Coly Li
2019-04-24 16:48 ` [PATCH 08/18] bcache: never set KEY_PTRS of jouranl key to 0 in journal_reclaim() Coly Li
2019-04-24 16:48 ` [PATCH 09/18] bcache: add failure check to run_cache_set() for journal replay Coly Li
2019-04-25  6:01   ` Hannes Reinecke
2019-04-24 16:48 ` [PATCH 10/18] bcache: add comments for kobj release callback routine Coly Li
2019-04-24 16:48 ` [PATCH 11/18] bcache: return error immediately in bch_journal_replay() Coly Li
2019-04-24 16:48 ` [PATCH 12/18] bcache: add error check for calling register_bdev() Coly Li
2019-04-24 16:48 ` [PATCH 13/18] bcache: Add comments for blkdev_put() in registration code path Coly Li
2019-04-24 16:48 ` [PATCH 14/18] bcache: add comments for closure_fn to be called in closure_queue() Coly Li
2019-04-24 16:48 ` [PATCH 15/18] bcache: improve bcache_reboot() Coly Li
2019-04-24 16:48 ` [PATCH 16/18] bcache: fix failure in journal relplay Coly Li
2019-04-24 16:48 ` [PATCH 17/18] bcache: fix wrong usage use-after-freed on keylist in out_nocoalesce branch of btree_gc_coalesce Coly Li
2019-04-24 16:48 ` [PATCH 18/18] bcache: avoid potential memleak of list of journal_replay(s) in the CACHE_SYNC branch of run_cache_set Coly Li
2019-04-24 16:57 ` [PATCH 00/18] bcache patches for Linux v5.2 Jens Axboe

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).