linux-bcache.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* [PATCH 00/10] bcache-next 20231120
@ 2023-11-20  5:24 Coly Li
  2023-11-20  5:24 ` [PATCH 01/10] bcache: avoid oversize memory allocation by small stripe_size Coly Li
                   ` (10 more replies)
  0 siblings, 11 replies; 12+ messages in thread
From: Coly Li @ 2023-11-20  5:24 UTC (permalink / raw)
  To: axboe; +Cc: linux-block, linux-bcache, Coly Li

Hi Jens,

Here are recent bcache patches tested by me on Linux v6.7-rc1 and can be
applied on branch block-6.7 of linux-block tree. 

In this series, Colin provids useful code cleanup, some small and still
important fixes are contributed from Mingzhe, Rand. Also there are some
fixes and a code comments patch from me.

When I fixed the issue found during testing, 6.7-rc1 was released. IMHO
it still makes sense to have most of the fixes into 6.7. Please consider
to take them in block-6.7 if it is possible.

Thanks in advance.

Coly Li
---

Colin Ian King (1):
  bcache: remove redundant assignment to variable cur_idx

Coly Li (5):
  bcache: avoid oversize memory allocation by small stripe_size
  bcache: check return value from btree_node_alloc_replacement()
  bcache: replace a mistaken IS_ERR() by IS_ERR_OR_NULL() in
    btree_gc_coalesce()
  bcache: add code comments for bch_btree_node_get() and
    __bch_btree_node_alloc()
  bcache: avoid NULL checking to c->root in run_cache_set()

Mingzhe Zou (3):
  bcache: fixup init dirty data errors
  bcache: fixup lock c->root error
  bcache: fixup multi-threaded bch_sectors_dirty_init() wake-up race

Rand Deeb (1):
  bcache: prevent potential division by zero error

 drivers/md/bcache/bcache.h    |  1 +
 drivers/md/bcache/btree.c     | 11 ++++++++++-
 drivers/md/bcache/super.c     |  4 +++-
 drivers/md/bcache/sysfs.c     |  2 +-
 drivers/md/bcache/writeback.c | 24 ++++++++++++++++++------
 5 files changed, 33 insertions(+), 9 deletions(-)

-- 
2.35.3


^ permalink raw reply	[flat|nested] 12+ messages in thread

* [PATCH 01/10] bcache: avoid oversize memory allocation by small stripe_size
  2023-11-20  5:24 [PATCH 00/10] bcache-next 20231120 Coly Li
@ 2023-11-20  5:24 ` Coly Li
  2023-11-20  5:24 ` [PATCH 02/10] bcache: check return value from btree_node_alloc_replacement() Coly Li
                   ` (9 subsequent siblings)
  10 siblings, 0 replies; 12+ messages in thread
From: Coly Li @ 2023-11-20  5:24 UTC (permalink / raw)
  To: axboe; +Cc: linux-block, linux-bcache, Coly Li, Andrea Tomassetti, Eric Wheeler

Arraies bcache->stripe_sectors_dirty and bcache->full_dirty_stripes are
used for dirty data writeback, their sizes are decided by backing device
capacity and stripe size. Larger backing device capacity or smaller
stripe size make these two arraies occupies more dynamic memory space.

Currently bcache->stripe_size is directly inherited from
queue->limits.io_opt of underlying storage device. For normal hard
drives, its limits.io_opt is 0, and bcache sets the corresponding
stripe_size to 1TB (1<<31 sectors), it works fine 10+ years. But for
devices do declare value for queue->limits.io_opt, small stripe_size
(comparing to 1TB) becomes an issue for oversize memory allocations of
bcache->stripe_sectors_dirty and bcache->full_dirty_stripes, while the
capacity of hard drives gets much larger in recent decade.

For example a raid5 array assembled by three 20TB hardrives, the raid
device capacity is 40TB with typical 512KB limits.io_opt. After the math
calculation in bcache code, these two arraies will occupy 400MB dynamic
memory. Even worse Andrea Tomassetti reports that a 4KB limits.io_opt is
declared on a new 2TB hard drive, then these two arraies request 2GB and
512MB dynamic memory from kzalloc(). The result is that bcache device
always fails to initialize on his system.

To avoid the oversize memory allocation, bcache->stripe_size should not
directly inherited by queue->limits.io_opt from the underlying device.
This patch defines BCH_MIN_STRIPE_SZ (4MB) as minimal bcache stripe size
and set bcache device's stripe size against the declared limits.io_opt
value from the underlying storage device,
- If the declared limits.io_opt > BCH_MIN_STRIPE_SZ, bcache device will
  set its stripe size directly by this limits.io_opt value.
- If the declared limits.io_opt < BCH_MIN_STRIPE_SZ, bcache device will
  set its stripe size by a value multiplying limits.io_opt and euqal or
  large than BCH_MIN_STRIPE_SZ.

Then the minimal stripe size of a bcache device will always be >= 4MB.
For a 40TB raid5 device with 512KB limits.io_opt, memory occupied by
bcache->stripe_sectors_dirty and bcache->full_dirty_stripes will be 50MB
in total. For a 2TB hard drive with 4KB limits.io_opt, memory occupied
by these two arraies will be 2.5MB in total.

Such mount of memory allocated for bcache->stripe_sectors_dirty and
bcache->full_dirty_stripes is reasonable for most of storage devices.

Reported-by: Andrea Tomassetti <andrea.tomassetti-opensource@devo.com>
Signed-off-by: Coly Li <colyli@suse.de>
Reviewed-by: Eric Wheeler <bcache@lists.ewheeler.net>
---
 drivers/md/bcache/bcache.h | 1 +
 drivers/md/bcache/super.c  | 2 ++
 2 files changed, 3 insertions(+)

diff --git a/drivers/md/bcache/bcache.h b/drivers/md/bcache/bcache.h
index 05be59ae21b2..6ae2329052c9 100644
--- a/drivers/md/bcache/bcache.h
+++ b/drivers/md/bcache/bcache.h
@@ -265,6 +265,7 @@ struct bcache_device {
 #define BCACHE_DEV_WB_RUNNING		3
 #define BCACHE_DEV_RATE_DW_RUNNING	4
 	int			nr_stripes;
+#define BCH_MIN_STRIPE_SZ		((4 << 20) >> SECTOR_SHIFT)
 	unsigned int		stripe_size;
 	atomic_t		*stripe_sectors_dirty;
 	unsigned long		*full_dirty_stripes;
diff --git a/drivers/md/bcache/super.c b/drivers/md/bcache/super.c
index 8bd899766372..c7ecc7058d77 100644
--- a/drivers/md/bcache/super.c
+++ b/drivers/md/bcache/super.c
@@ -905,6 +905,8 @@ static int bcache_device_init(struct bcache_device *d, unsigned int block_size,
 
 	if (!d->stripe_size)
 		d->stripe_size = 1 << 31;
+	else if (d->stripe_size < BCH_MIN_STRIPE_SZ)
+		d->stripe_size = roundup(BCH_MIN_STRIPE_SZ, d->stripe_size);
 
 	n = DIV_ROUND_UP_ULL(sectors, d->stripe_size);
 	if (!n || n > max_stripes) {
-- 
2.35.3


^ permalink raw reply related	[flat|nested] 12+ messages in thread

* [PATCH 02/10] bcache: check return value from btree_node_alloc_replacement()
  2023-11-20  5:24 [PATCH 00/10] bcache-next 20231120 Coly Li
  2023-11-20  5:24 ` [PATCH 01/10] bcache: avoid oversize memory allocation by small stripe_size Coly Li
@ 2023-11-20  5:24 ` Coly Li
  2023-11-20  5:24 ` [PATCH 03/10] bcache: remove redundant assignment to variable cur_idx Coly Li
                   ` (8 subsequent siblings)
  10 siblings, 0 replies; 12+ messages in thread
From: Coly Li @ 2023-11-20  5:24 UTC (permalink / raw)
  To: axboe; +Cc: linux-block, linux-bcache, Coly Li, Dan Carpenter, stable

In btree_gc_rewrite_node(), pointer 'n' is not checked after it returns
from btree_gc_rewrite_node(). There is potential possibility that 'n' is
a non NULL ERR_PTR(), referencing such error code is not permitted in
following code. Therefore a return value checking is necessary after 'n'
is back from btree_node_alloc_replacement().

Signed-off-by: Coly Li <colyli@suse.de>
Reported-by: Dan Carpenter <dan.carpenter@linaro.org>
Cc: stable@vger.kernel.org
---
 drivers/md/bcache/btree.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/drivers/md/bcache/btree.c b/drivers/md/bcache/btree.c
index ae5cbb55861f..de8d552201dc 100644
--- a/drivers/md/bcache/btree.c
+++ b/drivers/md/bcache/btree.c
@@ -1532,6 +1532,8 @@ static int btree_gc_rewrite_node(struct btree *b, struct btree_op *op,
 		return 0;
 
 	n = btree_node_alloc_replacement(replace, NULL);
+	if (IS_ERR(n))
+		return 0;
 
 	/* recheck reserve after allocating replacement node */
 	if (btree_check_reserve(b, NULL)) {
-- 
2.35.3


^ permalink raw reply related	[flat|nested] 12+ messages in thread

* [PATCH 03/10] bcache: remove redundant assignment to variable cur_idx
  2023-11-20  5:24 [PATCH 00/10] bcache-next 20231120 Coly Li
  2023-11-20  5:24 ` [PATCH 01/10] bcache: avoid oversize memory allocation by small stripe_size Coly Li
  2023-11-20  5:24 ` [PATCH 02/10] bcache: check return value from btree_node_alloc_replacement() Coly Li
@ 2023-11-20  5:24 ` Coly Li
  2023-11-20  5:24 ` [PATCH 04/10] bcache: prevent potential division by zero error Coly Li
                   ` (7 subsequent siblings)
  10 siblings, 0 replies; 12+ messages in thread
From: Coly Li @ 2023-11-20  5:24 UTC (permalink / raw)
  To: axboe; +Cc: linux-block, linux-bcache, Colin Ian King, Coly Li

From: Colin Ian King <colin.i.king@gmail.com>

Variable cur_idx is being initialized with a value that is never read,
it is being re-assigned later in a while-loop. Remove the redundant
assignment. Cleans up clang scan build warning:

drivers/md/bcache/writeback.c:916:2: warning: Value stored to 'cur_idx'
is never read [deadcode.DeadStores]

Signed-off-by: Colin Ian King <colin.i.king@gmail.com>
Reviewed-by: Coly Li <colyli@suse.de>
Signed-off-by: Coly Li <colyli@suse.de>
---
 drivers/md/bcache/writeback.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/md/bcache/writeback.c b/drivers/md/bcache/writeback.c
index 24c049067f61..c3e872e0a6f2 100644
--- a/drivers/md/bcache/writeback.c
+++ b/drivers/md/bcache/writeback.c
@@ -913,7 +913,7 @@ static int bch_dirty_init_thread(void *arg)
 	int cur_idx, prev_idx, skip_nr;
 
 	k = p = NULL;
-	cur_idx = prev_idx = 0;
+	prev_idx = 0;
 
 	bch_btree_iter_init(&c->root->keys, &iter, NULL);
 	k = bch_btree_iter_next_filter(&iter, &c->root->keys, bch_ptr_bad);
-- 
2.35.3


^ permalink raw reply related	[flat|nested] 12+ messages in thread

* [PATCH 04/10] bcache: prevent potential division by zero error
  2023-11-20  5:24 [PATCH 00/10] bcache-next 20231120 Coly Li
                   ` (2 preceding siblings ...)
  2023-11-20  5:24 ` [PATCH 03/10] bcache: remove redundant assignment to variable cur_idx Coly Li
@ 2023-11-20  5:24 ` Coly Li
  2023-11-20  5:24 ` [PATCH 05/10] bcache: fixup init dirty data errors Coly Li
                   ` (6 subsequent siblings)
  10 siblings, 0 replies; 12+ messages in thread
From: Coly Li @ 2023-11-20  5:24 UTC (permalink / raw)
  To: axboe; +Cc: linux-block, linux-bcache, Rand Deeb, stable, Coly Li

From: Rand Deeb <rand.sec96@gmail.com>

In SHOW(), the variable 'n' is of type 'size_t.' While there is a
conditional check to verify that 'n' is not equal to zero before
executing the 'do_div' macro, concerns arise regarding potential
division by zero error in 64-bit environments.

The concern arises when 'n' is 64 bits in size, greater than zero, and
the lower 32 bits of it are zeros. In such cases, the conditional check
passes because 'n' is non-zero, but the 'do_div' macro casts 'n' to
'uint32_t,' effectively truncating it to its lower 32 bits.
Consequently, the 'n' value becomes zero.

To fix this potential division by zero error and ensure precise
division handling, this commit replaces the 'do_div' macro with
div64_u64(). div64_u64() is designed to work with 64-bit operands,
guaranteeing that division is performed correctly.

This change enhances the robustness of the code, ensuring that division
operations yield accurate results in all scenarios, eliminating the
possibility of division by zero, and improving compatibility across
different 64-bit environments.

Found by Linux Verification Center (linuxtesting.org) with SVACE.

Signed-off-by: Rand Deeb <rand.sec96@gmail.com>
Cc: stable@vger.kernel.org
Signed-off-by: Coly Li <colyli@suse.de>
---
 drivers/md/bcache/sysfs.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/md/bcache/sysfs.c b/drivers/md/bcache/sysfs.c
index 45d8af755de6..a438efb66069 100644
--- a/drivers/md/bcache/sysfs.c
+++ b/drivers/md/bcache/sysfs.c
@@ -1104,7 +1104,7 @@ SHOW(__bch_cache)
 			sum += INITIAL_PRIO - cached[i];
 
 		if (n)
-			do_div(sum, n);
+			sum = div64_u64(sum, n);
 
 		for (i = 0; i < ARRAY_SIZE(q); i++)
 			q[i] = INITIAL_PRIO - cached[n * (i + 1) /
-- 
2.35.3


^ permalink raw reply related	[flat|nested] 12+ messages in thread

* [PATCH 05/10] bcache: fixup init dirty data errors
  2023-11-20  5:24 [PATCH 00/10] bcache-next 20231120 Coly Li
                   ` (3 preceding siblings ...)
  2023-11-20  5:24 ` [PATCH 04/10] bcache: prevent potential division by zero error Coly Li
@ 2023-11-20  5:24 ` Coly Li
  2023-11-20  5:24 ` [PATCH 06/10] bcache: fixup lock c->root error Coly Li
                   ` (5 subsequent siblings)
  10 siblings, 0 replies; 12+ messages in thread
From: Coly Li @ 2023-11-20  5:24 UTC (permalink / raw)
  To: axboe; +Cc: linux-block, linux-bcache, Mingzhe Zou, stable, Coly Li

From: Mingzhe Zou <mingzhe.zou@easystack.cn>

We found that after long run, the dirty_data of the bcache device
will have errors. This error cannot be eliminated unless re-register.

We also found that reattach after detach, this error can accumulate.

In bch_sectors_dirty_init(), all inode <= d->id keys will be recounted
again. This is wrong, we only need to count the keys of the current
device.

Fixes: b144e45fc576 ("bcache: make bch_sectors_dirty_init() to be multithreaded")
Signed-off-by: Mingzhe Zou <mingzhe.zou@easystack.cn>
Cc: stable@vger.kernel.org
Signed-off-by: Coly Li <colyli@suse.de>
---
 drivers/md/bcache/writeback.c | 5 ++++-
 1 file changed, 4 insertions(+), 1 deletion(-)

diff --git a/drivers/md/bcache/writeback.c b/drivers/md/bcache/writeback.c
index c3e872e0a6f2..77fb72ac6b81 100644
--- a/drivers/md/bcache/writeback.c
+++ b/drivers/md/bcache/writeback.c
@@ -991,8 +991,11 @@ void bch_sectors_dirty_init(struct bcache_device *d)
 		op.count = 0;
 
 		for_each_key_filter(&c->root->keys,
-				    k, &iter, bch_ptr_invalid)
+				    k, &iter, bch_ptr_invalid) {
+			if (KEY_INODE(k) != op.inode)
+				continue;
 			sectors_dirty_init_fn(&op.op, c->root, k);
+		}
 
 		rw_unlock(0, c->root);
 		return;
-- 
2.35.3


^ permalink raw reply related	[flat|nested] 12+ messages in thread

* [PATCH 06/10] bcache: fixup lock c->root error
  2023-11-20  5:24 [PATCH 00/10] bcache-next 20231120 Coly Li
                   ` (4 preceding siblings ...)
  2023-11-20  5:24 ` [PATCH 05/10] bcache: fixup init dirty data errors Coly Li
@ 2023-11-20  5:24 ` Coly Li
  2023-11-20  5:25 ` [PATCH 07/10] bcache: fixup multi-threaded bch_sectors_dirty_init() wake-up race Coly Li
                   ` (4 subsequent siblings)
  10 siblings, 0 replies; 12+ messages in thread
From: Coly Li @ 2023-11-20  5:24 UTC (permalink / raw)
  To: axboe; +Cc: linux-block, linux-bcache, Mingzhe Zou, stable, Coly Li

From: Mingzhe Zou <mingzhe.zou@easystack.cn>

We had a problem with io hung because it was waiting for c->root to
release the lock.

crash> cache_set.root -l cache_set.list ffffa03fde4c0050
  root = 0xffff802ef454c800
crash> btree -o 0xffff802ef454c800 | grep rw_semaphore
  [ffff802ef454c858] struct rw_semaphore lock;
crash> struct rw_semaphore ffff802ef454c858
struct rw_semaphore {
  count = {
    counter = -4294967297
  },
  wait_list = {
    next = 0xffff00006786fc28,
    prev = 0xffff00005d0efac8
  },
  wait_lock = {
    raw_lock = {
      {
        val = {
          counter = 0
        },
        {
          locked = 0 '\000',
          pending = 0 '\000'
        },
        {
          locked_pending = 0,
          tail = 0
        }
      }
    }
  },
  osq = {
    tail = {
      counter = 0
    }
  },
  owner = 0xffffa03fdc586603
}

The "counter = -4294967297" means that lock count is -1 and a write lock
is being attempted. Then, we found that there is a btree with a counter
of 1 in btree_cache_freeable.

crash> cache_set -l cache_set.list ffffa03fde4c0050 -o|grep btree_cache
  [ffffa03fde4c1140] struct list_head btree_cache;
  [ffffa03fde4c1150] struct list_head btree_cache_freeable;
  [ffffa03fde4c1160] struct list_head btree_cache_freed;
  [ffffa03fde4c1170] unsigned int btree_cache_used;
  [ffffa03fde4c1178] wait_queue_head_t btree_cache_wait;
  [ffffa03fde4c1190] struct task_struct *btree_cache_alloc_lock;
crash> list -H ffffa03fde4c1140|wc -l
973
crash> list -H ffffa03fde4c1150|wc -l
1123
crash> cache_set.btree_cache_used -l cache_set.list ffffa03fde4c0050
  btree_cache_used = 2097
crash> list -s btree -l btree.list -H ffffa03fde4c1140|grep -E -A2 "^  lock = {" > btree_cache.txt
crash> list -s btree -l btree.list -H ffffa03fde4c1150|grep -E -A2 "^  lock = {" > btree_cache_freeable.txt
[root@node-3 127.0.0.1-2023-08-04-16:40:28]# pwd
/var/crash/127.0.0.1-2023-08-04-16:40:28
[root@node-3 127.0.0.1-2023-08-04-16:40:28]# cat btree_cache.txt|grep counter|grep -v "counter = 0"
[root@node-3 127.0.0.1-2023-08-04-16:40:28]# cat btree_cache_freeable.txt|grep counter|grep -v "counter = 0"
      counter = 1

We found that this is a bug in bch_sectors_dirty_init() when locking c->root:
    (1). Thread X has locked c->root(A) write.
    (2). Thread Y failed to lock c->root(A), waiting for the lock(c->root A).
    (3). Thread X bch_btree_set_root() changes c->root from A to B.
    (4). Thread X releases the lock(c->root A).
    (5). Thread Y successfully locks c->root(A).
    (6). Thread Y releases the lock(c->root B).

        down_write locked ---(1)----------------------┐
                |                                     |
                |   down_read waiting ---(2)----┐     |
                |           |               ┌-------------┐ ┌-------------┐
        bch_btree_set_root ===(3)========>> | c->root   A | | c->root   B |
                |           |               └-------------┘ └-------------┘
            up_write ---(4)---------------------┘     |            |
                            |                         |            |
                    down_read locked ---(5)-----------┘            |
                            |                                      |
                        up_read ---(6)-----------------------------┘

Since c->root may change, the correct steps to lock c->root should be
the same as bch_root_usage(), compare after locking.

static unsigned int bch_root_usage(struct cache_set *c)
{
        unsigned int bytes = 0;
        struct bkey *k;
        struct btree *b;
        struct btree_iter iter;

        goto lock_root;

        do {
                rw_unlock(false, b);
lock_root:
                b = c->root;
                rw_lock(false, b, b->level);
        } while (b != c->root);

        for_each_key_filter(&b->keys, k, &iter, bch_ptr_bad)
                bytes += bkey_bytes(k);

        rw_unlock(false, b);

        return (bytes * 100) / btree_bytes(c);
}

Fixes: b144e45fc576 ("bcache: make bch_sectors_dirty_init() to be multithreaded")
Signed-off-by: Mingzhe Zou <mingzhe.zou@easystack.cn>
Cc: stable@vger.kernel.org
Signed-off-by: Coly Li <colyli@suse.de>
---
 drivers/md/bcache/writeback.c | 14 +++++++++++---
 1 file changed, 11 insertions(+), 3 deletions(-)

diff --git a/drivers/md/bcache/writeback.c b/drivers/md/bcache/writeback.c
index 77fb72ac6b81..a1d760916246 100644
--- a/drivers/md/bcache/writeback.c
+++ b/drivers/md/bcache/writeback.c
@@ -977,14 +977,22 @@ static int bch_btre_dirty_init_thread_nr(void)
 void bch_sectors_dirty_init(struct bcache_device *d)
 {
 	int i;
+	struct btree *b = NULL;
 	struct bkey *k = NULL;
 	struct btree_iter iter;
 	struct sectors_dirty_init op;
 	struct cache_set *c = d->c;
 	struct bch_dirty_init_state state;
 
+retry_lock:
+	b = c->root;
+	rw_lock(0, b, b->level);
+	if (b != c->root) {
+		rw_unlock(0, b);
+		goto retry_lock;
+	}
+
 	/* Just count root keys if no leaf node */
-	rw_lock(0, c->root, c->root->level);
 	if (c->root->level == 0) {
 		bch_btree_op_init(&op.op, -1);
 		op.inode = d->id;
@@ -997,7 +1005,7 @@ void bch_sectors_dirty_init(struct bcache_device *d)
 			sectors_dirty_init_fn(&op.op, c->root, k);
 		}
 
-		rw_unlock(0, c->root);
+		rw_unlock(0, b);
 		return;
 	}
 
@@ -1033,7 +1041,7 @@ void bch_sectors_dirty_init(struct bcache_device *d)
 out:
 	/* Must wait for all threads to stop. */
 	wait_event(state.wait, atomic_read(&state.started) == 0);
-	rw_unlock(0, c->root);
+	rw_unlock(0, b);
 }
 
 void bch_cached_dev_writeback_init(struct cached_dev *dc)
-- 
2.35.3


^ permalink raw reply related	[flat|nested] 12+ messages in thread

* [PATCH 07/10] bcache: fixup multi-threaded bch_sectors_dirty_init() wake-up race
  2023-11-20  5:24 [PATCH 00/10] bcache-next 20231120 Coly Li
                   ` (5 preceding siblings ...)
  2023-11-20  5:24 ` [PATCH 06/10] bcache: fixup lock c->root error Coly Li
@ 2023-11-20  5:25 ` Coly Li
  2023-11-20  5:25 ` [PATCH 08/10] bcache: replace a mistaken IS_ERR() by IS_ERR_OR_NULL() in btree_gc_coalesce() Coly Li
                   ` (3 subsequent siblings)
  10 siblings, 0 replies; 12+ messages in thread
From: Coly Li @ 2023-11-20  5:25 UTC (permalink / raw)
  To: axboe; +Cc: linux-block, linux-bcache, Mingzhe Zou, stable, Coly Li

From: Mingzhe Zou <mingzhe.zou@easystack.cn>

We get a kernel crash about "unable to handle kernel paging request":

```dmesg
[368033.032005] BUG: unable to handle kernel paging request at ffffffffad9ae4b5
[368033.032007] PGD fc3a0d067 P4D fc3a0d067 PUD fc3a0e063 PMD 8000000fc38000e1
[368033.032012] Oops: 0003 [#1] SMP PTI
[368033.032015] CPU: 23 PID: 55090 Comm: bch_dirtcnt[0] Kdump: loaded Tainted: G           OE    --------- -  - 4.18.0-147.5.1.es8_24.x86_64 #1
[368033.032017] Hardware name: Tsinghua Tongfang THTF Chaoqiang Server/072T6D, BIOS 2.4.3 01/17/2017
[368033.032027] RIP: 0010:native_queued_spin_lock_slowpath+0x183/0x1d0
[368033.032029] Code: 8b 02 48 85 c0 74 f6 48 89 c1 eb d0 c1 e9 12 83 e0
03 83 e9 01 48 c1 e0 05 48 63 c9 48 05 c0 3d 02 00 48 03 04 cd 60 68 93
ad <48> 89 10 8b 42 08 85 c0 75 09 f3 90 8b 42 08 85 c0 74 f7 48 8b 02
[368033.032031] RSP: 0018:ffffbb48852abe00 EFLAGS: 00010082
[368033.032032] RAX: ffffffffad9ae4b5 RBX: 0000000000000246 RCX: 0000000000003bf3
[368033.032033] RDX: ffff97b0ff8e3dc0 RSI: 0000000000600000 RDI: ffffbb4884743c68
[368033.032034] RBP: 0000000000000001 R08: 0000000000000000 R09: 000007ffffffffff
[368033.032035] R10: ffffbb486bb01000 R11: 0000000000000001 R12: ffffffffc068da70
[368033.032036] R13: 0000000000000003 R14: 0000000000000000 R15: 0000000000000000
[368033.032038] FS:  0000000000000000(0000) GS:ffff97b0ff8c0000(0000) knlGS:0000000000000000
[368033.032039] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[368033.032040] CR2: ffffffffad9ae4b5 CR3: 0000000fc3a0a002 CR4: 00000000003626e0
[368033.032042] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[368033.032043] bcache: bch_cached_dev_attach() Caching rbd479 as bcache462 on set 8cff3c36-4a76-4242-afaa-7630206bc70b
[368033.032045] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[368033.032046] Call Trace:
[368033.032054]  _raw_spin_lock_irqsave+0x32/0x40
[368033.032061]  __wake_up_common_lock+0x63/0xc0
[368033.032073]  ? bch_ptr_invalid+0x10/0x10 [bcache]
[368033.033502]  bch_dirty_init_thread+0x14c/0x160 [bcache]
[368033.033511]  ? read_dirty_submit+0x60/0x60 [bcache]
[368033.033516]  kthread+0x112/0x130
[368033.033520]  ? kthread_flush_work_fn+0x10/0x10
[368033.034505]  ret_from_fork+0x35/0x40
```

The crash occurred when call wake_up(&state->wait), and then we want
to look at the value in the state. However, bch_sectors_dirty_init()
is not found in the stack of any task. Since state is allocated on
the stack, we guess that bch_sectors_dirty_init() has exited, causing
bch_dirty_init_thread() to be unable to handle kernel paging request.

In order to verify this idea, we added some printing information during
wake_up(&state->wait). We find that "wake up" is printed twice, however
we only expect the last thread to wake up once.

```dmesg
[  994.641004] alcache: bch_dirty_init_thread() wake up
[  994.641018] alcache: bch_dirty_init_thread() wake up
[  994.641523] alcache: bch_sectors_dirty_init() init exit
```

There is a race. If bch_sectors_dirty_init() exits after the first wake
up, the second wake up will trigger this bug("unable to handle kernel
paging request").

Proceed as follows:

bch_sectors_dirty_init
    kthread_run ==============> bch_dirty_init_thread(bch_dirtcnt[0])
            ...                         ...
    atomic_inc(&state.started)          ...
            ...                         ...
    atomic_read(&state.enough)          ...
            ...                 atomic_set(&state->enough, 1)
    kthread_run ======================================================> bch_dirty_init_thread(bch_dirtcnt[1])
            ...                 atomic_dec_and_test(&state->started)            ...
    atomic_inc(&state.started)          ...                                     ...
            ...                 wake_up(&state->wait)                           ...
    atomic_read(&state.enough)                                          atomic_dec_and_test(&state->started)
            ...                                                                 ...
    wait_event(state.wait, atomic_read(&state.started) == 0)                    ...
    return                                                                      ...
                                                                        wake_up(&state->wait)

We believe it is very common to wake up twice if there is no dirty, but
crash is an extremely low probability event. It's hard for us to reproduce
this issue. We attached and detached continuously for a week, with a total
of more than one million attaches and only one crash.

Putting atomic_inc(&state.started) before kthread_run() can avoid waking
up twice.

Fixes: b144e45fc576 ("bcache: make bch_sectors_dirty_init() to be multithreaded")
Signed-off-by: Mingzhe Zou <mingzhe.zou@easystack.cn>
Cc: stable@vger.kernel.org
Signed-off-by: Coly Li <colyli@suse.de>
---
 drivers/md/bcache/writeback.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/drivers/md/bcache/writeback.c b/drivers/md/bcache/writeback.c
index a1d760916246..3accfdaee6b1 100644
--- a/drivers/md/bcache/writeback.c
+++ b/drivers/md/bcache/writeback.c
@@ -1025,17 +1025,18 @@ void bch_sectors_dirty_init(struct bcache_device *d)
 		if (atomic_read(&state.enough))
 			break;
 
+		atomic_inc(&state.started);
 		state.infos[i].state = &state;
 		state.infos[i].thread =
 			kthread_run(bch_dirty_init_thread, &state.infos[i],
 				    "bch_dirtcnt[%d]", i);
 		if (IS_ERR(state.infos[i].thread)) {
 			pr_err("fails to run thread bch_dirty_init[%d]\n", i);
+			atomic_dec(&state.started);
 			for (--i; i >= 0; i--)
 				kthread_stop(state.infos[i].thread);
 			goto out;
 		}
-		atomic_inc(&state.started);
 	}
 
 out:
-- 
2.35.3


^ permalink raw reply related	[flat|nested] 12+ messages in thread

* [PATCH 08/10] bcache: replace a mistaken IS_ERR() by IS_ERR_OR_NULL() in btree_gc_coalesce()
  2023-11-20  5:24 [PATCH 00/10] bcache-next 20231120 Coly Li
                   ` (6 preceding siblings ...)
  2023-11-20  5:25 ` [PATCH 07/10] bcache: fixup multi-threaded bch_sectors_dirty_init() wake-up race Coly Li
@ 2023-11-20  5:25 ` Coly Li
  2023-11-20  5:25 ` [PATCH 09/10] bcache: add code comments for bch_btree_node_get() and __bch_btree_node_alloc() Coly Li
                   ` (2 subsequent siblings)
  10 siblings, 0 replies; 12+ messages in thread
From: Coly Li @ 2023-11-20  5:25 UTC (permalink / raw)
  To: axboe; +Cc: linux-block, linux-bcache, Coly Li, stable, Zheng Wang

Commit 028ddcac477b ("bcache: Remove unnecessary NULL point check in
node allocations") do the following change inside btree_gc_coalesce(),

31 @@ -1340,7 +1340,7 @@ static int btree_gc_coalesce(
32         memset(new_nodes, 0, sizeof(new_nodes));
33         closure_init_stack(&cl);
34
35 -       while (nodes < GC_MERGE_NODES && !IS_ERR_OR_NULL(r[nodes].b))
36 +       while (nodes < GC_MERGE_NODES && !IS_ERR(r[nodes].b))
37                 keys += r[nodes++].keys;
38
39         blocks = btree_default_blocks(b->c) * 2 / 3;

At line 35 the original r[nodes].b is not always allocatored from
__bch_btree_node_alloc(), and possibly initialized as NULL pointer by
caller of btree_gc_coalesce(). Therefore the change at line 36 is not
correct.

This patch replaces the mistaken IS_ERR() by IS_ERR_OR_NULL() to avoid
potential issue.

Fixes: 028ddcac477b ("bcache: Remove unnecessary NULL point check in node allocations")
Cc: stable@vger.kernel.org # 6.5+
Cc: Zheng Wang <zyytlz.wz@163.com>
Signed-off-by: Coly Li <colyli@suse.de>
---
 drivers/md/bcache/btree.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/md/bcache/btree.c b/drivers/md/bcache/btree.c
index de8d552201dc..79f1fa4a0d55 100644
--- a/drivers/md/bcache/btree.c
+++ b/drivers/md/bcache/btree.c
@@ -1368,7 +1368,7 @@ static int btree_gc_coalesce(struct btree *b, struct btree_op *op,
 	memset(new_nodes, 0, sizeof(new_nodes));
 	closure_init_stack(&cl);
 
-	while (nodes < GC_MERGE_NODES && !IS_ERR(r[nodes].b))
+	while (nodes < GC_MERGE_NODES && !IS_ERR_OR_NULL(r[nodes].b))
 		keys += r[nodes++].keys;
 
 	blocks = btree_default_blocks(b->c) * 2 / 3;
-- 
2.35.3


^ permalink raw reply related	[flat|nested] 12+ messages in thread

* [PATCH 09/10] bcache: add code comments for bch_btree_node_get() and __bch_btree_node_alloc()
  2023-11-20  5:24 [PATCH 00/10] bcache-next 20231120 Coly Li
                   ` (7 preceding siblings ...)
  2023-11-20  5:25 ` [PATCH 08/10] bcache: replace a mistaken IS_ERR() by IS_ERR_OR_NULL() in btree_gc_coalesce() Coly Li
@ 2023-11-20  5:25 ` Coly Li
  2023-11-20  5:25 ` [PATCH 10/10] bcache: avoid NULL checking to c->root in run_cache_set() Coly Li
  2023-11-20 16:18 ` [PATCH 00/10] bcache-next 20231120 Jens Axboe
  10 siblings, 0 replies; 12+ messages in thread
From: Coly Li @ 2023-11-20  5:25 UTC (permalink / raw)
  To: axboe; +Cc: linux-block, linux-bcache, Coly Li

This patch adds code comments to bch_btree_node_get() and
__bch_btree_node_alloc() that NULL pointer will not be returned and it
is unnecessary to check NULL pointer by the callers of these routines.

Signed-off-by: Coly Li <colyli@suse.de>
---
 drivers/md/bcache/btree.c | 7 +++++++
 1 file changed, 7 insertions(+)

diff --git a/drivers/md/bcache/btree.c b/drivers/md/bcache/btree.c
index 79f1fa4a0d55..de3019972b35 100644
--- a/drivers/md/bcache/btree.c
+++ b/drivers/md/bcache/btree.c
@@ -1000,6 +1000,9 @@ static struct btree *mca_alloc(struct cache_set *c, struct btree_op *op,
  *
  * The btree node will have either a read or a write lock held, depending on
  * level and op->lock.
+ *
+ * Note: Only error code or btree pointer will be returned, it is unncessary
+ *       for callers to check NULL pointer.
  */
 struct btree *bch_btree_node_get(struct cache_set *c, struct btree_op *op,
 				 struct bkey *k, int level, bool write,
@@ -1111,6 +1114,10 @@ static void btree_node_free(struct btree *b)
 	mutex_unlock(&b->c->bucket_lock);
 }
 
+/*
+ * Only error code or btree pointer will be returned, it is unncessary for
+ * callers to check NULL pointer.
+ */
 struct btree *__bch_btree_node_alloc(struct cache_set *c, struct btree_op *op,
 				     int level, bool wait,
 				     struct btree *parent)
-- 
2.35.3


^ permalink raw reply related	[flat|nested] 12+ messages in thread

* [PATCH 10/10] bcache: avoid NULL checking to c->root in run_cache_set()
  2023-11-20  5:24 [PATCH 00/10] bcache-next 20231120 Coly Li
                   ` (8 preceding siblings ...)
  2023-11-20  5:25 ` [PATCH 09/10] bcache: add code comments for bch_btree_node_get() and __bch_btree_node_alloc() Coly Li
@ 2023-11-20  5:25 ` Coly Li
  2023-11-20 16:18 ` [PATCH 00/10] bcache-next 20231120 Jens Axboe
  10 siblings, 0 replies; 12+ messages in thread
From: Coly Li @ 2023-11-20  5:25 UTC (permalink / raw)
  To: axboe; +Cc: linux-block, linux-bcache, Coly Li

In run_cache_set() after c->root returned from bch_btree_node_get(), it
is checked by IS_ERR_OR_NULL(). Indeed it is unncessary to check NULL
because bch_btree_node_get() will not return NULL pointer to caller.

This patch replaces IS_ERR_OR_NULL() by IS_ERR() for the above reason.

Signed-off-by: Coly Li <colyli@suse.de>
---
 drivers/md/bcache/super.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/md/bcache/super.c b/drivers/md/bcache/super.c
index c7ecc7058d77..bfe1685dbae5 100644
--- a/drivers/md/bcache/super.c
+++ b/drivers/md/bcache/super.c
@@ -2018,7 +2018,7 @@ static int run_cache_set(struct cache_set *c)
 		c->root = bch_btree_node_get(c, NULL, k,
 					     j->btree_level,
 					     true, NULL);
-		if (IS_ERR_OR_NULL(c->root))
+		if (IS_ERR(c->root))
 			goto err;
 
 		list_del_init(&c->root->list);
-- 
2.35.3


^ permalink raw reply related	[flat|nested] 12+ messages in thread

* Re: [PATCH 00/10] bcache-next 20231120
  2023-11-20  5:24 [PATCH 00/10] bcache-next 20231120 Coly Li
                   ` (9 preceding siblings ...)
  2023-11-20  5:25 ` [PATCH 10/10] bcache: avoid NULL checking to c->root in run_cache_set() Coly Li
@ 2023-11-20 16:18 ` Jens Axboe
  10 siblings, 0 replies; 12+ messages in thread
From: Jens Axboe @ 2023-11-20 16:18 UTC (permalink / raw)
  To: Coly Li; +Cc: linux-block, linux-bcache


On Mon, 20 Nov 2023 13:24:53 +0800, Coly Li wrote:
> Here are recent bcache patches tested by me on Linux v6.7-rc1 and can be
> applied on branch block-6.7 of linux-block tree.
> 
> In this series, Colin provids useful code cleanup, some small and still
> important fixes are contributed from Mingzhe, Rand. Also there are some
> fixes and a code comments patch from me.
> 
> [...]

Applied, thanks!

[01/10] bcache: avoid oversize memory allocation by small stripe_size
        commit: baf8fb7e0e5ec54ea0839f0c534f2cdcd79bea9c
[02/10] bcache: check return value from btree_node_alloc_replacement()
        commit: 777967e7e9f6f5f3e153abffb562bffaf4430d26
[03/10] bcache: remove redundant assignment to variable cur_idx
        commit: be93825f0e6428c2d3f03a6e4d447dc48d33d7ff
[04/10] bcache: prevent potential division by zero error
        commit: 2c7f497ac274a14330208b18f6f734000868ebf9
[05/10] bcache: fixup init dirty data errors
        commit: 7cc47e64d3d69786a2711a4767e26b26ba63d7ed
[06/10] bcache: fixup lock c->root error
        commit: e34820f984512b433ee1fc291417e60c47d56727
[07/10] bcache: fixup multi-threaded bch_sectors_dirty_init() wake-up race
        commit: 2faac25d7958c4761bb8cec54adb79f806783ad6
[08/10] bcache: replace a mistaken IS_ERR() by IS_ERR_OR_NULL() in btree_gc_coalesce()
        commit: f72f4312d4388376fc8a1f6cf37cb21a0d41758b
[09/10] bcache: add code comments for bch_btree_node_get() and __bch_btree_node_alloc()
        commit: 31f5b956a197d4ec25c8a07cb3a2ab69d0c0b82f
[10/10] bcache: avoid NULL checking to c->root in run_cache_set()
        commit: 3eba5e0b2422aec3c9e79822029599961fdcab97

Best regards,
-- 
Jens Axboe




^ permalink raw reply	[flat|nested] 12+ messages in thread

end of thread, other threads:[~2023-11-20 16:18 UTC | newest]

Thread overview: 12+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2023-11-20  5:24 [PATCH 00/10] bcache-next 20231120 Coly Li
2023-11-20  5:24 ` [PATCH 01/10] bcache: avoid oversize memory allocation by small stripe_size Coly Li
2023-11-20  5:24 ` [PATCH 02/10] bcache: check return value from btree_node_alloc_replacement() Coly Li
2023-11-20  5:24 ` [PATCH 03/10] bcache: remove redundant assignment to variable cur_idx Coly Li
2023-11-20  5:24 ` [PATCH 04/10] bcache: prevent potential division by zero error Coly Li
2023-11-20  5:24 ` [PATCH 05/10] bcache: fixup init dirty data errors Coly Li
2023-11-20  5:24 ` [PATCH 06/10] bcache: fixup lock c->root error Coly Li
2023-11-20  5:25 ` [PATCH 07/10] bcache: fixup multi-threaded bch_sectors_dirty_init() wake-up race Coly Li
2023-11-20  5:25 ` [PATCH 08/10] bcache: replace a mistaken IS_ERR() by IS_ERR_OR_NULL() in btree_gc_coalesce() Coly Li
2023-11-20  5:25 ` [PATCH 09/10] bcache: add code comments for bch_btree_node_get() and __bch_btree_node_alloc() Coly Li
2023-11-20  5:25 ` [PATCH 10/10] bcache: avoid NULL checking to c->root in run_cache_set() Coly Li
2023-11-20 16:18 ` [PATCH 00/10] bcache-next 20231120 Jens Axboe

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).