linux-bcache.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* [PATCH 0/2] bcache fixes for Linux v5.13-rc6
@ 2021-06-07 10:35 Coly Li
  2021-06-07 10:35 ` [PATCH v5 1/2] bcache: remove bcache device self-defined readahead Coly Li
  2021-06-07 10:35 ` [PATCH v5 2/2] bcache: avoid oversized read request in cache missing code path Coly Li
  0 siblings, 2 replies; 9+ messages in thread
From: Coly Li @ 2021-06-07 10:35 UTC (permalink / raw)
  To: axboe; +Cc: linux-bcache, linux-block, linux-kernel, Coly Li

Hi Jens,

This series is important for recent reported bcache panic partially
triggered by recent bio changes since Linux v5.12.

Current fix is 5th version since the first effort, it might not be
perfect yet, but it survives from different workloads from Rolf,
Thorsten and me for more than 1 week in total.

Considering many people are waiting for a stable enough fix and it is
kind of such fix. Please take them for Linux v5.13-rc6.

Thank you in advance for taking care of them.

Coly Li
---

Coly Li (2):
  bcache: remove bcache device self-defined readahead
  bcache: avoid oversized read request in cache missing code path

 drivers/md/bcache/bcache.h  |  1 -
 drivers/md/bcache/request.c | 19 ++++++-------------
 drivers/md/bcache/stats.c   | 14 --------------
 drivers/md/bcache/stats.h   |  1 -
 drivers/md/bcache/sysfs.c   |  4 ----
 5 files changed, 6 insertions(+), 33 deletions(-)

-- 
2.26.2


^ permalink raw reply	[flat|nested] 9+ messages in thread

* [PATCH v5 1/2] bcache: remove bcache device self-defined readahead
  2021-06-07 10:35 [PATCH 0/2] bcache fixes for Linux v5.13-rc6 Coly Li
@ 2021-06-07 10:35 ` Coly Li
  2021-06-07 11:00   ` Christoph Hellwig
  2021-06-07 10:35 ` [PATCH v5 2/2] bcache: avoid oversized read request in cache missing code path Coly Li
  1 sibling, 1 reply; 9+ messages in thread
From: Coly Li @ 2021-06-07 10:35 UTC (permalink / raw)
  To: axboe
  Cc: linux-bcache, linux-block, linux-kernel, Coly Li,
	Alexander Ullrich, Diego Ercolani, Jan Szubiak, Marco Rebhan,
	Matthias Ferdinand, Victor Westerhuis, Vojtech Pavlik,
	Rolf Fokkens, Thorsten Knabe, stable, Christoph Hellwig,
	Kent Overstreet, Nix, Takashi Iwai

For read cache missing, bcache defines a readahead size for the read I/O
request to the backing device for the missing data. This readahead size
is initialized to 0, and almost no one uses it to avoid unnecessary read
amplifying onto backing device and write amplifying onto cache device.
Considering upper layer file system code has readahead logic allready
and works fine with readahead_cache_policy sysfile interface, we don't
have to keep bcache self-defined readahead anymore.

This patch removes the bcache self-defined readahead for cache missing
request for backing device, and the readahead sysfs file interfaces are
removed as well.

This is the preparation for next patch to fix potential kernel panic due
to oversized request in a simpler method.

Reported-by: Alexander Ullrich <ealex1979@gmail.com>
Reported-by: Diego Ercolani <diego.ercolani@gmail.com>
Reported-by: Jan Szubiak <jan.szubiak@linuxpolska.pl>
Reported-by: Marco Rebhan <me@dblsaiko.net>
Reported-by: Matthias Ferdinand <bcache@mfedv.net>
Reported-by: Victor Westerhuis <victor@westerhu.is>
Reported-by: Vojtech Pavlik <vojtech@suse.cz>
Signed-off-by: Coly Li <colyli@suse.de>
Reported-and-tested-by: Rolf Fokkens <rolf@rolffokkens.nl>
Reported-and-tested-by: Thorsten Knabe <linux@thorsten-knabe.de>
Cc: stable@vger.kernel.org
Cc: Christoph Hellwig <hch@lst.de>
Cc: Kent Overstreet <kent.overstreet@gmail.com>
Cc: Nix <nix@esperi.org.uk>
Cc: Takashi Iwai <tiwai@suse.com>
---
Changlog,
v1, the initial version by hint from  Christoph Hellwig.

 drivers/md/bcache/bcache.h  |  1 -
 drivers/md/bcache/request.c | 13 +------------
 drivers/md/bcache/stats.c   | 14 --------------
 drivers/md/bcache/stats.h   |  1 -
 drivers/md/bcache/sysfs.c   |  4 ----
 5 files changed, 1 insertion(+), 32 deletions(-)

diff --git a/drivers/md/bcache/bcache.h b/drivers/md/bcache/bcache.h
index 0a4551e165ab..5fc989a6d452 100644
--- a/drivers/md/bcache/bcache.h
+++ b/drivers/md/bcache/bcache.h
@@ -364,7 +364,6 @@ struct cached_dev {
 
 	/* The rest of this all shows up in sysfs */
 	unsigned int		sequential_cutoff;
-	unsigned int		readahead;
 
 	unsigned int		io_disable:1;
 	unsigned int		verify:1;
diff --git a/drivers/md/bcache/request.c b/drivers/md/bcache/request.c
index 29c231758293..ab8ff18df32a 100644
--- a/drivers/md/bcache/request.c
+++ b/drivers/md/bcache/request.c
@@ -880,7 +880,6 @@ static int cached_dev_cache_miss(struct btree *b, struct search *s,
 				 struct bio *bio, unsigned int sectors)
 {
 	int ret = MAP_CONTINUE;
-	unsigned int reada = 0;
 	struct cached_dev *dc = container_of(s->d, struct cached_dev, disk);
 	struct bio *miss, *cache_bio;
 
@@ -892,14 +891,7 @@ static int cached_dev_cache_miss(struct btree *b, struct search *s,
 		goto out_submit;
 	}
 
-	if (!(bio->bi_opf & REQ_RAHEAD) &&
-	    !(bio->bi_opf & (REQ_META|REQ_PRIO)) &&
-	    s->iop.c->gc_stats.in_use < CUTOFF_CACHE_READA)
-		reada = min_t(sector_t, dc->readahead >> 9,
-			      get_capacity(bio->bi_bdev->bd_disk) -
-			      bio_end_sector(bio));
-
-	s->insert_bio_sectors = min(sectors, bio_sectors(bio) + reada);
+	s->insert_bio_sectors = min(sectors, bio_sectors(bio));
 
 	s->iop.replace_key = KEY(s->iop.inode,
 				 bio->bi_iter.bi_sector + s->insert_bio_sectors,
@@ -933,9 +925,6 @@ static int cached_dev_cache_miss(struct btree *b, struct search *s,
 	if (bch_bio_alloc_pages(cache_bio, __GFP_NOWARN|GFP_NOIO))
 		goto out_put;
 
-	if (reada)
-		bch_mark_cache_readahead(s->iop.c, s->d);
-
 	s->cache_miss	= miss;
 	s->iop.bio	= cache_bio;
 	bio_get(cache_bio);
diff --git a/drivers/md/bcache/stats.c b/drivers/md/bcache/stats.c
index 503aafe188dc..4c7ee5fedb9d 100644
--- a/drivers/md/bcache/stats.c
+++ b/drivers/md/bcache/stats.c
@@ -46,7 +46,6 @@ read_attribute(cache_misses);
 read_attribute(cache_bypass_hits);
 read_attribute(cache_bypass_misses);
 read_attribute(cache_hit_ratio);
-read_attribute(cache_readaheads);
 read_attribute(cache_miss_collisions);
 read_attribute(bypassed);
 
@@ -64,7 +63,6 @@ SHOW(bch_stats)
 		    DIV_SAFE(var(cache_hits) * 100,
 			     var(cache_hits) + var(cache_misses)));
 
-	var_print(cache_readaheads);
 	var_print(cache_miss_collisions);
 	sysfs_hprint(bypassed,	var(sectors_bypassed) << 9);
 #undef var
@@ -86,7 +84,6 @@ static struct attribute *bch_stats_files[] = {
 	&sysfs_cache_bypass_hits,
 	&sysfs_cache_bypass_misses,
 	&sysfs_cache_hit_ratio,
-	&sysfs_cache_readaheads,
 	&sysfs_cache_miss_collisions,
 	&sysfs_bypassed,
 	NULL
@@ -113,7 +110,6 @@ void bch_cache_accounting_clear(struct cache_accounting *acc)
 	acc->total.cache_misses = 0;
 	acc->total.cache_bypass_hits = 0;
 	acc->total.cache_bypass_misses = 0;
-	acc->total.cache_readaheads = 0;
 	acc->total.cache_miss_collisions = 0;
 	acc->total.sectors_bypassed = 0;
 }
@@ -145,7 +141,6 @@ static void scale_stats(struct cache_stats *stats, unsigned long rescale_at)
 		scale_stat(&stats->cache_misses);
 		scale_stat(&stats->cache_bypass_hits);
 		scale_stat(&stats->cache_bypass_misses);
-		scale_stat(&stats->cache_readaheads);
 		scale_stat(&stats->cache_miss_collisions);
 		scale_stat(&stats->sectors_bypassed);
 	}
@@ -168,7 +163,6 @@ static void scale_accounting(struct timer_list *t)
 	move_stat(cache_misses);
 	move_stat(cache_bypass_hits);
 	move_stat(cache_bypass_misses);
-	move_stat(cache_readaheads);
 	move_stat(cache_miss_collisions);
 	move_stat(sectors_bypassed);
 
@@ -209,14 +203,6 @@ void bch_mark_cache_accounting(struct cache_set *c, struct bcache_device *d,
 	mark_cache_stats(&c->accounting.collector, hit, bypass);
 }
 
-void bch_mark_cache_readahead(struct cache_set *c, struct bcache_device *d)
-{
-	struct cached_dev *dc = container_of(d, struct cached_dev, disk);
-
-	atomic_inc(&dc->accounting.collector.cache_readaheads);
-	atomic_inc(&c->accounting.collector.cache_readaheads);
-}
-
 void bch_mark_cache_miss_collision(struct cache_set *c, struct bcache_device *d)
 {
 	struct cached_dev *dc = container_of(d, struct cached_dev, disk);
diff --git a/drivers/md/bcache/stats.h b/drivers/md/bcache/stats.h
index abfaabf7e7fc..ca4f435f7216 100644
--- a/drivers/md/bcache/stats.h
+++ b/drivers/md/bcache/stats.h
@@ -7,7 +7,6 @@ struct cache_stat_collector {
 	atomic_t cache_misses;
 	atomic_t cache_bypass_hits;
 	atomic_t cache_bypass_misses;
-	atomic_t cache_readaheads;
 	atomic_t cache_miss_collisions;
 	atomic_t sectors_bypassed;
 };
diff --git a/drivers/md/bcache/sysfs.c b/drivers/md/bcache/sysfs.c
index cc89f3156d1a..05ac1d6fbbf3 100644
--- a/drivers/md/bcache/sysfs.c
+++ b/drivers/md/bcache/sysfs.c
@@ -137,7 +137,6 @@ rw_attribute(io_disable);
 rw_attribute(discard);
 rw_attribute(running);
 rw_attribute(label);
-rw_attribute(readahead);
 rw_attribute(errors);
 rw_attribute(io_error_limit);
 rw_attribute(io_error_halflife);
@@ -260,7 +259,6 @@ SHOW(__bch_cached_dev)
 	var_printf(partial_stripes_expensive,	"%u");
 
 	var_hprint(sequential_cutoff);
-	var_hprint(readahead);
 
 	sysfs_print(running,		atomic_read(&dc->running));
 	sysfs_print(state,		states[BDEV_STATE(&dc->sb)]);
@@ -365,7 +363,6 @@ STORE(__cached_dev)
 	sysfs_strtoul_clamp(sequential_cutoff,
 			    dc->sequential_cutoff,
 			    0, UINT_MAX);
-	d_strtoi_h(readahead);
 
 	if (attr == &sysfs_clear_stats)
 		bch_cache_accounting_clear(&dc->accounting);
@@ -538,7 +535,6 @@ static struct attribute *bch_cached_dev_files[] = {
 	&sysfs_running,
 	&sysfs_state,
 	&sysfs_label,
-	&sysfs_readahead,
 #ifdef CONFIG_BCACHE_DEBUG
 	&sysfs_verify,
 	&sysfs_bypass_torture_test,
-- 
2.26.2


^ permalink raw reply	[flat|nested] 9+ messages in thread

* [PATCH v5 2/2] bcache: avoid oversized read request in cache missing code path
  2021-06-07 10:35 [PATCH 0/2] bcache fixes for Linux v5.13-rc6 Coly Li
  2021-06-07 10:35 ` [PATCH v5 1/2] bcache: remove bcache device self-defined readahead Coly Li
@ 2021-06-07 10:35 ` Coly Li
  2021-06-07 11:06   ` Christoph Hellwig
  1 sibling, 1 reply; 9+ messages in thread
From: Coly Li @ 2021-06-07 10:35 UTC (permalink / raw)
  To: axboe
  Cc: linux-bcache, linux-block, linux-kernel, Coly Li,
	Alexander Ullrich, Diego Ercolani, Jan Szubiak, Marco Rebhan,
	Matthias Ferdinand, Victor Westerhuis, Vojtech Pavlik,
	Rolf Fokkens, Thorsten Knabe, stable, Christoph Hellwig,
	Kent Overstreet, Nix, Takashi Iwai

In the cache missing code path of cached device, if a proper location
from the internal B+ tree is matched for a cache miss range, function
cached_dev_cache_miss() will be called in cache_lookup_fn() in the
following code block,
[code block 1]
  526         unsigned int sectors = KEY_INODE(k) == s->iop.inode
  527                 ? min_t(uint64_t, INT_MAX,
  528                         KEY_START(k) - bio->bi_iter.bi_sector)
  529                 : INT_MAX;
  530         int ret = s->d->cache_miss(b, s, bio, sectors);

Here s->d->cache_miss() is the call backfunction pointer initialized as
cached_dev_cache_miss(), the last parameter 'sectors' is an important
hint to calculate the size of read request to backing device of the
missing cache data.

Current calculation in above code block may generate oversized value of
'sectors', which consequently may trigger 2 different potential kernel
panics by BUG() or BUG_ON() as listed below,

1) BUG_ON() inside bch_btree_insert_key(),
[code block 2]
   886         BUG_ON(b->ops->is_extents && !KEY_SIZE(k));
2) BUG() inside biovec_slab(),
[code block 3]
   51         default:
   52                 BUG();
   53                 return NULL;

All the above panics are original from cached_dev_cache_miss() by the
oversized parameter 'sectors'.

Inside cached_dev_cache_miss(), parameter 'sectors' is used to calculate
the size of data read from backing device for the cache missing. This
size is stored in s->insert_bio_sectors by the following lines of code,
[code block 4]
  909    s->insert_bio_sectors = min(sectors, bio_sectors(bio) + reada);

Then the actual key inserting to the internal B+ tree is generated and
stored in s->iop.replace_key by the following lines of code,
[code block 5]
  911   s->iop.replace_key = KEY(s->iop.inode,
  912                    bio->bi_iter.bi_sector + s->insert_bio_sectors,
  913                    s->insert_bio_sectors);
The oversized parameter 'sectors' may trigger panic 1) by BUG_ON() from
the above code block.

And the bio sending to backing device for the missing data is allocated
with hint from s->insert_bio_sectors by the following lines of code,
[code block 6]
  926    cache_bio = bio_alloc_bioset(GFP_NOWAIT,
  927                 DIV_ROUND_UP(s->insert_bio_sectors, PAGE_SECTORS),
  928                 &dc->disk.bio_split);
The oversized parameter 'sectors' may trigger panic 2) by BUG() from the
agove code block.

Now let me explain how the panics happen with the oversized 'sectors'.
In code block 5, replace_key is generated by macro KEY(). From the
definition of macro KEY(),
[code block 7]
  71 #define KEY(inode, offset, size)                                  \
  72 ((struct bkey) {                                                  \
  73      .high = (1ULL << 63) | ((__u64) (size) << 20) | (inode),     \
  74      .low = (offset)                                              \
  75 })

Here 'size' is 16bits width embedded in 64bits member 'high' of struct
bkey. But in code block 1, if "KEY_START(k) - bio->bi_iter.bi_sector" is
very probably to be larger than (1<<16) - 1, which makes the bkey size
calculation in code block 5 is overflowed. In one bug report the value
of parameter 'sectors' is 131072 (= 1 << 17), the overflowed 'sectors'
results the overflowed s->insert_bio_sectors in code block 4, then makes
size field of s->iop.replace_key to be 0 in code block 5. Then the 0-
sized s->iop.replace_key is inserted into the internal B+ tree as cache
missing check key (a special key to detect and avoid a racing between
normal write request and cache missing read request) as,
[code block 8]
  915   ret = bch_btree_insert_check_key(b, &s->op, &s->iop.replace_key);

Then the 0-sized s->iop.replace_key as 3rd parameter triggers the bkey
size check BUG_ON() in code block 2, and causes the kernel panic 1).

Another kernel panic is from code block 6, is by the bvecs number
oversized value s->insert_bio_sectors from code block 4,
        min(sectors, bio_sectors(bio) + reada)
There are two possibility for oversized reresult,
- bio_sectors(bio) is valid, but bio_sectors(bio) + reada is oversized.
- sectors < bio_sectors(bio) + reada, but sectors is oversized.

From a bug report the result of "DIV_ROUND_UP(s->insert_bio_sectors,
PAGE_SECTORS)" from code block 6 can be 344, 282, 946, 342 and many
other values which larther than BIO_MAX_VECS (a.k.a 256). When calling
bio_alloc_bioset() with such larger-than-256 value as the 2nd parameter,
this value will eventually be sent to biovec_slab() as parameter
'nr_vecs' in following code path,
   bio_alloc_bioset() ==> bvec_alloc() ==> biovec_slab()
Because parameter 'nr_vecs' is larger-than-256 value, the panic by BUG()
in code block 3 is triggered inside biovec_slab().

From the above analysis, we know that the 4th parameter 'sector' sent
into cached_dev_cache_miss() may cause overflow in code block 5 and 6,
and finally cause kernel panic in code block 2 and 3. And if result of
bio_sectors(bio) + reada exceeds valid bvecs number, it may also trigger
kernel panic in code block 3 from code block 6.

Now the almost-useless readahead size for cache missing request back to
backing device is removed, this patch can fix the oversized issue with
more simpler method.
- add a local variable size_limit,  set it by the minimum value from
  the max bkey size and max bio bvecs number.
- set s->insert_bio_sectors by the minimum value from size_limit,
  sectors, and the sectors size of bio.
- replace sectors by s->insert_bio_sectors to do bio_next_split.

By the above method with size_limit, s->insert_bio_sectors will never
result oversized replace_key size or bio bvecs number. And split bio
'miss' from bio_next_split() will always match the size of 'cache_bio',
that is the current maximum bio size we can sent to backing device for
fetching the cache missing data.

Current problmatic code can be partially found since Linux v3.13-rc1,
therefore all maintained stable kernels should try to apply this fix.

Reported-by: Alexander Ullrich <ealex1979@gmail.com>
Reported-by: Diego Ercolani <diego.ercolani@gmail.com>
Reported-by: Jan Szubiak <jan.szubiak@linuxpolska.pl>
Reported-by: Marco Rebhan <me@dblsaiko.net>
Reported-by: Matthias Ferdinand <bcache@mfedv.net>
Reported-by: Victor Westerhuis <victor@westerhu.is>
Reported-by: Vojtech Pavlik <vojtech@suse.cz>
Reported-and-tested-by: Rolf Fokkens <rolf@rolffokkens.nl>
Reported-and-tested-by: Thorsten Knabe <linux@thorsten-knabe.de>
Signed-off-by: Coly Li <colyli@suse.de>
Cc: stable@vger.kernel.org
Cc: Christoph Hellwig <hch@lst.de>
Cc: Kent Overstreet <kent.overstreet@gmail.com>
Cc: Nix <nix@esperi.org.uk>
Cc: Takashi Iwai <tiwai@suse.com>
---
Changelog,
v5, improvement and fix based on v4 comments from Christoph Hellwig
    and Nix.
v4, not directly access BIO_MAX_VECS and reduce reada value to avoid
    oversized bvecs number, by hint from Christoph Hellwig.
v3, fix typo in v2.
v2, fix the bypass bio size calculation in v1.
v1, the initial version.

 drivers/md/bcache/request.c | 8 ++++++--
 1 file changed, 6 insertions(+), 2 deletions(-)

diff --git a/drivers/md/bcache/request.c b/drivers/md/bcache/request.c
index ab8ff18df32a..d855a8882cbc 100644
--- a/drivers/md/bcache/request.c
+++ b/drivers/md/bcache/request.c
@@ -882,6 +882,7 @@ static int cached_dev_cache_miss(struct btree *b, struct search *s,
 	int ret = MAP_CONTINUE;
 	struct cached_dev *dc = container_of(s->d, struct cached_dev, disk);
 	struct bio *miss, *cache_bio;
+	unsigned int size_limit;
 
 	s->cache_missed = 1;
 
@@ -891,7 +892,10 @@ static int cached_dev_cache_miss(struct btree *b, struct search *s,
 		goto out_submit;
 	}
 
-	s->insert_bio_sectors = min(sectors, bio_sectors(bio));
+	/* Limitation for valid replace key size and cache_bio bvecs number */
+	size_limit = min_t(unsigned int, bio_max_segs(UINT_MAX) * PAGE_SECTORS,
+			   (1 << KEY_SIZE_BITS) - 1);
+	s->insert_bio_sectors = min3(size_limit, sectors, bio_sectors(bio));
 
 	s->iop.replace_key = KEY(s->iop.inode,
 				 bio->bi_iter.bi_sector + s->insert_bio_sectors,
@@ -903,7 +907,7 @@ static int cached_dev_cache_miss(struct btree *b, struct search *s,
 
 	s->iop.replace = true;
 
-	miss = bio_next_split(bio, sectors, GFP_NOIO, &s->d->bio_split);
+	miss = bio_next_split(bio, s->insert_bio_sectors, GFP_NOIO, &s->d->bio_split);
 
 	/* btree_search_recurse()'s btree iterator is no good anymore */
 	ret = miss == bio ? MAP_DONE : -EINTR;
-- 
2.26.2


^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: [PATCH v5 1/2] bcache: remove bcache device self-defined readahead
  2021-06-07 10:35 ` [PATCH v5 1/2] bcache: remove bcache device self-defined readahead Coly Li
@ 2021-06-07 11:00   ` Christoph Hellwig
  0 siblings, 0 replies; 9+ messages in thread
From: Christoph Hellwig @ 2021-06-07 11:00 UTC (permalink / raw)
  To: Coly Li
  Cc: axboe, linux-bcache, linux-block, linux-kernel,
	Alexander Ullrich, Diego Ercolani, Jan Szubiak, Marco Rebhan,
	Matthias Ferdinand, Victor Westerhuis, Vojtech Pavlik,
	Rolf Fokkens, Thorsten Knabe, stable, Christoph Hellwig,
	Kent Overstreet, Nix, Takashi Iwai

Looks good,

Reviewed-by: Christoph Hellwig <hch@lst.de>

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: [PATCH v5 2/2] bcache: avoid oversized read request in cache missing code path
  2021-06-07 10:35 ` [PATCH v5 2/2] bcache: avoid oversized read request in cache missing code path Coly Li
@ 2021-06-07 11:06   ` Christoph Hellwig
  2021-06-07 11:55     ` Coly Li
  0 siblings, 1 reply; 9+ messages in thread
From: Christoph Hellwig @ 2021-06-07 11:06 UTC (permalink / raw)
  To: Coly Li
  Cc: axboe, linux-bcache, linux-block, linux-kernel,
	Alexander Ullrich, Diego Ercolani, Jan Szubiak, Marco Rebhan,
	Matthias Ferdinand, Victor Westerhuis, Vojtech Pavlik,
	Rolf Fokkens, Thorsten Knabe, stable, Christoph Hellwig,
	Kent Overstreet, Nix, Takashi Iwai

On Mon, Jun 07, 2021 at 06:35:39PM +0800, Coly Li wrote:
> +	/* Limitation for valid replace key size and cache_bio bvecs number */
> +	size_limit = min_t(unsigned int, bio_max_segs(UINT_MAX) * PAGE_SECTORS,
> +			   (1 << KEY_SIZE_BITS) - 1);

bio_max_segs kaps the argument to BIO_MAX_VECS, so you might as well
directly write BIO_MAX_VECS.  Can you explain the PAGE_SECTORS here a bit
more? Does this code path use discontiguous per-sector allocations?
Preferably in a comment.

> +	s->insert_bio_sectors = min3(size_limit, sectors, bio_sectors(bio));

Also I don't really understand the units involved here.
s->insert_bio_sectors, sectors, and bio_sectors is in unit of 512 byte
sectors.  

> -	miss = bio_next_split(bio, sectors, GFP_NOIO, &s->d->bio_split);
> +	miss = bio_next_split(bio, s->insert_bio_sectors, GFP_NOIO, &s->d->bio_split);

Overly long line.

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: [PATCH v5 2/2] bcache: avoid oversized read request in cache missing code path
  2021-06-07 11:06   ` Christoph Hellwig
@ 2021-06-07 11:55     ` Coly Li
  2021-06-07 12:08       ` Christoph Hellwig
  0 siblings, 1 reply; 9+ messages in thread
From: Coly Li @ 2021-06-07 11:55 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: axboe, linux-bcache, linux-block, linux-kernel,
	Alexander Ullrich, Diego Ercolani, Jan Szubiak, Marco Rebhan,
	Matthias Ferdinand, Victor Westerhuis, Vojtech Pavlik,
	Rolf Fokkens, Thorsten Knabe, stable, Kent Overstreet, Nix,
	Takashi Iwai

On 6/7/21 7:06 PM, Christoph Hellwig wrote:
> On Mon, Jun 07, 2021 at 06:35:39PM +0800, Coly Li wrote:
>> +	/* Limitation for valid replace key size and cache_bio bvecs number */
>> +	size_limit = min_t(unsigned int, bio_max_segs(UINT_MAX) * PAGE_SECTORS,
>> +			   (1 << KEY_SIZE_BITS) - 1);
> bio_max_segs kaps the argument to BIO_MAX_VECS, so you might as well

It was suggested to not directly access BIO_MAX_VECS by you, maybe I
misunderstood you.


> directly write BIO_MAX_VECS.  Can you explain the PAGE_SECTORS here a bit
> more? Does this code path use discontiguous per-sector allocations?
> Preferably in a comment.


It is just because bch_bio_map() assume the maximum bio size is 1MB. It
was not true since the multiple pages bvecs
was merged in mainline kernel.
 
The PAGE_SECTORS part is legacy for 1MB maximum size bio (256*4KB), it
should be fixed/improved later to
use multiple pages for bio size > 1MB and replace bch_bio_map().


>> +	s->insert_bio_sectors = min3(size_limit, sectors, bio_sectors(bio));
> Also I don't really understand the units involved here.
> s->insert_bio_sectors, sectors, and bio_sectors is in unit of 512 byte
> sectors.  

Yes, they are in sectors, this is the maximum permitted bio size for 1MB
bio size. Now we can have bio > 1MB, and you modify
bio_alloc_bioset() parameter 'nr_iovecs from unsigned int to unsigned
short, so bio-size/page-size can be > 256 and overflow,
e.g. 258 overflows to 2, then the BUG in biovec_slab() is triggered.

I feel this is a long time existing issue in bcache code, and should be
fixed from bcache side, and your change helps to trigger
the problem explicitly.

>> -	miss = bio_next_split(bio, sectors, GFP_NOIO, &s->d->bio_split);
>> +	miss = bio_next_split(bio, s->insert_bio_sectors, GFP_NOIO, &s->d->bio_split);
> Overly long line.

Not any more. Now the line limit is 100 characters. Though I still
prefer 80 characters, place 86 characters in single line
makes the change more obvious.

Thanks.

Coly Li

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: [PATCH v5 2/2] bcache: avoid oversized read request in cache missing code path
  2021-06-07 11:55     ` Coly Li
@ 2021-06-07 12:08       ` Christoph Hellwig
  2021-06-07 12:31         ` Coly Li
  0 siblings, 1 reply; 9+ messages in thread
From: Christoph Hellwig @ 2021-06-07 12:08 UTC (permalink / raw)
  To: Coly Li
  Cc: Christoph Hellwig, axboe, linux-bcache, linux-block,
	linux-kernel, Alexander Ullrich, Diego Ercolani, Jan Szubiak,
	Marco Rebhan, Matthias Ferdinand, Victor Westerhuis,
	Vojtech Pavlik, Rolf Fokkens, Thorsten Knabe, stable,
	Kent Overstreet, Nix, Takashi Iwai

On Mon, Jun 07, 2021 at 07:55:22PM +0800, Coly Li wrote:
> On 6/7/21 7:06 PM, Christoph Hellwig wrote:
> > On Mon, Jun 07, 2021 at 06:35:39PM +0800, Coly Li wrote:
> >> +	/* Limitation for valid replace key size and cache_bio bvecs number */
> >> +	size_limit = min_t(unsigned int, bio_max_segs(UINT_MAX) * PAGE_SECTORS,
> >> +			   (1 << KEY_SIZE_BITS) - 1);
> > bio_max_segs kaps the argument to BIO_MAX_VECS, so you might as well
> 
> It was suggested to not directly access BIO_MAX_VECS by you, maybe I
> misunderstood you.

Yes, drivers really should not care about it.  But hiding that behind
a tiny wrapper doesn't help either.

> > directly write BIO_MAX_VECS.  Can you explain the PAGE_SECTORS here a bit
> > more? Does this code path use discontiguous per-sector allocations?
> > Preferably in a comment.
> 
> 
> It is just because bch_bio_map() assume the maximum bio size is 1MB. It
> was not true since the multiple pages bvecs
> was merged in mainline kernel.
>  
> The PAGE_SECTORS part is legacy for 1MB maximum size bio (256*4KB), it
> should be fixed/improved later to
> use multiple pages for bio size > 1MB and replace bch_bio_map().

bch_bio_map and bch_bio_alloc_pages that poke directly into the bio are
the root cause of a lot of these problems.

I had a series fixing some of that but Kent did not like it.  Drivers must
not diretly access bi_vcnt or directly build bios.

> Not any more. Now the line limit is 100 characters. Though I still
> prefer 80 characters, place 86 characters in single line
> makes the change more obvious.

It makes it really hard to read in a normal terminal.

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: [PATCH v5 2/2] bcache: avoid oversized read request in cache missing code path
  2021-06-07 12:08       ` Christoph Hellwig
@ 2021-06-07 12:31         ` Coly Li
  0 siblings, 0 replies; 9+ messages in thread
From: Coly Li @ 2021-06-07 12:31 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: axboe, linux-bcache, linux-block, linux-kernel,
	Alexander Ullrich, Diego Ercolani, Jan Szubiak, Marco Rebhan,
	Matthias Ferdinand, Victor Westerhuis, Vojtech Pavlik,
	Rolf Fokkens, Thorsten Knabe, stable, Kent Overstreet, Nix,
	Takashi Iwai

On 6/7/21 8:08 PM, Christoph Hellwig wrote:
> On Mon, Jun 07, 2021 at 07:55:22PM +0800, Coly Li wrote:
>> On 6/7/21 7:06 PM, Christoph Hellwig wrote:
>>> On Mon, Jun 07, 2021 at 06:35:39PM +0800, Coly Li wrote:
>>>> +	/* Limitation for valid replace key size and cache_bio bvecs number */
>>>> +	size_limit = min_t(unsigned int, bio_max_segs(UINT_MAX) * PAGE_SECTORS,
>>>> +			   (1 << KEY_SIZE_BITS) - 1);
>>> bio_max_segs kaps the argument to BIO_MAX_VECS, so you might as well
>> It was suggested to not directly access BIO_MAX_VECS by you, maybe I
>> misunderstood you.
> Yes, drivers really should not care about it.  But hiding that behind
> a tiny wrapper doesn't help either.

OK, I change back to BIO_MAX_VECS.
>>> directly write BIO_MAX_VECS.  Can you explain the PAGE_SECTORS here a bit
>>> more? Does this code path use discontiguous per-sector allocations?
>>> Preferably in a comment.
>>
>> It is just because bch_bio_map() assume the maximum bio size is 1MB. It
>> was not true since the multiple pages bvecs
>> was merged in mainline kernel.
>>  
>> The PAGE_SECTORS part is legacy for 1MB maximum size bio (256*4KB), it
>> should be fixed/improved later to
>> use multiple pages for bio size > 1MB and replace bch_bio_map().
> bch_bio_map and bch_bio_alloc_pages that poke directly into the bio are
> the root cause of a lot of these problems.
>
> I had a series fixing some of that but Kent did not like it.  Drivers must
> not diretly access bi_vcnt or directly build bios.

Last time when you posted the series, Ming Lei's multipages bvecs were
not merged yet.

But the bcache code has kind of implicit restriction, e.g.
cached_dev_cache_miss(), the refill
cache missing 'cache_bio' must be a single bio and not chained,
otherwise the following
cached_dev_read_done() will have problem. Therefore I give up to fix
such thing in this
series, maybe there is no unique way to fix, and I also need to improve
the bio related stuffs
case by case.


>> Not any more. Now the line limit is 100 characters. Though I still
>> prefer 80 characters, place 86 characters in single line
>> makes the change more obvious.
> It makes it really hard to read in a normal terminal.

I don't know you still use 80 characters terminal for coding. Sure you
have my respect and I will serve your
request in next post.

Thanks.

Coly Li

^ permalink raw reply	[flat|nested] 9+ messages in thread

* [PATCH v5 2/2] bcache: avoid oversized read request in cache missing code path
  2021-05-31 15:31 [PATCH v5 1/2] bcache: remove bcache device self-defined readahead Coly Li
@ 2021-05-31 15:31 ` Coly Li
  0 siblings, 0 replies; 9+ messages in thread
From: Coly Li @ 2021-05-31 15:31 UTC (permalink / raw)
  To: linux-bcache
  Cc: linux-block, linux-kernel, Coly Li, Alexander Ullrich,
	Diego Ercolani, Jan Szubiak, Marco Rebhan, Matthias Ferdinand,
	Thorsten Knabe, Victor Westerhuis, Vojtech Pavlik, stable,
	Christoph Hellwig, Kent Overstreet, Nix, Takashi Iwai

In the cache missing code path of cached device, if a proper location
from the internal B+ tree is matched for a cache miss range, function
cached_dev_cache_miss() will be called in cache_lookup_fn() in the
following code block,
[code block 1]
  526         unsigned int sectors = KEY_INODE(k) == s->iop.inode
  527                 ? min_t(uint64_t, INT_MAX,
  528                         KEY_START(k) - bio->bi_iter.bi_sector)
  529                 : INT_MAX;
  530         int ret = s->d->cache_miss(b, s, bio, sectors);

Here s->d->cache_miss() is the call backfunction pointer initialized as
cached_dev_cache_miss(), the last parameter 'sectors' is an important
hint to calculate the size of read request to backing device of the
missing cache data.

Current calculation in above code block may generate oversized value of
'sectors', which consequently may trigger 2 different potential kernel
panics by BUG() or BUG_ON() as listed below,

1) BUG_ON() inside bch_btree_insert_key(),
[code block 2]
   886         BUG_ON(b->ops->is_extents && !KEY_SIZE(k));
2) BUG() inside biovec_slab(),
[code block 3]
   51         default:
   52                 BUG();
   53                 return NULL;

All the above panics are original from cached_dev_cache_miss() by the
oversized parameter 'sectors'.

Inside cached_dev_cache_miss(), parameter 'sectors' is used to calculate
the size of data read from backing device for the cache missing. This
size is stored in s->insert_bio_sectors by the following lines of code,
[code block 4]
  909    s->insert_bio_sectors = min(sectors, bio_sectors(bio) + reada);

Then the actual key inserting to the internal B+ tree is generated and
stored in s->iop.replace_key by the following lines of code,
[code block 5]
  911   s->iop.replace_key = KEY(s->iop.inode,
  912                    bio->bi_iter.bi_sector + s->insert_bio_sectors,
  913                    s->insert_bio_sectors);
The oversized parameter 'sectors' may trigger panic 1) by BUG_ON() from
the above code block.

And the bio sending to backing device for the missing data is allocated
with hint from s->insert_bio_sectors by the following lines of code,
[code block 6]
  926    cache_bio = bio_alloc_bioset(GFP_NOWAIT,
  927                 DIV_ROUND_UP(s->insert_bio_sectors, PAGE_SECTORS),
  928                 &dc->disk.bio_split);
The oversized parameter 'sectors' may trigger panic 2) by BUG() from the
agove code block.

Now let me explain how the panics happen with the oversized 'sectors'.
In code block 5, replace_key is generated by macro KEY(). From the
definition of macro KEY(),
[code block 7]
  71 #define KEY(inode, offset, size)                                  \
  72 ((struct bkey) {                                                  \
  73      .high = (1ULL << 63) | ((__u64) (size) << 20) | (inode),     \
  74      .low = (offset)                                              \
  75 })

Here 'size' is 16bits width embedded in 64bits member 'high' of struct
bkey. But in code block 1, if "KEY_START(k) - bio->bi_iter.bi_sector" is
very probably to be larger than (1<<16) - 1, which makes the bkey size
calculation in code block 5 is overflowed. In one bug report the value
of parameter 'sectors' is 131072 (= 1 << 17), the overflowed 'sectors'
results the overflowed s->insert_bio_sectors in code block 4, then makes
size field of s->iop.replace_key to be 0 in code block 5. Then the 0-
sized s->iop.replace_key is inserted into the internal B+ tree as cache
missing check key (a special key to detect and avoid a racing between
normal write request and cache missing read request) as,
[code block 8]
  915   ret = bch_btree_insert_check_key(b, &s->op, &s->iop.replace_key);

Then the 0-sized s->iop.replace_key as 3rd parameter triggers the bkey
size check BUG_ON() in code block 2, and causes the kernel panic 1).

Another kernel panic is from code block 6, is by the bvecs number
oversized value s->insert_bio_sectors from code block 4,
        min(sectors, bio_sectors(bio) + reada)
There are two possibility for oversized reresult,
- bio_sectors(bio) is valid, but bio_sectors(bio) + reada is oversized.
- sectors < bio_sectors(bio) + reada, but sectors is oversized.

From a bug report the result of "DIV_ROUND_UP(s->insert_bio_sectors,
PAGE_SECTORS)" from code block 6 can be 344, 282, 946, 342 and many
other values which larther than BIO_MAX_VECS (a.k.a 256). When calling
bio_alloc_bioset() with such larger-than-256 value as the 2nd parameter,
this value will eventually be sent to biovec_slab() as parameter
'nr_vecs' in following code path,
   bio_alloc_bioset() ==> bvec_alloc() ==> biovec_slab()
Because parameter 'nr_vecs' is larger-than-256 value, the panic by BUG()
in code block 3 is triggered inside biovec_slab().

From the above analysis, we know that the 4th parameter 'sector' sent
into cached_dev_cache_miss() may cause overflow in code block 5 and 6,
and finally cause kernel panic in code block 2 and 3. And if result of
bio_sectors(bio) + reada exceeds valid bvecs number, it may also trigger
kernel panic in code block 3 from code block 6.

Now the almost-useless readahead size for cache missing request back to
backing device is removed, this patch can fix the oversized issue with
more simpler method.
- add a local variable size_limit,  set it by the minimum value from
  the max bkey size and max bio bvecs number.
- set s->insert_bio_sectors by the minimum value from size_limit,
  sectors, and the sectors size of bio.
- replace sectors by s->insert_bio_sectors to do bio_next_split.

By the above method with size_limit, s->insert_bio_sectors will never
result oversized replace_key size or bio bvecs number. And split bio
'miss' from bio_next_split() will always match the size of 'cache_bio',
that is the current maximum bio size we can sent to backing device for
fetching the cache missing data.

Current problmatic code can be partially found since Linux v3.13-rc1,
therefore all maintained stable kernels should try to apply this fix.

Reported-by: Alexander Ullrich <ealex1979@gmail.com>
Reported-by: Diego Ercolani <diego.ercolani@gmail.com>
Reported-by: Jan Szubiak <jan.szubiak@linuxpolska.pl>
Reported-by: Marco Rebhan <me@dblsaiko.net>
Reported-by: Matthias Ferdinand <bcache@mfedv.net>
Reported-by: Thorsten Knabe <linux@thorsten-knabe.de>
Reported-by: Victor Westerhuis <victor@westerhu.is>
Reported-by: Vojtech Pavlik <vojtech@suse.cz>
Signed-off-by: Coly Li <colyli@suse.de>
Cc: stable@vger.kernel.org
Cc: Christoph Hellwig <hch@lst.de>
Cc: Kent Overstreet <kent.overstreet@gmail.com>
Cc: Nix <nix@esperi.org.uk>
Cc: Takashi Iwai <tiwai@suse.com>
---
Changelog,
v5, improvement and fix based on v4 comments from Christoph Hellwig
    and Nix.
v4, not directly access BIO_MAX_VECS and reduce reada value to avoid
    oversized bvecs number, by hint from Christoph Hellwig.
v3, fix typo in v2.
v2, fix the bypass bio size calculation in v1.
v1, the initial version.

 drivers/md/bcache/request.c | 8 ++++++--
 1 file changed, 6 insertions(+), 2 deletions(-)

diff --git a/drivers/md/bcache/request.c b/drivers/md/bcache/request.c
index ab8ff18df32a..d855a8882cbc 100644
--- a/drivers/md/bcache/request.c
+++ b/drivers/md/bcache/request.c
@@ -882,6 +882,7 @@ static int cached_dev_cache_miss(struct btree *b, struct search *s,
 	int ret = MAP_CONTINUE;
 	struct cached_dev *dc = container_of(s->d, struct cached_dev, disk);
 	struct bio *miss, *cache_bio;
+	unsigned int size_limit;
 
 	s->cache_missed = 1;
 
@@ -891,7 +892,10 @@ static int cached_dev_cache_miss(struct btree *b, struct search *s,
 		goto out_submit;
 	}
 
-	s->insert_bio_sectors = min(sectors, bio_sectors(bio));
+	/* Limitation for valid replace key size and cache_bio bvecs number */
+	size_limit = min_t(unsigned int, bio_max_segs(UINT_MAX) * PAGE_SECTORS,
+			   (1 << KEY_SIZE_BITS) - 1);
+	s->insert_bio_sectors = min3(size_limit, sectors, bio_sectors(bio));
 
 	s->iop.replace_key = KEY(s->iop.inode,
 				 bio->bi_iter.bi_sector + s->insert_bio_sectors,
@@ -903,7 +907,7 @@ static int cached_dev_cache_miss(struct btree *b, struct search *s,
 
 	s->iop.replace = true;
 
-	miss = bio_next_split(bio, sectors, GFP_NOIO, &s->d->bio_split);
+	miss = bio_next_split(bio, s->insert_bio_sectors, GFP_NOIO, &s->d->bio_split);
 
 	/* btree_search_recurse()'s btree iterator is no good anymore */
 	ret = miss == bio ? MAP_DONE : -EINTR;
-- 
2.26.2


^ permalink raw reply	[flat|nested] 9+ messages in thread

end of thread, other threads:[~2021-06-07 12:32 UTC | newest]

Thread overview: 9+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2021-06-07 10:35 [PATCH 0/2] bcache fixes for Linux v5.13-rc6 Coly Li
2021-06-07 10:35 ` [PATCH v5 1/2] bcache: remove bcache device self-defined readahead Coly Li
2021-06-07 11:00   ` Christoph Hellwig
2021-06-07 10:35 ` [PATCH v5 2/2] bcache: avoid oversized read request in cache missing code path Coly Li
2021-06-07 11:06   ` Christoph Hellwig
2021-06-07 11:55     ` Coly Li
2021-06-07 12:08       ` Christoph Hellwig
2021-06-07 12:31         ` Coly Li
  -- strict thread matches above, loose matches on Subject: below --
2021-05-31 15:31 [PATCH v5 1/2] bcache: remove bcache device self-defined readahead Coly Li
2021-05-31 15:31 ` [PATCH v5 2/2] bcache: avoid oversized read request in cache missing code path Coly Li

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).