linux-kernel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* Re: bcache super block corruption with non 4k pages
       [not found] <1469091513-11233-1-git-send-email-stefan.bader@canonical.com>
@ 2016-07-26  9:51 ` Stefan Bader
  2016-07-26 10:21   ` Kent Overstreet
  0 siblings, 1 reply; 8+ messages in thread
From: Stefan Bader @ 2016-07-26  9:51 UTC (permalink / raw)
  To: linux-bcache, dm-devel, Linux Kernel Mailing List
  Cc: liuzhengyuang521, kent.overstreet, bcache, apw, Stefan Bader

[-- Attachment #1: Type: text/plain, Size: 8005 bytes --]

On 21.07.2016 10:58, Stefan Bader wrote:
> I was pointed at the thread which seems to address the same after
> I wrote most of below text. Did not want to re-write this so please
> bear with the odd layout.
> 
> https://www.redhat.com/archives/dm-devel/2016-June/msg00015.html
> 
> Zhengyuan tries to fix the problem by relocating the superblock on
> disk. But I am not sure whether there is really any guarantee about
> how __bread fills data into the buffer_head. What if there is the next
> odd arch with 128K pages?
> 
> So below is an attempt to be more generic. Still I don't feel completely
> happy with the way that a page moves (or is shared) between buffer_head
> and biovec. What I tried to outline below is to let the register functions
> allocate bio+biovec memory and use the in-memory sb_cache data to initialize
> the biovec buffer.

Any opinions here? Also adding LKML as I don't seem to get through moderation on
dm-devel.


> 
> -Stefan
> 
> 
> ---
> 
> This was seen on ppc64le with 64k page size. The problem is that
> in that case __bread returns a page where b_data is not at the
> start of the page. And I am not really sure that the way bcache
> re-purposes the page returned by __bread to be used as biovec 
> element is really a good idea.
> 
> The way it is now, I saw __bread return a 64k page where b_data
> was starting at 4k offset. Then __write_super was modifying some
> data at offset 0 but for example not writing the magic number again.
> 
> Not sure why (maybe this messes up flushes, too) but the bad data was not
> immediately written back when the devices were registered. So at that time
> bcache-super-show would report data as it was written by user-space (like
> the cache type still 0 and not 3, and claiming the cache to still bei
> detached).
> 
> But when stopping the cache and unregistering the cache set this changed
> and bcache-super-show was reporting a bad magic number (as expected).
> 
> The patch below works around this (tested with 64k and 4k pages) but I
> am not really sure this is a clean approach. My gut feeling would be that
> the bio structures should be allocated speperately (I think there is a way
> of allocating a bioset without using a pool). Then that separate page could
> be pre-initialized from the super block structure without passing around
> a page the feels more private to __bread usage...
> 
> -Stefan
> 
> 
> 
> From 3652e98359b876f3c1e6d7b9b7305e3411178296 Mon Sep 17 00:00:00 2001
> From: Stefan Bader <stefan.bader@canonical.com>
> Date: Wed, 20 Jul 2016 12:06:27 +0200
> Subject: [PATCH] bcache: handle non 4k pages returned by __bread
> 
> On non-x86 arches pages can be bigger than 4k. Currently read_super
> returns a reference to the page used as buffer in the buffer_head
> that is returned by __bread().
> With 4k page size and a requested read this normally ends up with 
> the super block data starting at offset 0. But as seen on ppc64le
> with 64k page size, the data can start at an offset (from what I
> saw the offset was 4k).
> This causes harm later on when __write_super() maps the super
> block data at the start of the page acquired before and also
> not writes back all fields (particularly the super block magic).
> 
> Try to avoid this by also returning the potential offset within the
> sb_page from read_super() and fully initiallize the single bvec of
> the bio used for __write_super() calls. Doing that is the same as
> would have been done in bch_bio_map() which now must not be used in
> __write_super().
> 
> Signed-off-by: Stefan Bader <stefan.bader@canonical.com>
> 
> removedebug
> 
> Signed-off-by: Stefan Bader <stefan.bader@canonical.com>
> ---
>  drivers/md/bcache/super.c | 22 +++++++++++++++-------
>  1 file changed, 15 insertions(+), 7 deletions(-)
> 
> diff --git a/drivers/md/bcache/super.c b/drivers/md/bcache/super.c
> index e169739..3e0d2de 100644
> --- a/drivers/md/bcache/super.c
> +++ b/drivers/md/bcache/super.c
> @@ -62,7 +62,7 @@ struct workqueue_struct *bcache_wq;
>  /* Superblock */
>  
>  static const char *read_super(struct cache_sb *sb, struct block_device *bdev,
> -			      struct page **res)
> +			      struct page **res, unsigned int *off)
>  {
>  	const char *err;
>  	struct cache_sb *s;
> @@ -192,6 +192,7 @@ static const char *read_super(struct cache_sb *sb, struct block_device *bdev,
>  	err = NULL;
>  
>  	get_page(bh->b_page);
> +	*off = (unsigned int) (bh->b_data - ((char *) page_address(bh->b_page)));
>  	*res = bh->b_page;
>  err:
>  	put_bh(bh);
> @@ -202,19 +203,19 @@ static void write_bdev_super_endio(struct bio *bio)
>  {
>  	struct cached_dev *dc = bio->bi_private;
>  	/* XXX: error checking */
> -
>  	closure_put(&dc->sb_write);
>  }
>  
>  static void __write_super(struct cache_sb *sb, struct bio *bio)
>  {
> -	struct cache_sb *out = page_address(bio->bi_io_vec[0].bv_page);
> +	struct cache_sb *out = page_address(bio->bi_io_vec[0].bv_page) +
> +			       bio->bi_io_vec[0].bv_offset;
>  	unsigned i;
>  
>  	bio->bi_iter.bi_sector	= SB_SECTOR;
>  	bio->bi_rw		= REQ_SYNC|REQ_META;
>  	bio->bi_iter.bi_size	= SB_SIZE;
> -	bch_bio_map(bio, NULL);
> +	// bch_bio_map(bio, NULL);
>  
>  	out->offset		= cpu_to_le64(sb->offset);
>  	out->version		= cpu_to_le64(sb->version);
> @@ -1139,6 +1140,7 @@ static int cached_dev_init(struct cached_dev *dc, unsigned block_size)
>  /* Cached device - bcache superblock */
>  
>  static void register_bdev(struct cache_sb *sb, struct page *sb_page,
> +				 unsigned int sb_off,
>  				 struct block_device *bdev,
>  				 struct cached_dev *dc)
>  {
> @@ -1154,6 +1156,8 @@ static void register_bdev(struct cache_sb *sb, struct page *sb_page,
>  	dc->sb_bio.bi_max_vecs	= 1;
>  	dc->sb_bio.bi_io_vec	= dc->sb_bio.bi_inline_vecs;
>  	dc->sb_bio.bi_io_vec[0].bv_page = sb_page;
> +	dc->sb_bio.bi_io_vec[0].bv_len = SB_SIZE;
> +	dc->sb_bio.bi_io_vec[0].bv_offset = sb_off;
>  	get_page(sb_page);
>  
>  	if (cached_dev_init(dc, sb->block_size << 9))
> @@ -1839,6 +1843,7 @@ static int cache_alloc(struct cache_sb *sb, struct cache *ca)
>  }
>  
>  static int register_cache(struct cache_sb *sb, struct page *sb_page,
> +				unsigned int sb_off,
>  				struct block_device *bdev, struct cache *ca)
>  {
>  	char name[BDEVNAME_SIZE];
> @@ -1853,6 +1858,8 @@ static int register_cache(struct cache_sb *sb, struct page *sb_page,
>  	ca->sb_bio.bi_max_vecs	= 1;
>  	ca->sb_bio.bi_io_vec	= ca->sb_bio.bi_inline_vecs;
>  	ca->sb_bio.bi_io_vec[0].bv_page = sb_page;
> +	ca->sb_bio.bi_io_vec[0].bv_len  = SB_SIZE;
> +	ca->sb_bio.bi_io_vec[0].bv_offset = sb_off;
>  	get_page(sb_page);
>  
>  	if (blk_queue_discard(bdev_get_queue(ca->bdev)))
> @@ -1936,6 +1943,7 @@ static ssize_t register_bcache(struct kobject *k, struct kobj_attribute *attr,
>  	struct cache_sb *sb = NULL;
>  	struct block_device *bdev = NULL;
>  	struct page *sb_page = NULL;
> +	unsigned int sb_off;
>  
>  	if (!try_module_get(THIS_MODULE))
>  		return -EBUSY;
> @@ -1967,7 +1975,7 @@ static ssize_t register_bcache(struct kobject *k, struct kobj_attribute *attr,
>  	if (set_blocksize(bdev, 4096))
>  		goto err_close;
>  
> -	err = read_super(sb, bdev, &sb_page);
> +	err = read_super(sb, bdev, &sb_page, &sb_off);
>  	if (err)
>  		goto err_close;
>  
> @@ -1977,14 +1985,14 @@ static ssize_t register_bcache(struct kobject *k, struct kobj_attribute *attr,
>  			goto err_close;
>  
>  		mutex_lock(&bch_register_lock);
> -		register_bdev(sb, sb_page, bdev, dc);
> +		register_bdev(sb, sb_page, sb_off, bdev, dc);
>  		mutex_unlock(&bch_register_lock);
>  	} else {
>  		struct cache *ca = kzalloc(sizeof(*ca), GFP_KERNEL);
>  		if (!ca)
>  			goto err_close;
>  
> -		if (register_cache(sb, sb_page, bdev, ca) != 0)
> +		if (register_cache(sb, sb_page, sb_off, bdev, ca) != 0)
>  			goto err_close;
>  	}
>  out:
> 



[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 836 bytes --]

^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: bcache super block corruption with non 4k pages
  2016-07-26  9:51 ` bcache super block corruption with non 4k pages Stefan Bader
@ 2016-07-26 10:21   ` Kent Overstreet
  2016-07-26 12:32     ` Stefan Bader
  0 siblings, 1 reply; 8+ messages in thread
From: Kent Overstreet @ 2016-07-26 10:21 UTC (permalink / raw)
  To: Stefan Bader
  Cc: linux-bcache, dm-devel, Linux Kernel Mailing List,
	liuzhengyuang521, bcache, apw

On Tue, Jul 26, 2016 at 11:51:25AM +0200, Stefan Bader wrote:
> On 21.07.2016 10:58, Stefan Bader wrote:
> > I was pointed at the thread which seems to address the same after
> > I wrote most of below text. Did not want to re-write this so please
> > bear with the odd layout.
> > 
> > https://www.redhat.com/archives/dm-devel/2016-June/msg00015.html
> > 
> > Zhengyuan tries to fix the problem by relocating the superblock on
> > disk. But I am not sure whether there is really any guarantee about
> > how __bread fills data into the buffer_head. What if there is the next
> > odd arch with 128K pages?
> > 
> > So below is an attempt to be more generic. Still I don't feel completely
> > happy with the way that a page moves (or is shared) between buffer_head
> > and biovec. What I tried to outline below is to let the register functions
> > allocate bio+biovec memory and use the in-memory sb_cache data to initialize
> > the biovec buffer.
> 
> Any opinions here? Also adding LKML as I don't seem to get through moderation on
> dm-devel.

The correct solution is to rip out the __bread() and just read the superblock by
issuing a bio, the same way all the other IO in bcache is done.

This is the way it's done in the bcache-dev branch - unfortunately, the patch
that does that in bcache-dev is big and invasive and probably not worth the
hassle to backport:

https://evilpiepirate.org/git/linux-bcache.git/commit/?h=bcache-dev&id=303eb67bffad57b4d9e71523e7df04bf258e66d1

Probably best to just do something small and localized.

^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: bcache super block corruption with non 4k pages
  2016-07-26 10:21   ` Kent Overstreet
@ 2016-07-26 12:32     ` Stefan Bader
  2016-07-26 12:49       ` Kent Overstreet
  0 siblings, 1 reply; 8+ messages in thread
From: Stefan Bader @ 2016-07-26 12:32 UTC (permalink / raw)
  To: Kent Overstreet
  Cc: linux-bcache, dm-devel, Linux Kernel Mailing List,
	liuzhengyuang521, bcache, apw, Stefan Bader


[-- Attachment #1.1: Type: text/plain, Size: 2008 bytes --]

On 26.07.2016 12:21, Kent Overstreet wrote:
> On Tue, Jul 26, 2016 at 11:51:25AM +0200, Stefan Bader wrote:
>> On 21.07.2016 10:58, Stefan Bader wrote:
>>> I was pointed at the thread which seems to address the same after
>>> I wrote most of below text. Did not want to re-write this so please
>>> bear with the odd layout.
>>>
>>> https://www.redhat.com/archives/dm-devel/2016-June/msg00015.html
>>>
>>> Zhengyuan tries to fix the problem by relocating the superblock on
>>> disk. But I am not sure whether there is really any guarantee about
>>> how __bread fills data into the buffer_head. What if there is the next
>>> odd arch with 128K pages?
>>>
>>> So below is an attempt to be more generic. Still I don't feel completely
>>> happy with the way that a page moves (or is shared) between buffer_head
>>> and biovec. What I tried to outline below is to let the register functions
>>> allocate bio+biovec memory and use the in-memory sb_cache data to initialize
>>> the biovec buffer.
>>
>> Any opinions here? Also adding LKML as I don't seem to get through moderation on
>> dm-devel.
> 
> The correct solution is to rip out the __bread() and just read the superblock by
> issuing a bio, the same way all the other IO in bcache is done.
> 
> This is the way it's done in the bcache-dev branch - unfortunately, the patch
> that does that in bcache-dev is big and invasive and probably not worth the
> hassle to backport:
> 
> https://evilpiepirate.org/git/linux-bcache.git/commit/?h=bcache-dev&id=303eb67bffad57b4d9e71523e7df04bf258e66d1

I agree that this looks better and also rather large.
> 
> Probably best to just do something small and localized.
> 
So what did you think about the change I did? It seemed to be ok for 4K and 64K
at least and is rather small. And I believe that, compared to Zhengyuan's
approach this would have the benefit of not changing the superblock sector. So
it would be compatible with previously created superblocks.

-Stefan



[-- Warning: decoded text below may be mangled, UTF-8 assumed --]
[-- Attachment #1.2: 0001-bcache-stable-fix-sb.patch --]
[-- Type: text/x-diff; name="0001-bcache-stable-fix-sb.patch", Size: 5197 bytes --]

From 3652e98359b876f3c1e6d7b9b7305e3411178296 Mon Sep 17 00:00:00 2001
From: Stefan Bader <stefan.bader@canonical.com>
Date: Wed, 20 Jul 2016 12:06:27 +0200
Subject: [PATCH] bcache: handle non 4k pages returned by __bread

On non-x86 arches pages can be bigger than 4k. Currently read_super
returns a reference to the page used as buffer in the buffer_head
that is returned by __bread().
With 4k page size and a requested read this normally ends up with 
the super block data starting at offset 0. But as seen on ppc64le
with 64k page size, the data can start at an offset (from what I
saw the offset was 4k).
This causes harm later on when __write_super() maps the super
block data at the start of the page acquired before and also
not writes back all fields (particularly the super block magic).

Try to avoid this by also returning the potential offset within the
sb_page from read_super() and fully initiallize the single bvec of
the bio used for __write_super() calls. Doing that is the same as
would have been done in bch_bio_map() which now must not be used in
__write_super().

Signed-off-by: Stefan Bader <stefan.bader@canonical.com>

---
 drivers/md/bcache/super.c | 22 +++++++++++++++-------
 1 file changed, 15 insertions(+), 7 deletions(-)

diff --git a/drivers/md/bcache/super.c b/drivers/md/bcache/super.c
index e169739..3e0d2de 100644
--- a/drivers/md/bcache/super.c
+++ b/drivers/md/bcache/super.c
@@ -62,7 +62,7 @@ struct workqueue_struct *bcache_wq;
 /* Superblock */
 
 static const char *read_super(struct cache_sb *sb, struct block_device *bdev,
-			      struct page **res)
+			      struct page **res, unsigned int *off)
 {
 	const char *err;
 	struct cache_sb *s;
@@ -192,6 +192,7 @@ static const char *read_super(struct cache_sb *sb, struct block_device *bdev,
 	err = NULL;
 
 	get_page(bh->b_page);
+	*off = (unsigned int) (bh->b_data - ((char *) page_address(bh->b_page)));
 	*res = bh->b_page;
 err:
 	put_bh(bh);
@@ -202,19 +203,19 @@ static void write_bdev_super_endio(struct bio *bio)
 {
 	struct cached_dev *dc = bio->bi_private;
 	/* XXX: error checking */
-
 	closure_put(&dc->sb_write);
 }
 
 static void __write_super(struct cache_sb *sb, struct bio *bio)
 {
-	struct cache_sb *out = page_address(bio->bi_io_vec[0].bv_page);
+	struct cache_sb *out = page_address(bio->bi_io_vec[0].bv_page) +
+			       bio->bi_io_vec[0].bv_offset;
 	unsigned i;
 
 	bio->bi_iter.bi_sector	= SB_SECTOR;
 	bio->bi_rw		= REQ_SYNC|REQ_META;
 	bio->bi_iter.bi_size	= SB_SIZE;
-	bch_bio_map(bio, NULL);
+	// bch_bio_map(bio, NULL);
 
 	out->offset		= cpu_to_le64(sb->offset);
 	out->version		= cpu_to_le64(sb->version);
@@ -1139,6 +1140,7 @@ static int cached_dev_init(struct cached_dev *dc, unsigned block_size)
 /* Cached device - bcache superblock */
 
 static void register_bdev(struct cache_sb *sb, struct page *sb_page,
+				 unsigned int sb_off,
 				 struct block_device *bdev,
 				 struct cached_dev *dc)
 {
@@ -1154,6 +1156,8 @@ static void register_bdev(struct cache_sb *sb, struct page *sb_page,
 	dc->sb_bio.bi_max_vecs	= 1;
 	dc->sb_bio.bi_io_vec	= dc->sb_bio.bi_inline_vecs;
 	dc->sb_bio.bi_io_vec[0].bv_page = sb_page;
+	dc->sb_bio.bi_io_vec[0].bv_len = SB_SIZE;
+	dc->sb_bio.bi_io_vec[0].bv_offset = sb_off;
 	get_page(sb_page);
 
 	if (cached_dev_init(dc, sb->block_size << 9))
@@ -1839,6 +1843,7 @@ static int cache_alloc(struct cache_sb *sb, struct cache *ca)
 }
 
 static int register_cache(struct cache_sb *sb, struct page *sb_page,
+				unsigned int sb_off,
 				struct block_device *bdev, struct cache *ca)
 {
 	char name[BDEVNAME_SIZE];
@@ -1853,6 +1858,8 @@ static int register_cache(struct cache_sb *sb, struct page *sb_page,
 	ca->sb_bio.bi_max_vecs	= 1;
 	ca->sb_bio.bi_io_vec	= ca->sb_bio.bi_inline_vecs;
 	ca->sb_bio.bi_io_vec[0].bv_page = sb_page;
+	ca->sb_bio.bi_io_vec[0].bv_len  = SB_SIZE;
+	ca->sb_bio.bi_io_vec[0].bv_offset = sb_off;
 	get_page(sb_page);
 
 	if (blk_queue_discard(bdev_get_queue(ca->bdev)))
@@ -1936,6 +1943,7 @@ static ssize_t register_bcache(struct kobject *k, struct kobj_attribute *attr,
 	struct cache_sb *sb = NULL;
 	struct block_device *bdev = NULL;
 	struct page *sb_page = NULL;
+	unsigned int sb_off;
 
 	if (!try_module_get(THIS_MODULE))
 		return -EBUSY;
@@ -1967,7 +1975,7 @@ static ssize_t register_bcache(struct kobject *k, struct kobj_attribute *attr,
 	if (set_blocksize(bdev, 4096))
 		goto err_close;
 
-	err = read_super(sb, bdev, &sb_page);
+	err = read_super(sb, bdev, &sb_page, &sb_off);
 	if (err)
 		goto err_close;
 
@@ -1977,14 +1985,14 @@ static ssize_t register_bcache(struct kobject *k, struct kobj_attribute *attr,
 			goto err_close;
 
 		mutex_lock(&bch_register_lock);
-		register_bdev(sb, sb_page, bdev, dc);
+		register_bdev(sb, sb_page, sb_off, bdev, dc);
 		mutex_unlock(&bch_register_lock);
 	} else {
 		struct cache *ca = kzalloc(sizeof(*ca), GFP_KERNEL);
 		if (!ca)
 			goto err_close;
 
-		if (register_cache(sb, sb_page, bdev, ca) != 0)
+		if (register_cache(sb, sb_page, sb_off, bdev, ca) != 0)
 			goto err_close;
 	}
 out:
-- 
1.9.1


[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 836 bytes --]

^ permalink raw reply related	[flat|nested] 8+ messages in thread

* Re: bcache super block corruption with non 4k pages
  2016-07-26 12:32     ` Stefan Bader
@ 2016-07-26 12:49       ` Kent Overstreet
  2016-07-27 15:16         ` Stefan Bader
  0 siblings, 1 reply; 8+ messages in thread
From: Kent Overstreet @ 2016-07-26 12:49 UTC (permalink / raw)
  To: Stefan Bader
  Cc: linux-bcache, dm-devel, Linux Kernel Mailing List,
	liuzhengyuang521, bcache, apw

On Tue, Jul 26, 2016 at 02:32:31PM +0200, Stefan Bader wrote:
> On 26.07.2016 12:21, Kent Overstreet wrote:
> > On Tue, Jul 26, 2016 at 11:51:25AM +0200, Stefan Bader wrote:
> >> On 21.07.2016 10:58, Stefan Bader wrote:
> >>> I was pointed at the thread which seems to address the same after
> >>> I wrote most of below text. Did not want to re-write this so please
> >>> bear with the odd layout.
> >>>
> >>> https://www.redhat.com/archives/dm-devel/2016-June/msg00015.html
> >>>
> >>> Zhengyuan tries to fix the problem by relocating the superblock on
> >>> disk. But I am not sure whether there is really any guarantee about
> >>> how __bread fills data into the buffer_head. What if there is the next
> >>> odd arch with 128K pages?
> >>>
> >>> So below is an attempt to be more generic. Still I don't feel completely
> >>> happy with the way that a page moves (or is shared) between buffer_head
> >>> and biovec. What I tried to outline below is to let the register functions
> >>> allocate bio+biovec memory and use the in-memory sb_cache data to initialize
> >>> the biovec buffer.
> >>
> >> Any opinions here? Also adding LKML as I don't seem to get through moderation on
> >> dm-devel.
> > 
> > The correct solution is to rip out the __bread() and just read the superblock by
> > issuing a bio, the same way all the other IO in bcache is done.
> > 
> > This is the way it's done in the bcache-dev branch - unfortunately, the patch
> > that does that in bcache-dev is big and invasive and probably not worth the
> > hassle to backport:
> > 
> > https://evilpiepirate.org/git/linux-bcache.git/commit/?h=bcache-dev&id=303eb67bffad57b4d9e71523e7df04bf258e66d1
> 
> I agree that this looks better and also rather large.
> > 
> > Probably best to just do something small and localized.
> > 
> So what did you think about the change I did? It seemed to be ok for 4K and 64K
> at least and is rather small. And I believe that, compared to Zhengyuan's
> approach this would have the benefit of not changing the superblock sector. So
> it would be compatible with previously created superblocks.

Too ugly to live. Just kmalloc() 4k, allocate a bio on the stack, set it up, and
submit it with submit_bio_wait(). Use virt_to_page(), don't bother with raw
pages - you want 4k, not whatever the page size is.

^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: bcache super block corruption with non 4k pages
  2016-07-26 12:49       ` Kent Overstreet
@ 2016-07-27 15:16         ` Stefan Bader
  2016-07-28  5:55           ` Kent Overstreet
  0 siblings, 1 reply; 8+ messages in thread
From: Stefan Bader @ 2016-07-27 15:16 UTC (permalink / raw)
  To: Kent Overstreet
  Cc: linux-bcache, dm-devel, Linux Kernel Mailing List,
	liuzhengyuang521, bcache, apw, Stefan Bader


[-- Attachment #1.1: Type: text/plain, Size: 3644 bytes --]

On 26.07.2016 14:49, Kent Overstreet wrote:
> On Tue, Jul 26, 2016 at 02:32:31PM +0200, Stefan Bader wrote:
>> On 26.07.2016 12:21, Kent Overstreet wrote:
>>> On Tue, Jul 26, 2016 at 11:51:25AM +0200, Stefan Bader wrote:
>>>> On 21.07.2016 10:58, Stefan Bader wrote:
>>>>> I was pointed at the thread which seems to address the same after
>>>>> I wrote most of below text. Did not want to re-write this so please
>>>>> bear with the odd layout.
>>>>>
>>>>> https://www.redhat.com/archives/dm-devel/2016-June/msg00015.html
>>>>>
>>>>> Zhengyuan tries to fix the problem by relocating the superblock on
>>>>> disk. But I am not sure whether there is really any guarantee about
>>>>> how __bread fills data into the buffer_head. What if there is the next
>>>>> odd arch with 128K pages?
>>>>>
>>>>> So below is an attempt to be more generic. Still I don't feel completely
>>>>> happy with the way that a page moves (or is shared) between buffer_head
>>>>> and biovec. What I tried to outline below is to let the register functions
>>>>> allocate bio+biovec memory and use the in-memory sb_cache data to initialize
>>>>> the biovec buffer.
>>>>
>>>> Any opinions here? Also adding LKML as I don't seem to get through moderation on
>>>> dm-devel.
>>>
>>> The correct solution is to rip out the __bread() and just read the superblock by
>>> issuing a bio, the same way all the other IO in bcache is done.
>>>
>>> This is the way it's done in the bcache-dev branch - unfortunately, the patch
>>> that does that in bcache-dev is big and invasive and probably not worth the
>>> hassle to backport:
>>>
>>> https://evilpiepirate.org/git/linux-bcache.git/commit/?h=bcache-dev&id=303eb67bffad57b4d9e71523e7df04bf258e66d1
>>
>> I agree that this looks better and also rather large.
>>>
>>> Probably best to just do something small and localized.
>>>
>> So what did you think about the change I did? It seemed to be ok for 4K and 64K
>> at least and is rather small. And I believe that, compared to Zhengyuan's
>> approach this would have the benefit of not changing the superblock sector. So
>> it would be compatible with previously created superblocks.
> 
> Too ugly to live. Just kmalloc() 4k, allocate a bio on the stack, set it up, and
> submit it with submit_bio_wait(). Use virt_to_page(), don't bother with raw
> pages - you want 4k, not whatever the page size is.
> 

So here is another attempt which does half the proposed changes. And before you
point out that it looks still ugly, let me explain some reasons. The goal for me
would be to have something as small as possible to be acceptable as stable change.
And the part about putting a bio on the stack and using submit_bio_wait: I
believe you meant in read_super to replace the __bread. I was thinking about
that but in the end it seemed to make the change unnecessary big. Whether using
__bread or submit_bio_wait, in both cases and without needing to make more
changes on the write side, read_super has to return the in-memory and on-disk
variant of the superblock. So as long as nothing that is related to __bread is
leaked out of read_super, it is much better than what is there now. And I remain
as small as possible for the delta.

So there is one part of the patch which I find hard to do in a better manner but
is a bit ugly: and that is to sanely free the sb_disk_data blob on all error
paths but not on success when it is assigned to either cache or cached_dev.
Could possibly pass the address of the pointer and then clear it inside the
called functions. Not sure that would be much less strange...


-Stefan



[-- Warning: decoded text below may be mangled, UTF-8 assumed --]
[-- Attachment #1.2: 0001-bcache-read_super-handle-architectures-with-more-tha.patch --]
[-- Type: text/x-diff; name="0001-bcache-read_super-handle-architectures-with-more-tha.patch", Size: 7939 bytes --]

From 391682e2329a6f8608bfc7628b6c8a0afa9a5d98 Mon Sep 17 00:00:00 2001
From: Stefan Bader <stefan.bader@canonical.com>
Date: Tue, 26 Jul 2016 18:47:21 +0200
Subject: [PATCH] bcache: read_super: handle architectures with more than 4k
 page size

There is no guarantee that the superblock which __bread returns in
a buffer_head starts at offset 0 when an architecture has bigger
pages than 4k (the used sector size).

This is the attempt to fix this with the minimum amount of change
by having a buffer allocated with kmalloc that holds the superblock
data as it is on disk.
This buffer can then be passed to bch_map_bio which will set up the
bio_vec correctly.

Signed-off-by: Stefan Bader <stefan.bader@canonical.com>
---
 drivers/md/bcache/bcache.h |  2 ++
 drivers/md/bcache/super.c  | 61 ++++++++++++++++++++++++++--------------------
 2 files changed, 36 insertions(+), 27 deletions(-)

diff --git a/drivers/md/bcache/bcache.h b/drivers/md/bcache/bcache.h
index 6b420a5..3c48927 100644
--- a/drivers/md/bcache/bcache.h
+++ b/drivers/md/bcache/bcache.h
@@ -295,6 +295,7 @@ struct cached_dev {
 	struct cache_sb		sb;
 	struct bio		sb_bio;
 	struct bio_vec		sb_bv[1];
+	void			*sb_disk_data;
 	struct closure		sb_write;
 	struct semaphore	sb_write_mutex;
 
@@ -382,6 +383,7 @@ struct cache {
 	struct cache_sb		sb;
 	struct bio		sb_bio;
 	struct bio_vec		sb_bv[1];
+	void			*sb_disk_data;
 
 	struct kobject		kobj;
 	struct block_device	*bdev;
diff --git a/drivers/md/bcache/super.c b/drivers/md/bcache/super.c
index e169739..f017f69 100644
--- a/drivers/md/bcache/super.c
+++ b/drivers/md/bcache/super.c
@@ -62,7 +62,7 @@ struct workqueue_struct *bcache_wq;
 /* Superblock */
 
 static const char *read_super(struct cache_sb *sb, struct block_device *bdev,
-			      struct page **res)
+			      void *sb_data)
 {
 	const char *err;
 	struct cache_sb *s;
@@ -191,8 +191,7 @@ static const char *read_super(struct cache_sb *sb, struct block_device *bdev,
 	sb->last_mount = get_seconds();
 	err = NULL;
 
-	get_page(bh->b_page);
-	*res = bh->b_page;
+	memcpy(sb_data, bh->b_data, SB_SIZE);
 err:
 	put_bh(bh);
 	return err;
@@ -206,15 +205,15 @@ static void write_bdev_super_endio(struct bio *bio)
 	closure_put(&dc->sb_write);
 }
 
-static void __write_super(struct cache_sb *sb, struct bio *bio)
+static void __write_super(struct cache_sb *sb, struct bio *bio, void *sb_data)
 {
-	struct cache_sb *out = page_address(bio->bi_io_vec[0].bv_page);
+	struct cache_sb *out = sb_data;
 	unsigned i;
 
 	bio->bi_iter.bi_sector	= SB_SECTOR;
 	bio->bi_rw		= REQ_SYNC|REQ_META;
 	bio->bi_iter.bi_size	= SB_SIZE;
-	bch_bio_map(bio, NULL);
+	bch_bio_map(bio, sb_data);
 
 	out->offset		= cpu_to_le64(sb->offset);
 	out->version		= cpu_to_le64(sb->version);
@@ -262,7 +261,7 @@ void bch_write_bdev_super(struct cached_dev *dc, struct closure *parent)
 	bio->bi_private = dc;
 
 	closure_get(cl);
-	__write_super(&dc->sb, bio);
+	__write_super(&dc->sb, bio, dc->sb_disk_data);
 
 	closure_return_with_destructor(cl, bch_write_bdev_super_unlock);
 }
@@ -308,7 +307,7 @@ void bcache_write_super(struct cache_set *c)
 		bio->bi_private = ca;
 
 		closure_get(cl);
-		__write_super(&ca->sb, bio);
+		__write_super(&ca->sb, bio, ca->sb_disk_data);
 	}
 
 	closure_return_with_destructor(cl, bcache_write_super_unlock);
@@ -1045,6 +1044,8 @@ void bch_cached_dev_release(struct kobject *kobj)
 {
 	struct cached_dev *dc = container_of(kobj, struct cached_dev,
 					     disk.kobj);
+
+	kfree(dc->sb_disk_data);
 	kfree(dc);
 	module_put(THIS_MODULE);
 }
@@ -1138,7 +1139,7 @@ static int cached_dev_init(struct cached_dev *dc, unsigned block_size)
 
 /* Cached device - bcache superblock */
 
-static void register_bdev(struct cache_sb *sb, struct page *sb_page,
+static void register_bdev(struct cache_sb *sb, void *sb_disk_data,
 				 struct block_device *bdev,
 				 struct cached_dev *dc)
 {
@@ -1152,9 +1153,8 @@ static void register_bdev(struct cache_sb *sb, struct page *sb_page,
 
 	bio_init(&dc->sb_bio);
 	dc->sb_bio.bi_max_vecs	= 1;
-	dc->sb_bio.bi_io_vec	= dc->sb_bio.bi_inline_vecs;
-	dc->sb_bio.bi_io_vec[0].bv_page = sb_page;
-	get_page(sb_page);
+	dc->sb_bio.bi_io_vec	= &dc->sb_bv[0];
+	dc->sb_disk_data	= sb_disk_data;
 
 	if (cached_dev_init(dc, sb->block_size << 9))
 		goto err;
@@ -1179,6 +1179,7 @@ static void register_bdev(struct cache_sb *sb, struct page *sb_page,
 	return;
 err:
 	pr_notice("error opening %s: %s", bdevname(bdev, name), err);
+	kfree(sb_disk_data);
 	bcache_device_stop(&dc->disk);
 }
 
@@ -1793,8 +1794,7 @@ void bch_cache_release(struct kobject *kobj)
 	for (i = 0; i < RESERVE_NR; i++)
 		free_fifo(&ca->free[i]);
 
-	if (ca->sb_bio.bi_inline_vecs[0].bv_page)
-		put_page(ca->sb_bio.bi_io_vec[0].bv_page);
+	kfree(ca->sb_disk_data);
 
 	if (!IS_ERR_OR_NULL(ca->bdev))
 		blkdev_put(ca->bdev, FMODE_READ|FMODE_WRITE|FMODE_EXCL);
@@ -1838,7 +1838,7 @@ static int cache_alloc(struct cache_sb *sb, struct cache *ca)
 	return 0;
 }
 
-static int register_cache(struct cache_sb *sb, struct page *sb_page,
+static int register_cache(struct cache_sb *sb, void *sb_disk_data,
 				struct block_device *bdev, struct cache *ca)
 {
 	char name[BDEVNAME_SIZE];
@@ -1851,16 +1851,17 @@ static int register_cache(struct cache_sb *sb, struct page *sb_page,
 
 	bio_init(&ca->sb_bio);
 	ca->sb_bio.bi_max_vecs	= 1;
-	ca->sb_bio.bi_io_vec	= ca->sb_bio.bi_inline_vecs;
-	ca->sb_bio.bi_io_vec[0].bv_page = sb_page;
-	get_page(sb_page);
+	ca->sb_bio.bi_io_vec	= &ca->sb_bv[0];
+	ca->sb_disk_data	= sb_disk_data;
 
 	if (blk_queue_discard(bdev_get_queue(ca->bdev)))
 		ca->discard = CACHE_DISCARD(&ca->sb);
 
 	ret = cache_alloc(sb, ca);
-	if (ret != 0)
+	if (ret != 0) {
+		err = "error calling cache_alloc";
 		goto err;
+	}
 
 	if (kobject_add(&ca->kobj, &part_to_dev(bdev->bd_part)->kobj, "bcache")) {
 		err = "error calling kobject_add";
@@ -1883,8 +1884,10 @@ out:
 	kobject_put(&ca->kobj);
 
 err:
-	if (err)
+	if (err) {
 		pr_notice("error opening %s: %s", bdevname(bdev, name), err);
+		kfree(sb_disk_data);
+	}
 
 	return ret;
 }
@@ -1935,13 +1938,14 @@ static ssize_t register_bcache(struct kobject *k, struct kobj_attribute *attr,
 	char *path = NULL;
 	struct cache_sb *sb = NULL;
 	struct block_device *bdev = NULL;
-	struct page *sb_page = NULL;
+	void *sb_disk_data = NULL;
 
 	if (!try_module_get(THIS_MODULE))
 		return -EBUSY;
 
 	if (!(path = kstrndup(buffer, size, GFP_KERNEL)) ||
-	    !(sb = kmalloc(sizeof(struct cache_sb), GFP_KERNEL)))
+	    !(sb = kmalloc(sizeof(struct cache_sb), GFP_KERNEL)) ||
+	    !(sb_disk_data = kmalloc(SB_SIZE, GFP_KERNEL)))
 		goto err;
 
 	err = "failed to open device";
@@ -1967,7 +1971,7 @@ static ssize_t register_bcache(struct kobject *k, struct kobj_attribute *attr,
 	if (set_blocksize(bdev, 4096))
 		goto err_close;
 
-	err = read_super(sb, bdev, &sb_page);
+	err = read_super(sb, bdev, sb_disk_data);
 	if (err)
 		goto err_close;
 
@@ -1977,20 +1981,23 @@ static ssize_t register_bcache(struct kobject *k, struct kobj_attribute *attr,
 			goto err_close;
 
 		mutex_lock(&bch_register_lock);
-		register_bdev(sb, sb_page, bdev, dc);
+		register_bdev(sb, sb_disk_data, bdev, dc);
+		sb_disk_data = NULL; /* Consumed or freed in register call */
 		mutex_unlock(&bch_register_lock);
 	} else {
 		struct cache *ca = kzalloc(sizeof(*ca), GFP_KERNEL);
 		if (!ca)
 			goto err_close;
 
-		if (register_cache(sb, sb_page, bdev, ca) != 0)
+		if (register_cache(sb, sb_disk_data, bdev, ca) != 0) {
+			sb_disk_data = NULL;
 			goto err_close;
+		}
+		sb_disk_data = NULL;
 	}
 out:
-	if (sb_page)
-		put_page(sb_page);
 	kfree(sb);
+	kfree(sb_disk_data);
 	kfree(path);
 	module_put(THIS_MODULE);
 	return ret;
-- 
1.9.1


[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 836 bytes --]

^ permalink raw reply related	[flat|nested] 8+ messages in thread

* Re: bcache super block corruption with non 4k pages
  2016-07-27 15:16         ` Stefan Bader
@ 2016-07-28  5:55           ` Kent Overstreet
  2016-07-28 16:27             ` Stefan Bader
  0 siblings, 1 reply; 8+ messages in thread
From: Kent Overstreet @ 2016-07-28  5:55 UTC (permalink / raw)
  To: Stefan Bader
  Cc: linux-bcache, dm-devel, Linux Kernel Mailing List,
	liuzhengyuang521, bcache, apw

On Wed, Jul 27, 2016 at 05:16:36PM +0200, Stefan Bader wrote:
> So here is another attempt which does half the proposed changes. And before you
> point out that it looks still ugly, let me explain some reasons. The goal for me
> would be to have something as small as possible to be acceptable as stable change.
> And the part about putting a bio on the stack and using submit_bio_wait: I
> believe you meant in read_super to replace the __bread. I was thinking about
> that but in the end it seemed to make the change unnecessary big. Whether using
> __bread or submit_bio_wait, in both cases and without needing to make more
> changes on the write side, read_super has to return the in-memory and on-disk
> variant of the superblock. So as long as nothing that is related to __bread is
> leaked out of read_super, it is much better than what is there now. And I remain
> as small as possible for the delta.

I like that approach much better. I suppose it's not _strictly_ necessary to rip
out the __bread()...

Only other comment is that you shouldn't have to pass the buffer to
__write_super() - I'd just move the bch_bio_map() call to when the struct
cache/cached_dev is being initialized (or rip it out and initialize the bvec by
hand) and never touch it after that.

> So there is one part of the patch which I find hard to do in a better manner but
> is a bit ugly: and that is to sanely free the sb_disk_data blob on all error
> paths but not on success when it is assigned to either cache or cached_dev.
> Could possibly pass the address of the pointer and then clear it inside the
> called functions. Not sure that would be much less strange...

Yeah that is a hassle - that's why in the 4k superblocks patch in bcache-dev I
added that "disk_sb" struct - it owns the buffer and random other crap. You
could read that patch for ideas if you want, look at how it transfers ownership
of the disk_sb. 

> From 391682e2329a6f8608bfc7628b6c8a0afa9a5d98 Mon Sep 17 00:00:00 2001
> From: Stefan Bader <stefan.bader@canonical.com>
> Date: Tue, 26 Jul 2016 18:47:21 +0200
> Subject: [PATCH] bcache: read_super: handle architectures with more than 4k
>  page size
> 
> There is no guarantee that the superblock which __bread returns in
> a buffer_head starts at offset 0 when an architecture has bigger
> pages than 4k (the used sector size).
> 
> This is the attempt to fix this with the minimum amount of change
> by having a buffer allocated with kmalloc that holds the superblock
> data as it is on disk.
> This buffer can then be passed to bch_map_bio which will set up the
> bio_vec correctly.
> 
> Signed-off-by: Stefan Bader <stefan.bader@canonical.com>
> ---
>  drivers/md/bcache/bcache.h |  2 ++
>  drivers/md/bcache/super.c  | 61 ++++++++++++++++++++++++++--------------------
>  2 files changed, 36 insertions(+), 27 deletions(-)
> 
> diff --git a/drivers/md/bcache/bcache.h b/drivers/md/bcache/bcache.h
> index 6b420a5..3c48927 100644
> --- a/drivers/md/bcache/bcache.h
> +++ b/drivers/md/bcache/bcache.h
> @@ -295,6 +295,7 @@ struct cached_dev {
>  	struct cache_sb		sb;
>  	struct bio		sb_bio;
>  	struct bio_vec		sb_bv[1];
> +	void			*sb_disk_data;
>  	struct closure		sb_write;
>  	struct semaphore	sb_write_mutex;
>  
> @@ -382,6 +383,7 @@ struct cache {
>  	struct cache_sb		sb;
>  	struct bio		sb_bio;
>  	struct bio_vec		sb_bv[1];
> +	void			*sb_disk_data;
>  
>  	struct kobject		kobj;
>  	struct block_device	*bdev;
> diff --git a/drivers/md/bcache/super.c b/drivers/md/bcache/super.c
> index e169739..f017f69 100644
> --- a/drivers/md/bcache/super.c
> +++ b/drivers/md/bcache/super.c
> @@ -62,7 +62,7 @@ struct workqueue_struct *bcache_wq;
>  /* Superblock */
>  
>  static const char *read_super(struct cache_sb *sb, struct block_device *bdev,
> -			      struct page **res)
> +			      void *sb_data)
>  {
>  	const char *err;
>  	struct cache_sb *s;
> @@ -191,8 +191,7 @@ static const char *read_super(struct cache_sb *sb, struct block_device *bdev,
>  	sb->last_mount = get_seconds();
>  	err = NULL;
>  
> -	get_page(bh->b_page);
> -	*res = bh->b_page;
> +	memcpy(sb_data, bh->b_data, SB_SIZE);
>  err:
>  	put_bh(bh);
>  	return err;
> @@ -206,15 +205,15 @@ static void write_bdev_super_endio(struct bio *bio)
>  	closure_put(&dc->sb_write);
>  }
>  
> -static void __write_super(struct cache_sb *sb, struct bio *bio)
> +static void __write_super(struct cache_sb *sb, struct bio *bio, void *sb_data)
>  {
> -	struct cache_sb *out = page_address(bio->bi_io_vec[0].bv_page);
> +	struct cache_sb *out = sb_data;
>  	unsigned i;
>  
>  	bio->bi_iter.bi_sector	= SB_SECTOR;
>  	bio->bi_rw		= REQ_SYNC|REQ_META;
>  	bio->bi_iter.bi_size	= SB_SIZE;
> -	bch_bio_map(bio, NULL);
> +	bch_bio_map(bio, sb_data);
>  
>  	out->offset		= cpu_to_le64(sb->offset);
>  	out->version		= cpu_to_le64(sb->version);
> @@ -262,7 +261,7 @@ void bch_write_bdev_super(struct cached_dev *dc, struct closure *parent)
>  	bio->bi_private = dc;
>  
>  	closure_get(cl);
> -	__write_super(&dc->sb, bio);
> +	__write_super(&dc->sb, bio, dc->sb_disk_data);
>  
>  	closure_return_with_destructor(cl, bch_write_bdev_super_unlock);
>  }
> @@ -308,7 +307,7 @@ void bcache_write_super(struct cache_set *c)
>  		bio->bi_private = ca;
>  
>  		closure_get(cl);
> -		__write_super(&ca->sb, bio);
> +		__write_super(&ca->sb, bio, ca->sb_disk_data);
>  	}
>  
>  	closure_return_with_destructor(cl, bcache_write_super_unlock);
> @@ -1045,6 +1044,8 @@ void bch_cached_dev_release(struct kobject *kobj)
>  {
>  	struct cached_dev *dc = container_of(kobj, struct cached_dev,
>  					     disk.kobj);
> +
> +	kfree(dc->sb_disk_data);
>  	kfree(dc);
>  	module_put(THIS_MODULE);
>  }
> @@ -1138,7 +1139,7 @@ static int cached_dev_init(struct cached_dev *dc, unsigned block_size)
>  
>  /* Cached device - bcache superblock */
>  
> -static void register_bdev(struct cache_sb *sb, struct page *sb_page,
> +static void register_bdev(struct cache_sb *sb, void *sb_disk_data,
>  				 struct block_device *bdev,
>  				 struct cached_dev *dc)
>  {
> @@ -1152,9 +1153,8 @@ static void register_bdev(struct cache_sb *sb, struct page *sb_page,
>  
>  	bio_init(&dc->sb_bio);
>  	dc->sb_bio.bi_max_vecs	= 1;
> -	dc->sb_bio.bi_io_vec	= dc->sb_bio.bi_inline_vecs;
> -	dc->sb_bio.bi_io_vec[0].bv_page = sb_page;
> -	get_page(sb_page);
> +	dc->sb_bio.bi_io_vec	= &dc->sb_bv[0];
> +	dc->sb_disk_data	= sb_disk_data;
>  
>  	if (cached_dev_init(dc, sb->block_size << 9))
>  		goto err;
> @@ -1179,6 +1179,7 @@ static void register_bdev(struct cache_sb *sb, struct page *sb_page,
>  	return;
>  err:
>  	pr_notice("error opening %s: %s", bdevname(bdev, name), err);
> +	kfree(sb_disk_data);
>  	bcache_device_stop(&dc->disk);
>  }
>  
> @@ -1793,8 +1794,7 @@ void bch_cache_release(struct kobject *kobj)
>  	for (i = 0; i < RESERVE_NR; i++)
>  		free_fifo(&ca->free[i]);
>  
> -	if (ca->sb_bio.bi_inline_vecs[0].bv_page)
> -		put_page(ca->sb_bio.bi_io_vec[0].bv_page);
> +	kfree(ca->sb_disk_data);
>  
>  	if (!IS_ERR_OR_NULL(ca->bdev))
>  		blkdev_put(ca->bdev, FMODE_READ|FMODE_WRITE|FMODE_EXCL);
> @@ -1838,7 +1838,7 @@ static int cache_alloc(struct cache_sb *sb, struct cache *ca)
>  	return 0;
>  }
>  
> -static int register_cache(struct cache_sb *sb, struct page *sb_page,
> +static int register_cache(struct cache_sb *sb, void *sb_disk_data,
>  				struct block_device *bdev, struct cache *ca)
>  {
>  	char name[BDEVNAME_SIZE];
> @@ -1851,16 +1851,17 @@ static int register_cache(struct cache_sb *sb, struct page *sb_page,
>  
>  	bio_init(&ca->sb_bio);
>  	ca->sb_bio.bi_max_vecs	= 1;
> -	ca->sb_bio.bi_io_vec	= ca->sb_bio.bi_inline_vecs;
> -	ca->sb_bio.bi_io_vec[0].bv_page = sb_page;
> -	get_page(sb_page);
> +	ca->sb_bio.bi_io_vec	= &ca->sb_bv[0];
> +	ca->sb_disk_data	= sb_disk_data;
>  
>  	if (blk_queue_discard(bdev_get_queue(ca->bdev)))
>  		ca->discard = CACHE_DISCARD(&ca->sb);
>  
>  	ret = cache_alloc(sb, ca);
> -	if (ret != 0)
> +	if (ret != 0) {
> +		err = "error calling cache_alloc";
>  		goto err;
> +	}
>  
>  	if (kobject_add(&ca->kobj, &part_to_dev(bdev->bd_part)->kobj, "bcache")) {
>  		err = "error calling kobject_add";
> @@ -1883,8 +1884,10 @@ out:
>  	kobject_put(&ca->kobj);
>  
>  err:
> -	if (err)
> +	if (err) {
>  		pr_notice("error opening %s: %s", bdevname(bdev, name), err);
> +		kfree(sb_disk_data);
> +	}
>  
>  	return ret;
>  }
> @@ -1935,13 +1938,14 @@ static ssize_t register_bcache(struct kobject *k, struct kobj_attribute *attr,
>  	char *path = NULL;
>  	struct cache_sb *sb = NULL;
>  	struct block_device *bdev = NULL;
> -	struct page *sb_page = NULL;
> +	void *sb_disk_data = NULL;
>  
>  	if (!try_module_get(THIS_MODULE))
>  		return -EBUSY;
>  
>  	if (!(path = kstrndup(buffer, size, GFP_KERNEL)) ||
> -	    !(sb = kmalloc(sizeof(struct cache_sb), GFP_KERNEL)))
> +	    !(sb = kmalloc(sizeof(struct cache_sb), GFP_KERNEL)) ||
> +	    !(sb_disk_data = kmalloc(SB_SIZE, GFP_KERNEL)))
>  		goto err;
>  
>  	err = "failed to open device";
> @@ -1967,7 +1971,7 @@ static ssize_t register_bcache(struct kobject *k, struct kobj_attribute *attr,
>  	if (set_blocksize(bdev, 4096))
>  		goto err_close;
>  
> -	err = read_super(sb, bdev, &sb_page);
> +	err = read_super(sb, bdev, sb_disk_data);
>  	if (err)
>  		goto err_close;
>  
> @@ -1977,20 +1981,23 @@ static ssize_t register_bcache(struct kobject *k, struct kobj_attribute *attr,
>  			goto err_close;
>  
>  		mutex_lock(&bch_register_lock);
> -		register_bdev(sb, sb_page, bdev, dc);
> +		register_bdev(sb, sb_disk_data, bdev, dc);
> +		sb_disk_data = NULL; /* Consumed or freed in register call */
>  		mutex_unlock(&bch_register_lock);
>  	} else {
>  		struct cache *ca = kzalloc(sizeof(*ca), GFP_KERNEL);
>  		if (!ca)
>  			goto err_close;
>  
> -		if (register_cache(sb, sb_page, bdev, ca) != 0)
> +		if (register_cache(sb, sb_disk_data, bdev, ca) != 0) {
> +			sb_disk_data = NULL;
>  			goto err_close;
> +		}
> +		sb_disk_data = NULL;
>  	}
>  out:
> -	if (sb_page)
> -		put_page(sb_page);
>  	kfree(sb);
> +	kfree(sb_disk_data);
>  	kfree(path);
>  	module_put(THIS_MODULE);
>  	return ret;
> -- 
> 1.9.1
> 

^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: bcache super block corruption with non 4k pages
  2016-07-28  5:55           ` Kent Overstreet
@ 2016-07-28 16:27             ` Stefan Bader
  2016-08-04 10:03               ` Stefan Bader
  0 siblings, 1 reply; 8+ messages in thread
From: Stefan Bader @ 2016-07-28 16:27 UTC (permalink / raw)
  To: Kent Overstreet
  Cc: linux-bcache, dm-devel, Linux Kernel Mailing List,
	liuzhengyuang521, bcache, apw, Stefan Bader


[-- Attachment #1.1: Type: text/plain, Size: 3390 bytes --]

On 28.07.2016 07:55, Kent Overstreet wrote:
> On Wed, Jul 27, 2016 at 05:16:36PM +0200, Stefan Bader wrote:
>> So here is another attempt which does half the proposed changes. And before you
>> point out that it looks still ugly, let me explain some reasons. The goal for me
>> would be to have something as small as possible to be acceptable as stable change.
>> And the part about putting a bio on the stack and using submit_bio_wait: I
>> believe you meant in read_super to replace the __bread. I was thinking about
>> that but in the end it seemed to make the change unnecessary big. Whether using
>> __bread or submit_bio_wait, in both cases and without needing to make more
>> changes on the write side, read_super has to return the in-memory and on-disk
>> variant of the superblock. So as long as nothing that is related to __bread is
>> leaked out of read_super, it is much better than what is there now. And I remain
>> as small as possible for the delta.
> 
> I like that approach much better. I suppose it's not _strictly_ necessary to rip
> out the __bread()...
> 
> Only other comment is that you shouldn't have to pass the buffer to
> __write_super() - I'd just move the bch_bio_map() call to when the struct
> cache/cached_dev is being initialized (or rip it out and initialize the bvec by
> hand) and never touch it after that.

Hm, doing that would save three simple changes to add a new argument to that
private functions at the cost of haven the map call twice and a (more)
complicated calculation of the
> 
>> So there is one part of the patch which I find hard to do in a better manner but
>> is a bit ugly: and that is to sanely free the sb_disk_data blob on all error
>> paths but not on success when it is assigned to either cache or cached_dev.
>> Could possibly pass the address of the pointer and then clear it inside the
>> called functions. Not sure that would be much less strange...
> 
> Yeah that is a hassle - that's why in the 4k superblocks patch in bcache-dev I
> added that "disk_sb" struct - it owns the buffer and random other crap. You
> could read that patch for ideas if you want, look at how it transfers ownership
> of the disk_sb. 
> 

I had a look but it felt like I could get into too much follow-up changes going
that path. But I think I got it simpler now. One note about that area: both
register calls can run into problems but only one actually returns that status.
And both do not seem to free the allocated structures (cache or cache_dev). It
is at least not obvious whether this is ever done.
I working around this by moving the assignment of the buffer page and the
mapping to a place where an error exit no longer is possible. So the release
functions will only see a non NULL pointer if things went well (reality required
to revise that a little bit as one of the register calls that can fail is
actually doing the write).

So now there is just one oddness: when I am testing this (unfortunately right
now I only have a normal 4k case), after calling make-bache with cache and
backing device, this all looks great and debugging shows the __write_super being
called. But reading the from userspace will return the old data until I stop the
bcache device and unregister the cache (which does not show any further writes).
I cannot decide what I should think here...

-Stefan


[-- Warning: decoded text below may be mangled, UTF-8 assumed --]
[-- Attachment #1.2: 0001-bcache-read_super-handle-architectures-with-more-tha.patch --]
[-- Type: text/x-diff; name="0001-bcache-read_super-handle-architectures-with-more-tha.patch", Size: 7910 bytes --]

From 982a4ff25d4dbd114432b4b2f908182995f402a0 Mon Sep 17 00:00:00 2001
From: Stefan Bader <stefan.bader@canonical.com>
Date: Tue, 26 Jul 2016 18:47:21 +0200
Subject: [PATCH] bcache: read_super: handle architectures with more than 4k
 page size

There is no guarantee that the superblock which __bread returns in
a buffer_head starts at offset 0 when an architecture has bigger
pages than 4k (the used sector size).

This is the attempt to fix this with the minimum amount of change
by having a buffer allocated with kmalloc that holds the superblock
data as it is on disk.
This buffer can then be passed to bch_map_bio which will set up the
bio_vec correctly.

Signed-off-by: Stefan Bader <stefan.bader@canonical.com>
---
 drivers/md/bcache/bcache.h |  2 ++
 drivers/md/bcache/super.c  | 58 ++++++++++++++++++++++++++++------------------
 2 files changed, 37 insertions(+), 23 deletions(-)

diff --git a/drivers/md/bcache/bcache.h b/drivers/md/bcache/bcache.h
index 6b420a5..3c48927 100644
--- a/drivers/md/bcache/bcache.h
+++ b/drivers/md/bcache/bcache.h
@@ -295,6 +295,7 @@ struct cached_dev {
 	struct cache_sb		sb;
 	struct bio		sb_bio;
 	struct bio_vec		sb_bv[1];
+	void			*sb_disk_data;
 	struct closure		sb_write;
 	struct semaphore	sb_write_mutex;
 
@@ -382,6 +383,7 @@ struct cache {
 	struct cache_sb		sb;
 	struct bio		sb_bio;
 	struct bio_vec		sb_bv[1];
+	void			*sb_disk_data;
 
 	struct kobject		kobj;
 	struct block_device	*bdev;
diff --git a/drivers/md/bcache/super.c b/drivers/md/bcache/super.c
index e169739..14f3304 100644
--- a/drivers/md/bcache/super.c
+++ b/drivers/md/bcache/super.c
@@ -62,7 +62,7 @@ struct workqueue_struct *bcache_wq;
 /* Superblock */
 
 static const char *read_super(struct cache_sb *sb, struct block_device *bdev,
-			      struct page **res)
+			      void *sb_data)
 {
 	const char *err;
 	struct cache_sb *s;
@@ -191,8 +191,7 @@ static const char *read_super(struct cache_sb *sb, struct block_device *bdev,
 	sb->last_mount = get_seconds();
 	err = NULL;
 
-	get_page(bh->b_page);
-	*res = bh->b_page;
+	memcpy(sb_data, bh->b_data, SB_SIZE);
 err:
 	put_bh(bh);
 	return err;
@@ -208,13 +207,13 @@ static void write_bdev_super_endio(struct bio *bio)
 
 static void __write_super(struct cache_sb *sb, struct bio *bio)
 {
-	struct cache_sb *out = page_address(bio->bi_io_vec[0].bv_page);
+	struct cache_sb *out = page_address(bio->bi_io_vec[0].bv_page) +
+			       bio->bi_io_vec[0].bv_offset;
 	unsigned i;
 
 	bio->bi_iter.bi_sector	= SB_SECTOR;
 	bio->bi_rw		= REQ_SYNC|REQ_META;
 	bio->bi_iter.bi_size	= SB_SIZE;
-	bch_bio_map(bio, NULL);
 
 	out->offset		= cpu_to_le64(sb->offset);
 	out->version		= cpu_to_le64(sb->version);
@@ -1045,6 +1044,8 @@ void bch_cached_dev_release(struct kobject *kobj)
 {
 	struct cached_dev *dc = container_of(kobj, struct cached_dev,
 					     disk.kobj);
+
+	kfree(dc->sb_disk_data);
 	kfree(dc);
 	module_put(THIS_MODULE);
 }
@@ -1138,7 +1139,7 @@ static int cached_dev_init(struct cached_dev *dc, unsigned block_size)
 
 /* Cached device - bcache superblock */
 
-static void register_bdev(struct cache_sb *sb, struct page *sb_page,
+static void register_bdev(struct cache_sb *sb, void *sb_disk_data,
 				 struct block_device *bdev,
 				 struct cached_dev *dc)
 {
@@ -1152,9 +1153,7 @@ static void register_bdev(struct cache_sb *sb, struct page *sb_page,
 
 	bio_init(&dc->sb_bio);
 	dc->sb_bio.bi_max_vecs	= 1;
-	dc->sb_bio.bi_io_vec	= dc->sb_bio.bi_inline_vecs;
-	dc->sb_bio.bi_io_vec[0].bv_page = sb_page;
-	get_page(sb_page);
+	dc->sb_bio.bi_io_vec	= &dc->sb_bv[0];
 
 	if (cached_dev_init(dc, sb->block_size << 9))
 		goto err;
@@ -1168,6 +1167,11 @@ static void register_bdev(struct cache_sb *sb, struct page *sb_page,
 
 	pr_info("registered backing device %s", bdevname(bdev, name));
 
+	/* Do assignment and mapping late, cannot error after this */
+	dc->sb_disk_data = sb_disk_data;
+	dc->sb_bio.bi_iter.bi_size = SB_SIZE;
+	bch_bio_map(&dc->sb_bio, sb_disk_data);
+
 	list_add(&dc->list, &uncached_devices);
 	list_for_each_entry(c, &bch_cache_sets, list)
 		bch_cached_dev_attach(dc, c);
@@ -1179,6 +1183,7 @@ static void register_bdev(struct cache_sb *sb, struct page *sb_page,
 	return;
 err:
 	pr_notice("error opening %s: %s", bdevname(bdev, name), err);
+	kfree(sb_disk_data);
 	bcache_device_stop(&dc->disk);
 }
 
@@ -1793,8 +1798,7 @@ void bch_cache_release(struct kobject *kobj)
 	for (i = 0; i < RESERVE_NR; i++)
 		free_fifo(&ca->free[i]);
 
-	if (ca->sb_bio.bi_inline_vecs[0].bv_page)
-		put_page(ca->sb_bio.bi_io_vec[0].bv_page);
+	kfree(ca->sb_disk_data);
 
 	if (!IS_ERR_OR_NULL(ca->bdev))
 		blkdev_put(ca->bdev, FMODE_READ|FMODE_WRITE|FMODE_EXCL);
@@ -1838,7 +1842,7 @@ static int cache_alloc(struct cache_sb *sb, struct cache *ca)
 	return 0;
 }
 
-static int register_cache(struct cache_sb *sb, struct page *sb_page,
+static int register_cache(struct cache_sb *sb, void *sb_disk_data,
 				struct block_device *bdev, struct cache *ca)
 {
 	char name[BDEVNAME_SIZE];
@@ -1851,16 +1855,16 @@ static int register_cache(struct cache_sb *sb, struct page *sb_page,
 
 	bio_init(&ca->sb_bio);
 	ca->sb_bio.bi_max_vecs	= 1;
-	ca->sb_bio.bi_io_vec	= ca->sb_bio.bi_inline_vecs;
-	ca->sb_bio.bi_io_vec[0].bv_page = sb_page;
-	get_page(sb_page);
+	ca->sb_bio.bi_io_vec	= &ca->sb_bv[0];
 
 	if (blk_queue_discard(bdev_get_queue(ca->bdev)))
 		ca->discard = CACHE_DISCARD(&ca->sb);
 
 	ret = cache_alloc(sb, ca);
-	if (ret != 0)
+	if (ret != 0) {
+		err = "error calling cache_alloc";
 		goto err;
+	}
 
 	if (kobject_add(&ca->kobj, &part_to_dev(bdev->bd_part)->kobj, "bcache")) {
 		err = "error calling kobject_add";
@@ -1868,11 +1872,17 @@ static int register_cache(struct cache_sb *sb, struct page *sb_page,
 		goto out;
 	}
 
+	/* Do assignment and mapping late */
+	ca->sb_disk_data = sb_disk_data;
+	ca->sb_bio.bi_iter.bi_size = SB_SIZE;
+	bch_bio_map(&ca->sb_bio, sb_disk_data);
+
 	mutex_lock(&bch_register_lock);
 	err = register_cache_set(ca);
 	mutex_unlock(&bch_register_lock);
 
 	if (err) {
+		ca->sb_disk_data = NULL;
 		ret = -ENODEV;
 		goto out;
 	}
@@ -1935,13 +1945,14 @@ static ssize_t register_bcache(struct kobject *k, struct kobj_attribute *attr,
 	char *path = NULL;
 	struct cache_sb *sb = NULL;
 	struct block_device *bdev = NULL;
-	struct page *sb_page = NULL;
+	void *sb_disk_data = NULL;
 
 	if (!try_module_get(THIS_MODULE))
 		return -EBUSY;
 
 	if (!(path = kstrndup(buffer, size, GFP_KERNEL)) ||
-	    !(sb = kmalloc(sizeof(struct cache_sb), GFP_KERNEL)))
+	    !(sb = kmalloc(sizeof(struct cache_sb), GFP_KERNEL)) ||
+	    !(sb_disk_data = kmalloc(SB_SIZE, GFP_KERNEL)))
 		goto err;
 
 	err = "failed to open device";
@@ -1967,7 +1978,7 @@ static ssize_t register_bcache(struct kobject *k, struct kobj_attribute *attr,
 	if (set_blocksize(bdev, 4096))
 		goto err_close;
 
-	err = read_super(sb, bdev, &sb_page);
+	err = read_super(sb, bdev, sb_disk_data);
 	if (err)
 		goto err_close;
 
@@ -1977,19 +1988,20 @@ static ssize_t register_bcache(struct kobject *k, struct kobj_attribute *attr,
 			goto err_close;
 
 		mutex_lock(&bch_register_lock);
-		register_bdev(sb, sb_page, bdev, dc);
+		register_bdev(sb, sb_disk_data, bdev, dc);
+		sb_disk_data = NULL; /* Consumed or freed in register call */
 		mutex_unlock(&bch_register_lock);
 	} else {
 		struct cache *ca = kzalloc(sizeof(*ca), GFP_KERNEL);
 		if (!ca)
 			goto err_close;
 
-		if (register_cache(sb, sb_page, bdev, ca) != 0)
+		if (register_cache(sb, sb_disk_data, bdev, ca) != 0)
 			goto err_close;
+		sb_disk_data = NULL;
 	}
 out:
-	if (sb_page)
-		put_page(sb_page);
+	kfree(sb_disk_data);
 	kfree(sb);
 	kfree(path);
 	module_put(THIS_MODULE);
-- 
1.9.1


[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 836 bytes --]

^ permalink raw reply related	[flat|nested] 8+ messages in thread

* Re: bcache super block corruption with non 4k pages
  2016-07-28 16:27             ` Stefan Bader
@ 2016-08-04 10:03               ` Stefan Bader
  0 siblings, 0 replies; 8+ messages in thread
From: Stefan Bader @ 2016-08-04 10:03 UTC (permalink / raw)
  To: Kent Overstreet
  Cc: linux-bcache, dm-devel, Linux Kernel Mailing List,
	liuzhengyuang521, bcache, apw, Stefan Bader


[-- Attachment #1.1: Type: text/plain, Size: 4060 bytes --]

On 28.07.2016 18:27, Stefan Bader wrote:
> On 28.07.2016 07:55, Kent Overstreet wrote:
>> On Wed, Jul 27, 2016 at 05:16:36PM +0200, Stefan Bader wrote:
>>> So here is another attempt which does half the proposed changes. And before you
>>> point out that it looks still ugly, let me explain some reasons. The goal for me
>>> would be to have something as small as possible to be acceptable as stable change.
>>> And the part about putting a bio on the stack and using submit_bio_wait: I
>>> believe you meant in read_super to replace the __bread. I was thinking about
>>> that but in the end it seemed to make the change unnecessary big. Whether using
>>> __bread or submit_bio_wait, in both cases and without needing to make more
>>> changes on the write side, read_super has to return the in-memory and on-disk
>>> variant of the superblock. So as long as nothing that is related to __bread is
>>> leaked out of read_super, it is much better than what is there now. And I remain
>>> as small as possible for the delta.
>>
>> I like that approach much better. I suppose it's not _strictly_ necessary to rip
>> out the __bread()...
>>
>> Only other comment is that you shouldn't have to pass the buffer to
>> __write_super() - I'd just move the bch_bio_map() call to when the struct
>> cache/cached_dev is being initialized (or rip it out and initialize the bvec by
>> hand) and never touch it after that.
> 
> Hm, doing that would save three simple changes to add a new argument to that
> private functions at the cost of haven the map call twice and a (more)
> complicated calculation of the
>>
>>> So there is one part of the patch which I find hard to do in a better manner but
>>> is a bit ugly: and that is to sanely free the sb_disk_data blob on all error
>>> paths but not on success when it is assigned to either cache or cached_dev.
>>> Could possibly pass the address of the pointer and then clear it inside the
>>> called functions. Not sure that would be much less strange...
>>
>> Yeah that is a hassle - that's why in the 4k superblocks patch in bcache-dev I
>> added that "disk_sb" struct - it owns the buffer and random other crap. You
>> could read that patch for ideas if you want, look at how it transfers ownership
>> of the disk_sb. 
>>
> 
> I had a look but it felt like I could get into too much follow-up changes going
> that path. But I think I got it simpler now. One note about that area: both
> register calls can run into problems but only one actually returns that status.
> And both do not seem to free the allocated structures (cache or cache_dev). It
> is at least not obvious whether this is ever done.
> I working around this by moving the assignment of the buffer page and the
> mapping to a place where an error exit no longer is possible. So the release
> functions will only see a non NULL pointer if things went well (reality required
> to revise that a little bit as one of the register calls that can fail is
> actually doing the write).
> 
> So now there is just one oddness: when I am testing this (unfortunately right
> now I only have a normal 4k case), after calling make-bache with cache and
> backing device, this all looks great and debugging shows the __write_super being
> called. But reading the from userspace will return the old data until I stop the
> bcache device and unregister the cache (which does not show any further writes).
> I cannot decide what I should think here...

So its __bread doing cached operations and after not (ab)using the page from
there, the writes will never cause the cached old result to go away until the
device gets closed. So (hopefully) last apptempt. Also drop __bread and use
submit_bio_wait. This seems to work. On x64 and ppc64le with 64k pages.

So the only pain with this is that upstream seems to have re-organized the
submit_bio functions in the upcoming v4.7-rc2 kernel. So it is without rework
only applicable to older kernels. I was using a 4.4 one, 4.6 in theory should be
ok...

-Stefan


[-- Warning: decoded text below may be mangled, UTF-8 assumed --]
[-- Attachment #1.2: 0001-bcache-read_super-handle-architectures-with-more-tha.patch --]
[-- Type: text/x-diff; name="0001-bcache-read_super-handle-architectures-with-more-tha.patch", Size: 8984 bytes --]

From 91bbee69d19b59ec3a08a2c4e7c92747a23b3956 Mon Sep 17 00:00:00 2001
From: Stefan Bader <stefan.bader@canonical.com>
Date: Tue, 26 Jul 2016 18:47:21 +0200
Subject: [PATCH] bcache: read_super: handle architectures with more than 4k
 page size

There is no guarantee that the superblock which __bread returns in
a buffer_head starts at offset 0 when an architecture has bigger
pages than 4k (the used sector size).

This is the attempt to fix this with the minimum amount of change
by having a buffer allocated with kmalloc that holds the superblock
data as it is on disk.
This buffer can then be passed to bch_map_bio which will set up the
bio_vec correctly.
Also get rid of __bread as this is doing cached IO and the only
reason this did not cause weird effects was the direct use of the
page cache page that was returned by __bread.

Signed-off-by: Stefan Bader <stefan.bader@canonical.com>
---
 drivers/md/bcache/bcache.h |  2 ++
 drivers/md/bcache/super.c  | 76 +++++++++++++++++++++++++++++-----------------
 2 files changed, 50 insertions(+), 28 deletions(-)

diff --git a/drivers/md/bcache/bcache.h b/drivers/md/bcache/bcache.h
index 6b420a5..3c48927 100644
--- a/drivers/md/bcache/bcache.h
+++ b/drivers/md/bcache/bcache.h
@@ -295,6 +295,7 @@ struct cached_dev {
 	struct cache_sb		sb;
 	struct bio		sb_bio;
 	struct bio_vec		sb_bv[1];
+	void			*sb_disk_data;
 	struct closure		sb_write;
 	struct semaphore	sb_write_mutex;
 
@@ -382,6 +383,7 @@ struct cache {
 	struct cache_sb		sb;
 	struct bio		sb_bio;
 	struct bio_vec		sb_bv[1];
+	void			*sb_disk_data;
 
 	struct kobject		kobj;
 	struct block_device	*bdev;
diff --git a/drivers/md/bcache/super.c b/drivers/md/bcache/super.c
index e169739..5a2c848 100644
--- a/drivers/md/bcache/super.c
+++ b/drivers/md/bcache/super.c
@@ -62,17 +62,27 @@ struct workqueue_struct *bcache_wq;
 /* Superblock */
 
 static const char *read_super(struct cache_sb *sb, struct block_device *bdev,
-			      struct page **res)
+			      void *sb_data)
 {
 	const char *err;
 	struct cache_sb *s;
-	struct buffer_head *bh = __bread(bdev, 1, SB_SIZE);
+	struct bio rs_bio;
+	struct bio_vec rs_bv[1];
 	unsigned i;
 
-	if (!bh)
+	bio_init(&rs_bio);
+	rs_bio.bi_bdev			= bdev;
+	rs_bio.bi_rw			= READ;
+	rs_bio.bi_max_vecs		= 1;
+	rs_bio.bi_io_vec		= &rs_bv[0];
+	rs_bio.bi_iter.bi_sector	= SB_SECTOR;
+	rs_bio.bi_iter.bi_size		= SB_SIZE;
+	bch_bio_map(&rs_bio, sb_data);
+
+	if (submit_bio_wait(READ, &rs_bio))
 		return "IO error";
 
-	s = (struct cache_sb *) bh->b_data;
+	s = (struct cache_sb *) sb_data;
 
 	sb->offset		= le64_to_cpu(s->offset);
 	sb->version		= le64_to_cpu(s->version);
@@ -191,10 +201,7 @@ static const char *read_super(struct cache_sb *sb, struct block_device *bdev,
 	sb->last_mount = get_seconds();
 	err = NULL;
 
-	get_page(bh->b_page);
-	*res = bh->b_page;
 err:
-	put_bh(bh);
 	return err;
 }
 
@@ -208,13 +215,13 @@ static void write_bdev_super_endio(struct bio *bio)
 
 static void __write_super(struct cache_sb *sb, struct bio *bio)
 {
-	struct cache_sb *out = page_address(bio->bi_io_vec[0].bv_page);
+	struct cache_sb *out = page_address(bio->bi_io_vec[0].bv_page) +
+			       bio->bi_io_vec[0].bv_offset;
 	unsigned i;
 
 	bio->bi_iter.bi_sector	= SB_SECTOR;
 	bio->bi_rw		= REQ_SYNC|REQ_META;
 	bio->bi_iter.bi_size	= SB_SIZE;
-	bch_bio_map(bio, NULL);
 
 	out->offset		= cpu_to_le64(sb->offset);
 	out->version		= cpu_to_le64(sb->version);
@@ -238,7 +245,7 @@ static void __write_super(struct cache_sb *sb, struct bio *bio)
 	pr_debug("ver %llu, flags %llu, seq %llu",
 		 sb->version, sb->flags, sb->seq);
 
-	submit_bio(REQ_WRITE, bio);
+	submit_bio(WRITE_FLUSH_FUA, bio);
 }
 
 static void bch_write_bdev_super_unlock(struct closure *cl)
@@ -1045,6 +1052,8 @@ void bch_cached_dev_release(struct kobject *kobj)
 {
 	struct cached_dev *dc = container_of(kobj, struct cached_dev,
 					     disk.kobj);
+
+	kfree(dc->sb_disk_data);
 	kfree(dc);
 	module_put(THIS_MODULE);
 }
@@ -1138,7 +1147,7 @@ static int cached_dev_init(struct cached_dev *dc, unsigned block_size)
 
 /* Cached device - bcache superblock */
 
-static void register_bdev(struct cache_sb *sb, struct page *sb_page,
+static void register_bdev(struct cache_sb *sb, void *sb_disk_data,
 				 struct block_device *bdev,
 				 struct cached_dev *dc)
 {
@@ -1152,9 +1161,7 @@ static void register_bdev(struct cache_sb *sb, struct page *sb_page,
 
 	bio_init(&dc->sb_bio);
 	dc->sb_bio.bi_max_vecs	= 1;
-	dc->sb_bio.bi_io_vec	= dc->sb_bio.bi_inline_vecs;
-	dc->sb_bio.bi_io_vec[0].bv_page = sb_page;
-	get_page(sb_page);
+	dc->sb_bio.bi_io_vec	= &dc->sb_bv[0];
 
 	if (cached_dev_init(dc, sb->block_size << 9))
 		goto err;
@@ -1168,6 +1175,11 @@ static void register_bdev(struct cache_sb *sb, struct page *sb_page,
 
 	pr_info("registered backing device %s", bdevname(bdev, name));
 
+	/* Do assignment and mapping late, cannot error after this */
+	dc->sb_disk_data = sb_disk_data;
+	dc->sb_bio.bi_iter.bi_size = SB_SIZE;
+	bch_bio_map(&dc->sb_bio, sb_disk_data);
+
 	list_add(&dc->list, &uncached_devices);
 	list_for_each_entry(c, &bch_cache_sets, list)
 		bch_cached_dev_attach(dc, c);
@@ -1179,6 +1191,7 @@ static void register_bdev(struct cache_sb *sb, struct page *sb_page,
 	return;
 err:
 	pr_notice("error opening %s: %s", bdevname(bdev, name), err);
+	kfree(sb_disk_data);
 	bcache_device_stop(&dc->disk);
 }
 
@@ -1793,8 +1806,7 @@ void bch_cache_release(struct kobject *kobj)
 	for (i = 0; i < RESERVE_NR; i++)
 		free_fifo(&ca->free[i]);
 
-	if (ca->sb_bio.bi_inline_vecs[0].bv_page)
-		put_page(ca->sb_bio.bi_io_vec[0].bv_page);
+	kfree(ca->sb_disk_data);
 
 	if (!IS_ERR_OR_NULL(ca->bdev))
 		blkdev_put(ca->bdev, FMODE_READ|FMODE_WRITE|FMODE_EXCL);
@@ -1838,7 +1850,7 @@ static int cache_alloc(struct cache_sb *sb, struct cache *ca)
 	return 0;
 }
 
-static int register_cache(struct cache_sb *sb, struct page *sb_page,
+static int register_cache(struct cache_sb *sb, void *sb_disk_data,
 				struct block_device *bdev, struct cache *ca)
 {
 	char name[BDEVNAME_SIZE];
@@ -1851,16 +1863,16 @@ static int register_cache(struct cache_sb *sb, struct page *sb_page,
 
 	bio_init(&ca->sb_bio);
 	ca->sb_bio.bi_max_vecs	= 1;
-	ca->sb_bio.bi_io_vec	= ca->sb_bio.bi_inline_vecs;
-	ca->sb_bio.bi_io_vec[0].bv_page = sb_page;
-	get_page(sb_page);
+	ca->sb_bio.bi_io_vec	= &ca->sb_bv[0];
 
 	if (blk_queue_discard(bdev_get_queue(ca->bdev)))
 		ca->discard = CACHE_DISCARD(&ca->sb);
 
 	ret = cache_alloc(sb, ca);
-	if (ret != 0)
+	if (ret != 0) {
+		err = "error calling cache_alloc";
 		goto err;
+	}
 
 	if (kobject_add(&ca->kobj, &part_to_dev(bdev->bd_part)->kobj, "bcache")) {
 		err = "error calling kobject_add";
@@ -1868,11 +1880,17 @@ static int register_cache(struct cache_sb *sb, struct page *sb_page,
 		goto out;
 	}
 
+	/* Do assignment and mapping late */
+	ca->sb_disk_data = sb_disk_data;
+	ca->sb_bio.bi_iter.bi_size = SB_SIZE;
+	bch_bio_map(&ca->sb_bio, sb_disk_data);
+
 	mutex_lock(&bch_register_lock);
 	err = register_cache_set(ca);
 	mutex_unlock(&bch_register_lock);
 
 	if (err) {
+		ca->sb_disk_data = NULL;
 		ret = -ENODEV;
 		goto out;
 	}
@@ -1935,13 +1953,14 @@ static ssize_t register_bcache(struct kobject *k, struct kobj_attribute *attr,
 	char *path = NULL;
 	struct cache_sb *sb = NULL;
 	struct block_device *bdev = NULL;
-	struct page *sb_page = NULL;
+	void *sb_disk_data = NULL;
 
 	if (!try_module_get(THIS_MODULE))
 		return -EBUSY;
 
 	if (!(path = kstrndup(buffer, size, GFP_KERNEL)) ||
-	    !(sb = kmalloc(sizeof(struct cache_sb), GFP_KERNEL)))
+	    !(sb = kmalloc(sizeof(struct cache_sb), GFP_KERNEL)) ||
+	    !(sb_disk_data = kmalloc(SB_SIZE, GFP_KERNEL)))
 		goto err;
 
 	err = "failed to open device";
@@ -1967,7 +1986,7 @@ static ssize_t register_bcache(struct kobject *k, struct kobj_attribute *attr,
 	if (set_blocksize(bdev, 4096))
 		goto err_close;
 
-	err = read_super(sb, bdev, &sb_page);
+	err = read_super(sb, bdev, sb_disk_data);
 	if (err)
 		goto err_close;
 
@@ -1977,19 +1996,20 @@ static ssize_t register_bcache(struct kobject *k, struct kobj_attribute *attr,
 			goto err_close;
 
 		mutex_lock(&bch_register_lock);
-		register_bdev(sb, sb_page, bdev, dc);
+		register_bdev(sb, sb_disk_data, bdev, dc);
+		sb_disk_data = NULL; /* Consumed or freed in register call */
 		mutex_unlock(&bch_register_lock);
 	} else {
 		struct cache *ca = kzalloc(sizeof(*ca), GFP_KERNEL);
 		if (!ca)
 			goto err_close;
 
-		if (register_cache(sb, sb_page, bdev, ca) != 0)
+		if (register_cache(sb, sb_disk_data, bdev, ca) != 0)
 			goto err_close;
+		sb_disk_data = NULL;
 	}
 out:
-	if (sb_page)
-		put_page(sb_page);
+	kfree(sb_disk_data);
 	kfree(sb);
 	kfree(path);
 	module_put(THIS_MODULE);
-- 
1.9.1


[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 836 bytes --]

^ permalink raw reply related	[flat|nested] 8+ messages in thread

end of thread, other threads:[~2016-08-04 10:20 UTC | newest]

Thread overview: 8+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
     [not found] <1469091513-11233-1-git-send-email-stefan.bader@canonical.com>
2016-07-26  9:51 ` bcache super block corruption with non 4k pages Stefan Bader
2016-07-26 10:21   ` Kent Overstreet
2016-07-26 12:32     ` Stefan Bader
2016-07-26 12:49       ` Kent Overstreet
2016-07-27 15:16         ` Stefan Bader
2016-07-28  5:55           ` Kent Overstreet
2016-07-28 16:27             ` Stefan Bader
2016-08-04 10:03               ` Stefan Bader

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).