All of lore.kernel.org
 help / color / mirror / Atom feed
From: Chandan Rajendra <chandan@linux.vnet.ibm.com>
To: bo.li.liu@oracle.com
Cc: linux-btrfs@vger.kernel.org, Eryu Guan <guaneryu@gmail.com>,
	David Sterba <dsterba@suse.com>
Subject: Re: [PATCH] Btrfs: let super_stripesize match with sectorsize
Date: Fri, 17 Jun 2016 10:48:05 +0530	[thread overview]
Message-ID: <69818794.Hm6NXvuvrY@reserved-192-168-1-6.rchland.ibm.com> (raw)
In-Reply-To: <20160616170141.GB27257@localhost.localdomain>

On Thursday, June 16, 2016 10:01:41 AM Liu Bo wrote:
> On Thu, Jun 16, 2016 at 01:53:59PM +0530, Chandan Rajendra wrote:
> > On Wednesday, June 15, 2016 05:09:55 PM Liu Bo wrote:
> > > On Wed, Jun 15, 2016 at 03:50:17PM +0530, Chandan Rajendra wrote:
> > > > On Wednesday, June 15, 2016 09:12:28 AM Chandan Rajendra wrote:
> > > > > Hello Liu Bo,
> > > > > 
> > > > > We have to fix the following check in check_super() as well,
> > > > > 
> > > > >        if (btrfs_super_stripesize(sb) != 4096) {
> > > > >        
> > > > >                 error("invalid stripesize %u",
> > > > >                 btrfs_super_stripesize(sb));
> > > > >                 goto error_out;
> > > > >         
> > > > >         }
> > > > > 
> > > > > i.e. btrfs_super_stripesize(sb) must be equal to
> > > > > btrfs_super_sectorsize(sb).
> > > > > 
> > > > > However in btrfs-progs (mkfs.c to be precise) since we had
> > > > > stripesize
> > > > > hardcoded to 4096, setting stripesize to the value of sectorsize in
> > > > > mkfs.c will cause the following to occur when mkfs.btrfs is invoked
> > > > > for
> > > > > devices with existing Btrfs filesystem instances,
> > > > > 
> > > > > NOTE: Assume we have changed the stripesize validation in
> > > > > btrfs-progs'
> > > > > check_super() to,
> > > > > 
> > > > >         if (btrfs_super_stripesize(sb) !=
> > > > >         btrfs_super_sectorsize(sb)) {
> > > > >         
> > > > >                 error("invalid stripesize %u",
> > > > >                 btrfs_super_stripesize(sb));
> > > > >                 goto error_out;
> > > > >         
> > > > >         }
> > > > > 
> > > > > main()
> > > > > 
> > > > >  for each device file passed as an argument,
> > > > >  
> > > > >    test_dev_for_mkfs()
> > > > >    
> > > > >      check_mounted
> > > > >      
> > > > >        check_mounted_where
> > > > >        
> > > > >          btrfs_scan_one_device
> > > > >          
> > > > >            btrfs_read_dev_super
> > > > >            
> > > > >              check_super() call will fail for existing filesystems
> > > > >              which
> > > > > 
> > > > > have stripesize set to 4k. All existing filesystem instances will
> > > > > fall
> > > > > into
> > > > > this category.
> > > > > 
> > > > > This error value is pushed up the call stack and this causes the
> > > > > device
> > > > > to
> > > > > not get added to the fs_devices_mnt list in check_mounted_where().
> > > > > Hence
> > > > > we
> > > > > would fail to correctly check the mount status of the multi-device
> > > > > btrfs
> > > > > filesystems.
> > > > 
> > > > We can end up in the following scenario,
> > > > - /dev/loop0, /dev/loop1 and /dev/loop2 are mounted as a single
> > > > 
> > > >   filesystem. The filesystem was created by an older version of
> > > >   mkfs.btrfs
> > > >   which set stripesize to 4k.
> > > > 
> > > > - losetup -a
> > > > 
> > > >    /dev/loop0: [0030]:19477 (/root/disk-imgs/file-0.img)
> > > >    /dev/loop1: [0030]:16577 (/root/disk-imgs/file-1.img)
> > > >    /dev/loop2: [64770]:3423229 (/root/disk-imgs/file-2.img)
> > > > 
> > > > - /etc/mtab lists only /dev/loop0
> > > > - "losetup /dev/loop4 /root/disk-imgs/file-1.img"
> > > > 
> > > >    The new mkfs.btrfs invoked as 'mkfs.btrfs -f /dev/loop4' succeeds
> > > >    even
> > > >    though /dev/loop1 has already been mounted and has
> > > >    /root/disk-imgs/file-1.img as its backing file.
> > > > 
> > > > So IMHO the only solution is to have the stripesize check in
> > > > check_super()
> > > > to allow both '4k' and 'sectorsize' as valid values i.e.
> > > > 
> > > >         if ((btrfs_super_stripesize(sb) != 4096)
> > > > 	    
> > > > 	    && (btrfs_super_stripesize(sb) != btrfs_super_sectorsize(sb))) {
> > > > 	    
> > > >                 error("invalid stripesize %u",
> > > >                 btrfs_super_stripesize(sb));
> > > >                 goto error_out;
> > > >         
> > > >         }
> > > 
> > > That's a good one.
> > > 
> > > But if we go back to the original point, in the kernel side,
> > > 1. in open_ctree(), root->stripesize = btrfs_super_stripesize();
> > > 
> > > 2. in find_free_extent(),
> > > 
> > > 	...
> > > 	search_start = ALIGN(offset, root->stripesize);
> > > 	...
> > > 
> > > 3. in btrfs_alloc_tree_block(),
> > > 
> > > 	...
> > > 	ret = btrfs_reserve_extent(..., &ins, ...);
> > > 	...
> > > 	buf = btrfs_init_new_buffer(trans, root, ins.objectid, level);
> > > 
> > > 4. in btrfs_init_new_buffer(),
> > > 
> > > 	...
> > > 	buf = btrfs_find_create_tree_block(root, bytenr);
> > > 	...
> > > 
> > > Because 'num_bytes' we pass to find_free_extent() is aligned to
> > > sectorsize, the free space we can find is aligned to sectorsize,
> > > which means 'offset' in '1. find_free_extent()' is aligned to
> > > sectorsize.
> > > 
> > > If our stripesize is larger than sectorsize, say 4 * sectorsize,
> > > for data allocation it's fine while for metadata block allocations it's
> > > not.  It is possible that when we allocate a new metadata block, we can
> > > end up with an existing eb returned by '4. in btrfs_init_new_buffer()'.
> > 
> > Liu, I am sorry ... I am unable to visualize a scenario where the above
> > described scenario could happen. Can you please provide an example?
> 
> Sure, imagine that sectorsize is 4k and stripesize is 16k,
> and a tree root's eb has eb->start = 12599296 (12582912 + 16384, a typical
> bytenr in btrfs) which is aligned to 4k, and when CoW happens on another
> eb,
> 
> __btrfs_cow_block()
>   ->btrfs_alloc_tree_block()
>     ->btrfs_reserve_extent()
>       ->find_free_extent()
>     ->btrfs_init_new_buffer()
> 
> btrfs_reserve_extent() can return 12599296 for the new eb even if what
> it've found is (12582912 + 4096), but
>  after 'search_start = ALIGN(offset, root->stripesize)', it gets to
> 12599296.
> 
> In btrfs_init_new_buffer, we search eb tree by bytenr=12599296 and
> get tree root's eb, the following btrfs_tree_lock will scream.
> 
> The example is taken from
> btrfs-progs/tests/fuzz-tests/images/superblock-stripsize-bogus.raw.xz
> 

ah, this is indeed possible when nodesize is same as sectorsize
i.e. 4k. Thanks for the explaination Liubo.

The new validation patches that I have posted
(http://news.gmane.org/find-root.php?message_id=1466095078%2d25726%2d1%2dgit%2dsend%2demail%2dchandan%40linux.vnet.ibm.com
and
http://news.gmane.org/find-root.php?message_id=1466095109%2d26044%2d1%2dgit%2dsend%2demail%2dchandan%40linux.vnet.ibm.com)
restrict the stripesize to be either sectorsize or 4096. So I think these
restrictions are good enough to make sure we don't get into the situation
explained by you.

-- 
chandan


  reply	other threads:[~2016-06-17  5:18 UTC|newest]

Thread overview: 10+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2016-06-14 21:33 [PATCH] Btrfs: let super_stripesize match with sectorsize Liu Bo
2016-06-15  3:42 ` Chandan Rajendra
2016-06-15 10:20   ` Chandan Rajendra
2016-06-16  0:09     ` Liu Bo
2016-06-16  8:23       ` Chandan Rajendra
2016-06-16 17:01         ` Liu Bo
2016-06-17  5:18           ` Chandan Rajendra [this message]
2016-06-17  6:08             ` Liu Bo
2016-06-17  6:51               ` Chandan Rajendra
2016-06-17 16:30                 ` David Sterba

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=69818794.Hm6NXvuvrY@reserved-192-168-1-6.rchland.ibm.com \
    --to=chandan@linux.vnet.ibm.com \
    --cc=bo.li.liu@oracle.com \
    --cc=dsterba@suse.com \
    --cc=guaneryu@gmail.com \
    --cc=linux-btrfs@vger.kernel.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.