From: Hou Tao <houtao1@huawei.com>
To: Jan Kara <jack@suse.cz>
Cc: Jens Axboe <axboe@kernel.dk>, <linux-block@vger.kernel.org>,
	"Christoph Hellwig" <hch@infradead.org>,
	Dan Williams <dan.j.williams@intel.com>,
	"Thiago Jung Bauermann" <bauerman@linux.vnet.ibm.com>,
	Tejun Heo <tj@kernel.org>, Tahsin Erdogan <tahsin@google.com>,
	Omar Sandoval <osandov@osandov.com>
Subject: Re: [PATCH 02/11] block: Fix race of bdev open with gendisk shutdown
Date: Fri, 24 Nov 2017 11:12:58 +0800
Message-ID: <776fb2d9-3ea4-48a3-3de1-7040d6047cd3@huawei.com>
In-Reply-To: <20171120164300.GE12826@quack2.suse.cz>

Hi Jan,

On 2017/11/21 0:43, Jan Kara wrote:
> Hi Tao!
> 
> On Fri 17-11-17 14:51:18, Hou Tao wrote:
>> On 2017/3/13 23:14, Jan Kara wrote:
>>> blkdev_open() may race with gendisk shutdown in two different ways.
>>> Either del_gendisk() has already unhashed block device inode (and thus
>>> bd_acquire() will end up creating new block device inode) however
>>> get_gendisk() will still return the gendisk that is being destroyed.
>>> Or bdev returned by bd_acquire() will get unhashed and gendisk destroyed
>>> before we get to get_gendisk() and get_gendisk() will return new gendisk
>>> that got allocated for device that reused the device number.
>>>
>>> In both cases this will result in possible inconsistencies between
>>> bdev->bd_disk and bdev->bd_bdi (in the first case after gendisk gets
>>> destroyed and device number reused, in the second case immediately).
>>
>>> Fix the problem by checking whether the gendisk is still alive and inode
>>> hashed when associating bdev inode with it and its bdi. That way we are
>>> sure that we will not associate bdev inode with disk that got past
>>> blk_unregister_region() in del_gendisk() (and thus device number can get
>>> reused). Similarly, we will not associate bdev that was once associated
>>> with gendisk that is going away (and thus the corresponding bdev inode
>>> will get unhashed in del_gendisk()) with a new gendisk that is just
>>> reusing the device number.
>>
>> I have run some extended tests based on Omar Sandoval's script [1] these
>> days, and found two new problems related to the race between bdev open and
>> gendisk shutdown.
>>
>> The first problem lies in __blkdev_get(), and will cause a use-after-free,
>> or worse, an oops. It happens because two instances of gendisk may be
>> related to one bdev. When the process that owns the newer gendisk completes
>> the invocation of __blkdev_get() first, the other process, which owns the
>> older gendisk, will put the last reference of the older gendisk and cause
>> the use-after-free.
>>
>> The following call sequences illustrate the problem:
>>
>> unhash the bdev_inode of /dev/sda
>>
>> process A (blkdev_open()):
>> 	bd_acquire() returns new bdev_inode of /dev/sda
>> 	get_gendisk() returns gendisk (v1 gendisk)
>>
>> remove gendisk from bdev_map
>> device number is reused
> 
> ^^ this is through blk_unregister_region() in del_gendisk(), isn't it?

Yes, both the unhashing of the inode and the removal of the gendisk from
bdev_map happen in del_gendisk().
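
Roughly, as I read the code after your patch, the relevant ordering in
del_gendisk() is (heavily elided sketch):

void del_gendisk(struct gendisk *disk)
{
	/* ... */
	/*
	 * 1) unhash the bdev inodes of the partitions and of the whole
	 *    disk, so a later bd_acquire() allocates fresh bdev inodes
	 */
	bdev_unhash_inode(disk_devt(disk));
	/* ... */
	/*
	 * 2) drop the disk from bdev_map; from now on get_gendisk()
	 *    cannot find it and the device number may be reused
	 */
	blk_unregister_region(disk_devt(disk), disk->minors);
	/* ... */
}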

>> process B (blkdev_open()):
>> 	bd_acquire() returns new bdev_inode of /dev/sda
>> 	get_gendisk() returns a new gendisk (v2 gendisk)
>> 	increase bdev->bd_openers
> 
> So this needs to be a bit more complex - bd_acquire() must happen before
> bdev_unhash_inode() in del_gendisk() above, and get_gendisk() must happen
> after blk_unregister_region() from del_gendisk(). But yes, this seems
> possible.
> 
>> process A (blkdev_open()):
>> 	find bdev->bd_openers != 0
>> 	put_disk(disk) // this is the last reference to v1 gendisk
>> 	disk_unblock_events(disk) // use-after-free occurs
>>
>> The problem can be fixed by an extra check, shown in the following snippet:
>>
>> static int __blkdev_get(struct block_device *bdev, fmode_t mode, int for_part)
>> {
>> 	/* ...... */
>> 	if (!bdev->bd_openers) {
>> 		/* first open: bind bdev->bd_disk to disk as before */
>> 	} else {
>> 		/*
>> 		 * bdev is already open, so bdev->bd_disk is bound to
>> 		 * whichever gendisk did the first open; a mismatch means
>> 		 * we raced with del_gendisk() and must fail the open.
>> 		 */
>> 		if (bdev->bd_disk != disk) {
>> 			ret = -ENXIO;
>> 			goto out_unlock_bdev;
>> 		}
>> 	}
>> 	/* ...... */
>> }
>>
>> And we can also simplify the check in your patch by moving
>> blk_unregister_region() before bdev_unhash_inode(), so whenever we get a
>> new bdev_inode, the gendisk returned by get_gendisk() will either be NULL
>> or a new gendisk.
> 
> I don't think we can do that. If we call blk_unregister_region() before
> bdev_unhash_inode(), the device number can get reused while the bdev inode
> is still visible. So you can get bdev inode v1 associated with gendisk v2.
> And an open following the unhashing of bdev inode v1 will get bdev inode v2
> associated with gendisk v2. So you have two different bdev inodes associated
> with the same gendisk, and that will cause data integrity issues.

Even if we do not move blk_unregister_region(), the data integrity issue
still exists when bd_acquire() is invoked before bdev_unhash_inode() and
get_gendisk() is invoked after blk_unregister_region(). The move is only
meant to simplify the consistency check (the version check between
bdev_inode and gendisk): the added check above becomes unnecessary, because
whenever bd_acquire() returns a new bdev_inode, the gendisk returned by
get_gendisk() will also be a new gendisk or NULL. Anyway, it's just another
workaround, so it's OK if we do not do that.
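
Just to make precise what I meant (a sketch only, not a tested patch, and
not something I am proposing given the issue you pointed out):

	/* inside del_gendisk(): the proposed (workaround) reordering */
	blk_unregister_region(disk_devt(disk), disk->minors);
	/*
	 * With the region gone first, get_gendisk() can only return NULL
	 * or the gendisk of a device that reused the number, so a fresh
	 * bdev_inode from bd_acquire() can never pair with the old gendisk.
	 */
	bdev_unhash_inode(disk_devt(disk));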

> But what you suggest above is probably a reasonable fix. Will you submit a
> proper patch please?

Uh, it's also an incomplete fix for the race problem. Anyhow, I will do it.

>> And the only check we need to do in __blkdev_get() is to check the hash
>> status of the bdev_inode after we get a gendisk from get_gendisk(). The
>> check can be done as in the following snippet:
>>
>> static int __blkdev_get(struct block_device *bdev, fmode_t mode, int for_part)
>> {
>> 	/* ...... */
>> 	ret = -ENXIO;
>> 	disk = get_gendisk(bdev->bd_dev, &partno);
>> 	if (!disk)
>> 		goto out;
>> 	owner = disk->fops->owner;
>>
>> 	/*
>> 	 * The bdev_inode was unhashed by del_gendisk() after bd_acquire(),
>> 	 * so disk may belong to a device that reused the number: bail out.
>> 	 */
>> 	if (inode_unhashed(bdev->bd_inode))
>> 		goto out_unmatched;
>> }
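
For completeness: the out_unmatched label above does not exist yet. I imagine
it unwinding roughly like this (just a sketch, mirroring the existing error
paths of __blkdev_get()):

 out_unmatched:
	/* ret is still -ENXIO here; only drop the references we took */
	put_disk(disk);
	module_put(owner);
	goto out;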
>>
>> The second problem lies in bd_start_claiming(), and will trigger the
>> BUG_ON() assertion in blkdev_get(): BUG_ON(!bd_may_claim(bdev, whole,
>> holder)). It occurs when testing the exclusive open and close of the disk
>> and its partitions.
>>
>> The cause of the problem is that a bdev_inode of a disk partition
>> corresponds to two instances of the bdev_inode of the whole disk
>> simultaneously. When one pair of partition inode and disk inode completes
>> the claiming, the other pair will trip over the BUG_ON() assertion in
>> blkdev_get() because bdev->bd_holder is no longer NULL.
>>
>> The following sequences illustrate the problem:
>>
>> unhash the bdev_inode of /dev/sda2
>>
>> process A (blkdev_open()):
>> 	bd_acquire() returns a new bdev_inode for /dev/sda2
>> 	bd_start_claiming() returns bdev_inode of /dev/sda
>>
>> process B (blkdev_open()):
>> 	bd_acquire() returns the new bdev_inode for /dev/sda2
>>
>> unhash the bdev_inode of /dev/sda
>> remove gendisk from bdev_map
>> device number is reused
>>
>> process B (blkdev_open()):
>> 	bd_start_claiming() returns a new bdev_inode for /dev/sda
>>
>> process A (blkdev_open()):
>> 	__blkdev_get() returns successfully
>> 	finish claiming  // bdev->bd_holder = holder
>>
>> process B (blkdev_open()):
>> 	__blkdev_get() returns successfully
>> 	trigger BUG_ON(!bd_may_claim(bdev, whole, holder))
> 
> Ah, good spotting. Essentially the whole->bd_claiming protection for
> partition bdev does not work because we've got two different 'whole' bdev
> pointers.

Yes, perfect summary.
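
For reference, bd_may_claim() does roughly the following (paraphrasing
fs/block_dev.c from memory):

static bool bd_may_claim(struct block_device *bdev, struct block_device *whole,
			 void *holder)
{
	if (bdev->bd_holder == holder)
		return true;	/* already a holder */
	else if (bdev->bd_holder != NULL)
		return false;	/* held by someone else */
	else if (whole == bdev)
		return true;	/* is a whole device which isn't held */
	else if (whole->bd_holder == bd_may_claim)
		return true;	/* is a partition of a device being partitioned */
	else if (whole->bd_holder != NULL)
		return false;	/* is a partition of a held device */
	else
		return true;	/* is a partition of an un-held device */
}

Once process A finishes its claim, bdev->bd_holder is set to A's holder, so
process B fails the second check and the BUG_ON() fires. The bd_claiming
serialization on 'whole' never kicks in because A and B hold two different
'whole' bdevs.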

>> And the problem can be fixed by adding more checks in bd_start_claiming():
>>
>> static struct block_device *bd_start_claiming(struct block_device *bdev,
>>                                               void *holder)
>> {
>> 	/* ...... */
>> 	disk = get_gendisk(bdev->bd_dev, &partno);
>> 	if (!disk)
>> 		return ERR_PTR(-ENXIO);
>>
>> 	/* the bdev_inode was unhashed after bd_acquire(): bail out */
>> 	if (inode_unhashed(bdev->bd_inode)) {
>> 		module_put(disk->fops->owner);
>> 		put_disk(disk);
>> 		return ERR_PTR(-ENXIO);
>> 	}
>>
>> 	/* ...... */
>> 	if (!whole)
>> 		return ERR_PTR(-ENOMEM);
>>
>> 	/* recheck both inodes now that we hold the 'whole' bdev */
>> 	if (inode_unhashed(bdev->bd_inode) ||
>> 	    inode_unhashed(whole->bd_inode)) {
>> 		bdput(whole);
>> 		return ERR_PTR(-ENXIO);
>> 	}
>> 	/* ...... */
>> }
>>
>> Apart from adding more and more checks, are there better solutions to the
>> race problem? I have thought about managing the device number with a
>> ref-counter, so the device number will not be reused until the last
>> references of the bdev_inode and the gendisk are released, and I am trying
>> to figure out a way to get/put a reference on the device number when we
>> get/put the block_device.
>>
>> So, any suggestions and thoughts?
> 
> Yeah, these races are nasty. I will think about whether there's some
> reasonably easy way to fix them or whether we'll have to go via some other
> route like blocking the device number as you suggest. In either case,
> thanks for the report and the analysis!

Looking forward to reviewing them.
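
To make the ref-counter idea a bit more concrete, here is a very rough sketch
of what I am imagining; all of the names below (struct devt_ref, get_devt(),
put_devt(), devt_free_for_reuse()) are hypothetical:

#include <linux/refcount.h>

/* hypothetical: a refcount attached to each allocated device number */
struct devt_ref {
	dev_t		devt;
	refcount_t	count;
};

/* taken whenever a bdev_inode or a gendisk binds to the devt */
static inline void get_devt(struct devt_ref *ref)
{
	refcount_inc(&ref->count);
}

/* the devt may only be reused after the last reference is dropped */
static inline void put_devt(struct devt_ref *ref)
{
	if (refcount_dec_and_test(&ref->count))
		devt_free_for_reuse(ref->devt);	/* hypothetical */
}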

Regards,
Tao

> 								Honza
> 
