From: Hou Tao <houtao1@huawei.com>
To: Jan Kara <jack@suse.cz>
Cc: Jens Axboe <axboe@kernel.dk>, <linux-block@vger.kernel.org>,
"Christoph Hellwig" <hch@infradead.org>,
Dan Williams <dan.j.williams@intel.com>,
"Thiago Jung Bauermann" <bauerman@linux.vnet.ibm.com>,
Tejun Heo <tj@kernel.org>, Tahsin Erdogan <tahsin@google.com>,
Omar Sandoval <osandov@osandov.com>
Subject: Re: [PATCH 02/11] block: Fix race of bdev open with gendisk shutdown
Date: Fri, 24 Nov 2017 11:12:58 +0800 [thread overview]
Message-ID: <776fb2d9-3ea4-48a3-3de1-7040d6047cd3@huawei.com> (raw)
In-Reply-To: <20171120164300.GE12826@quack2.suse.cz>
Hi Jan,
On 2017/11/21 0:43, Jan Kara wrote:
> Hi Tao!
>
> On Fri 17-11-17 14:51:18, Hou Tao wrote:
>> On 2017/3/13 23:14, Jan Kara wrote:
>>> blkdev_open() may race with gendisk shutdown in two different ways.
>>> Either del_gendisk() has already unhashed block device inode (and thus
>>> bd_acquire() will end up creating new block device inode) however
>>> gen_gendisk() will still return the gendisk that is being destroyed.
>>> Or bdev returned by bd_acquire() will get unhashed and gendisk destroyed
>>> before we get to get_gendisk() and get_gendisk() will return new gendisk
>>> that got allocated for device that reused the device number.
>>>
>>> In both cases this will result in possible inconsistencies between
>>> bdev->bd_disk and bdev->bd_bdi (in the first case after gendisk gets
>>> destroyed and device number reused, in the second case immediately).
>>
>>> Fix the problem by checking whether the gendisk is still alive and inode
>>> hashed when associating bdev inode with it and its bdi. That way we are
>>> sure that we will not associate bdev inode with disk that got past
>>> blk_unregister_region() in del_gendisk() (and thus device number can get
>>> reused). Similarly, we will not associate bdev that was once associated
>>> with gendisk that is going away (and thus the corresponding bdev inode
>>> will get unhashed in del_gendisk()) with a new gendisk that is just
>>> reusing the device number.
>>
>> I have run some extended tests based on Omar Sandoval's script [1] these days,
>> and found two new problems related with the race between bdev open and gendisk
>> shutdown.
>>
>> The first problem lies in __blkdev_get(), and will cause use-after-free, or
>> worse oops. It happens because there may be two instances of gendisk related
>> with one bdev. When the process which owns the newer gendisk completes the
>> invocation of __blkdev_get() firstly, the other process which owns the older
>> gendisk will put the last reference of the older gendisk and cause use-after-free.
>>
>> The following call sequences illustrate the problem:
>>
>> unhash the bdev_inode of /dev/sda
>>
>> process A (blkdev_open()):
>> bd_acquire() returns new bdev_inode of /dev/sda
>> get_gendisk() returns gendisk (v1 gendisk)
>>
>> remove gendisk from bdev_map
>> device number is reused
>
> ^^ this is through blk_unregister_region() in del_gendisk(), isn't it?
Yes, both the unhash of inode and the removal of gendisk from bdev_map
happen in del_gendisk().
>> process B (blkdev_open()):
>> bd_acquire() returns new bdev_inode of /dev/sda
>> get_gendisk() returns a new gendisk (v2 gendisk)
>> increase bdev->bd_openers
>
> So this needs to be a bit more complex - bd_acquire() must happen before
> bdev_unhash_inode() in del_gendisk() above happened. And get_gendisk() must
> happen after blk_unregister_region() from del_gendisk() happened. But yes,
> this seems possible.
>
>> process A (blkdev_open()):
>> find bdev->bd_openers != 0
>> put_disk(disk) // this is the last reference to v1 gendisk
>> disk_unblock_events(disk) // use-after-free occurs
>>
>> The problem can be fixed by an extra check showed in the following snippet:
>>
>> static int __blkdev_get(struct block_device *bdev, fmode_t mode, int for_part)
>> {
>> if (!bdev->bd_openers) {
>> } else {
>> if (bdev->bd_disk != disk) {
>> ret = -ENXIO;
>> goto out_unlock_bdev;
>> }
>> }
>> }
>>
>> And we also can simplify the check in your patch by moving
>> blk_unregister_region() before bdev_unhash_inode(), so whenever we get a
>> new bdev_inode, the gendisk returned by get_gendisk() will either be NULL
>> or a new gendisk.
>
> I don't think we can do that. If we call blk_unregister_region() before
> bdev_unhash_inode(), the device number can get reused while bdev inode is
> still visible. So you can get that bdev inode v1 associated with gendisk
> v2. And open following unhashing of bdev inode v1 will get bdev inode v2
> associated with gendisk v2. So you have two different bdev inodes
> associated with the same gendisk and that will cause data integrity issues.
Even we do not move blk_unregister_region(), the data integrity issue still
exists: bd_acquire() is invoked before bdev_unhash_inode() and get_gendisk()
is invoked after blk_unregister_region(). The move is used to simplify the
consistency check or the version check between bdev_inode and gendisk, so
the added check above will be unnecessary, because whenever bd_acquire()
returns a new bdev_inode, the gendisk returned by get_gendisk() will also
a new gendisk or NULL. Anyway, it's just another workaround, so it's OK
we do not do that.
> But what you suggest above is probably a reasonable fix. Will you submit a
> proper patch please?
Uh, it's also an incomplete fix to the race problem. Anyhow i will do it.
>> And the only check we need to do in __blkdev_get() is
>> checking the hash status of the bdev_inode after we get a gendisk from
>> get_gendisk(). The check can be done like the following snippet:
>>
>> static int __blkdev_get(struct block_device *bdev, fmode_t mode, int for_part)
>> {
>> /* ...... */
>> ret = -ENXIO;
>> disk = get_gendisk(bdev->bd_dev, &partno);
>> if (!disk)
>> goto out;
>> owner = disk->fops->owner;
>>
>> if (inode_unhashed(bdev->bd_inode))
>> goto out_unmatched;
>> }
>>
>> The second problem lies in bd_start_claiming(), and will trigger the
>> BUG_ON assert in blkdev_get: BUG_ON(!bd_may_claim(bdev, whole, holder)).
>> It occurs when testing the exclusive open and close of the disk and its
>> partitions.
>>
>> The cause of the problem is that a bdev_inode of a disk partition
>> corresponds with two instances of bdev_inode of the whole disk
>> simultaneously. When one pair of the partition inode and disk inode
>> completes the claiming, the other pair will be stumbled by the BUG_ON
>> assert in blkdev_get() because bdev->bd_holder is no longer NULL.
>>
>> The following sequences illustrate the problem:
>>
>> unhash the bdev_inode of /dev/sda2
>>
>> process A (blkdev_open()):
>> bd_acquire() returns a new bdev_inode for /dev/sda2
>> bd_start_claiming() returns bdev_inode of /dev/sda
>>
>> process B (blkdev_open()):
>> bd_acquire() returns the new bdev_inode for /dev/sda2
>>
>> unhash the bdev_inode of /dev/sda
>> remove gendisk from bdev_map
>> device number is reused
>>
>> process B (blkdev_open()):
>> bd_start_claming() returns a new bdev_inode for /dev/sda
>>
>> process A (blkdev_open()):
>> __blkdev_get() returns successfully
>> finish claiming // bdev->bd_holder = holder
>>
>> process B (blkdev_open()):
>> __blkdev_get() returns successfully
>> trigger BUG_ON(!bd_may_claim(bdev, whole, holder))
>
> Ah, good spotting. Essentially the whole->bd_claiming protection for
> partition bdev does not work because we've got two different 'whole' bdev
> pointers.
Yes, perfect summary.
>> And the problem can be fixed by adding more checks in bd_start_claiming():
>>
>> static struct block_device *bd_start_claiming(struct block_device *bdev,
>> void *holder)
>> {
>> /* ...... */
>> disk = get_gendisk(bdev->bd_dev, &partno);
>> if (!disk)
>> return ERR_PTR(-ENXIO);
>>
>> if (inode_unhashed(bdev->bd_inode)) {
>> module_put(disk->fops->owner);
>> put_disk(disk);
>> return ERR_PTR(-ENXIO);
>> }
>>
>> /* ...... */
>> if (!whole)
>> return ERR_PTR(-ENOMEM);
>>
>> if (inode_unhashed(bdev->bd_inode) ||
>> inode_unhashed(whole->bd_inode)) {
>> bdput(whole);
>> return ERR_PTR(-ENXIO);
>> }
>> }
>>
>> Except adding more and more checks, are there better solutions to the
>> race problem ? I have thought about managing the device number by
>> ref-counter, so the device number will not be reused until the last
>> reference of bdev_inode and gendisk are released, and i am trying to
>> figure out a way to get/put a reference of the device number when get/put
>> the block_device.
>>
>> So any suggests and thoughts ?
>
> Yeah, these races are nasty. I will think whether there's some reasonably
> easy way to fixing them or whether we'll have to go via some other route
> like blocking the device number as you suggest. In either case thanks for
> the report and the analysis!
Look forward to review them.
Regards,
Tao
> Honza
>
next prev parent reply other threads:[~2017-11-24 3:13 UTC|newest]
Thread overview: 26+ messages / expand[flat|nested] mbox.gz Atom feed top
2017-03-13 15:13 [PATCH 0/11 v4] block: Fix block device shutdown related races Jan Kara
2017-03-13 15:14 ` [PATCH 01/11] block: Fix bdi assignment to bdev inode when racing with disk delete Jan Kara
2017-03-13 15:14 ` [PATCH 02/11] block: Fix race of bdev open with gendisk shutdown Jan Kara
2017-03-21 16:19 ` Tejun Heo
2017-03-23 0:21 ` Jan Kara
2017-03-23 15:38 ` Tejun Heo
2017-11-17 6:51 ` Hou Tao
2017-11-20 16:43 ` Jan Kara
2017-11-24 3:12 ` Hou Tao [this message]
2017-03-13 15:14 ` [PATCH 03/11] bdi: Mark congested->bdi as internal Jan Kara
2017-03-13 15:14 ` [PATCH 04/11] bdi: Make wb->bdi a proper reference Jan Kara
2017-03-13 15:14 ` [PATCH 05/11] bdi: Unify bdi->wb_list handling for root wb_writeback Jan Kara
2017-03-21 16:21 ` Tejun Heo
2017-03-13 15:14 ` [PATCH 06/11] bdi: Shutdown writeback on all cgwbs in cgwb_bdi_destroy() Jan Kara
2017-03-13 15:14 ` [PATCH 07/11] bdi: Do not wait for cgwbs release in bdi_unregister() Jan Kara
2017-03-13 15:14 ` [PATCH 08/11] bdi: Rename cgwb_bdi_destroy() to cgwb_bdi_unregister() Jan Kara
2017-03-13 15:14 ` [PATCH 09/11] block: Fix oops in locked_inode_to_wb_and_lock_list() Jan Kara
2017-03-13 15:14 ` [PATCH 10/11] kobject: Export kobject_get_unless_zero() Jan Kara
2017-03-13 15:14 ` [PATCH 11/11] block: Fix oops scsi_disk_get() Jan Kara
2017-03-13 18:10 ` [PATCH 0/11 v4] block: Fix block device shutdown related races Dan Williams
2017-03-21 0:57 ` Thiago Jung Bauermann
-- strict thread matches above, loose matches on Subject: below --
2017-03-06 16:33 [PATCH 0/11 v3] " Jan Kara
2017-03-06 16:33 ` [PATCH 02/11] block: Fix race of bdev open with gendisk shutdown Jan Kara
2017-03-06 22:04 ` Tejun Heo
2017-03-07 11:14 ` Jan Kara
2017-03-07 19:43 ` Tejun Heo
2017-03-08 14:25 ` Jan Kara
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=776fb2d9-3ea4-48a3-3de1-7040d6047cd3@huawei.com \
--to=houtao1@huawei.com \
--cc=axboe@kernel.dk \
--cc=bauerman@linux.vnet.ibm.com \
--cc=dan.j.williams@intel.com \
--cc=hch@infradead.org \
--cc=jack@suse.cz \
--cc=linux-block@vger.kernel.org \
--cc=osandov@osandov.com \
--cc=tahsin@google.com \
--cc=tj@kernel.org \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).