linux-kernel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
From: Dan Williams <dan.j.williams@intel.com>
To: Jan Kara <jack@suse.cz>
Cc: Jens Axboe <axboe@fb.com>, Andi Kleen <ak@linux.intel.com>,
	Rabin Vincent <rabinv@axis.com>,
	"linux-kernel@vger.kernel.org" <linux-kernel@vger.kernel.org>,
	"stable@vger.kernel.org" <stable@vger.kernel.org>,
	linux-block@vger.kernel.org, Jeff Moyer <jmoyer@redhat.com>,
	Wei Fang <fangwei1@huawei.com>, Christoph Hellwig <hch@lst.de>
Subject: Re: [PATCH] block: fix blk_get_backing_dev_info() crash, use bdev->bd_queue
Date: Fri, 6 Jan 2017 09:45:45 -0800	[thread overview]
Message-ID: <CAPcyv4gmvGkAbse619WOBsEd8DDn8Fe4kVJ7r+iYug8DEg+cEg@mail.gmail.com> (raw)
In-Reply-To: <20170106102330.GB3533@quack2.suse.cz>

[-- Attachment #1: Type: text/plain, Size: 6118 bytes --]

On Fri, Jan 6, 2017 at 2:23 AM, Jan Kara <jack@suse.cz> wrote:
> On Thu 05-01-17 17:17:55, Dan Williams wrote:
>> The ->bd_queue member of struct block_device was added in commit
>> 87192a2a49c4 ("vfs: cache request_queue in struct block_device") in
>> v3.3. However, blk_get_backing_dev_info() has been using
>> bdev_get_queue() and grabbing the request_queue through the gendisk
>> since before the git era.
>>
>> At final __blkdev_put() time ->bd_disk is cleared while ->bd_queue is
>> not. The queue remains valid until the final put of the parent disk.
>>
>> The following crash signature results from blk_get_backing_dev_info()
>> trying to lookup the queue through ->bd_disk after the final put of the
>> block device. Simply switch bdev_get_queue() to use ->bd_queue directly
>> which is guaranteed to still be valid at invalidate_partition() time.
>>
>>  BUG: unable to handle kernel NULL pointer dereference at 0000000000000568
>>  IP: blk_get_backing_dev_info+0x10/0x20
>>  [..]
>>  Call Trace:
>>   __inode_attach_wb+0x3a7/0x5d0
>>   __filemap_fdatawrite_range+0xf8/0x100
>>   filemap_write_and_wait+0x40/0x90
>>   fsync_bdev+0x54/0x60
>>   ? bdget_disk+0x30/0x40
>>   invalidate_partition+0x24/0x50
>>   del_gendisk+0xfa/0x230
>
> So we have a similar reports of the same problem. E.g.:
>
> http://www.spinics.net/lists/linux-fsdevel/msg105153.html
>
> However I kind of miss how your patch would fix all those cases. The
> principial problem is that inode_to_bdi() called on block device inode
> wants to get the backing_dev_info however on last close of a block device
> we do put_disk() and thus the request queue containing backing_dev_info
> does not have to be around at that time. In your case you are lucky enough
> to have the containing disk still around but that's not the case for all
> inode_to_bdi() users (see e.g. the report I referenced) and your patch
> would change relatively straightforward NULL pointer dereference to rather
> subtle use-after-free issue

True. If there are other cases that don't hold their own queue
reference this patch makes things worse.

> so I disagree with going down this path.

I still think this patch is the right thing to do, but it needs to
come after the wider guarantee that having an active bdev reference
guarantees that the queue and backing_dev_info are still allocated.

> So what I think needs to be done is that we make backing_dev_info
> independently allocated structure with different lifetime rules to gendisk
> or request_queue - definitely I want it to live as long as block device
> inode exists. However it needs more thought what the exact lifetime rules
> will be.

Hmm, why does it need to be separately allocated?

Something like this, passes the libnvdimm unit tests: (non-whitespace
damaged version attached)

diff --git a/block/blk-core.c b/block/blk-core.c
index 7c1a24fb09d8..6742c49b38e4 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -378,7 +378,8 @@ EXPORT_SYMBOL(blk_run_queue);

 void blk_put_queue(struct request_queue *q)
 {
-       kobject_put(&q->kobj);
+       if (q)
+               kobject_put(&q->kobj);
 }
 EXPORT_SYMBOL(blk_put_queue);

diff --git a/fs/block_dev.c b/fs/block_dev.c
index 05b553368bb4..7503110e37ff 100644
--- a/fs/block_dev.c
+++ b/fs/block_dev.c
@@ -589,12 +589,22 @@ static struct inode *bdev_alloc_inode(struct
super_block *sb)
        return &ei->vfs_inode;
 }

+static void bdev_release(struct work_struct *w)
+{
+       struct block_device *bdev = container_of(w, typeof(*bdev), bd_release);
+       struct bdev_inode *bdi = container_of(bdev, typeof(*bdi), bdev);
+
+       blk_put_queue(bdev->bd_queue);
+       kmem_cache_free(bdev_cachep, bdi);
+}
+
 static void bdev_i_callback(struct rcu_head *head)
 {
        struct inode *inode = container_of(head, struct inode, i_rcu);
-       struct bdev_inode *bdi = BDEV_I(inode);
+       struct block_device *bdev = &BDEV_I(inode)->bdev;

-       kmem_cache_free(bdev_cachep, bdi);
+       /* blk_put_queue needs process context */
+       schedule_work(&bdev->bd_release);
 }

 static void bdev_destroy_inode(struct inode *inode)
@@ -613,6 +623,7 @@ static void init_once(void *foo)
 #ifdef CONFIG_SYSFS
        INIT_LIST_HEAD(&bdev->bd_holder_disks);
 #endif
+       INIT_WORK(&bdev->bd_release, bdev_release);
        inode_init_once(&ei->vfs_inode);
        /* Initialize mutex for freeze. */
        mutex_init(&bdev->bd_fsfreeze_mutex);
@@ -1268,6 +1279,8 @@ static int __blkdev_get(struct block_device
*bdev, fmode_t mode, int for_part)
        mutex_lock_nested(&bdev->bd_mutex, for_part);
        if (!bdev->bd_openers) {
                bdev->bd_disk = disk;
+               if (!blk_get_queue(disk->queue))
+                       goto out_clear;
                bdev->bd_queue = disk->queue;
                bdev->bd_contains = bdev;

@@ -1288,7 +1301,6 @@ static int __blkdev_get(struct block_device
*bdev, fmode_t mode, int for_part)
                                        disk_put_part(bdev->bd_part);
                                        bdev->bd_part = NULL;
                                        bdev->bd_disk = NULL;
-                                       bdev->bd_queue = NULL;
                                        mutex_unlock(&bdev->bd_mutex);
                                        disk_unblock_events(disk);
                                        put_disk(disk);
@@ -1364,7 +1376,6 @@ static int __blkdev_get(struct block_device
*bdev, fmode_t mode, int for_part)
        disk_put_part(bdev->bd_part);
        bdev->bd_disk = NULL;
        bdev->bd_part = NULL;
-       bdev->bd_queue = NULL;
        if (bdev != bdev->bd_contains)
                __blkdev_put(bdev->bd_contains, mode, 1);
        bdev->bd_contains = NULL;
diff --git a/include/linux/fs.h b/include/linux/fs.h
index dc0478c07b2a..f084e4a2198b 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -488,6 +488,7 @@ struct block_device {
        int                     bd_fsfreeze_count;
        /* Mutex for freeze */
        struct mutex            bd_fsfreeze_mutex;
+       struct work_struct      bd_release;
 };

 /*

[-- Attachment #2: bdev-queue-lifetime.patch --]
[-- Type: text/x-patch, Size: 2777 bytes --]

diff --git a/block/blk-core.c b/block/blk-core.c
index 7c1a24fb09d8..6742c49b38e4 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -378,7 +378,8 @@ EXPORT_SYMBOL(blk_run_queue);
 
 void blk_put_queue(struct request_queue *q)
 {
-	kobject_put(&q->kobj);
+	if (q)
+		kobject_put(&q->kobj);
 }
 EXPORT_SYMBOL(blk_put_queue);
 
diff --git a/fs/block_dev.c b/fs/block_dev.c
index 05b553368bb4..7503110e37ff 100644
--- a/fs/block_dev.c
+++ b/fs/block_dev.c
@@ -589,12 +589,22 @@ static struct inode *bdev_alloc_inode(struct super_block *sb)
 	return &ei->vfs_inode;
 }
 
+static void bdev_release(struct work_struct *w)
+{
+	struct block_device *bdev = container_of(w, typeof(*bdev), bd_release);
+	struct bdev_inode *bdi = container_of(bdev, typeof(*bdi), bdev);
+
+	blk_put_queue(bdev->bd_queue);
+	kmem_cache_free(bdev_cachep, bdi);
+}
+
 static void bdev_i_callback(struct rcu_head *head)
 {
 	struct inode *inode = container_of(head, struct inode, i_rcu);
-	struct bdev_inode *bdi = BDEV_I(inode);
+	struct block_device *bdev = &BDEV_I(inode)->bdev;
 
-	kmem_cache_free(bdev_cachep, bdi);
+	/* blk_put_queue needs process context */
+	schedule_work(&bdev->bd_release);
 }
 
 static void bdev_destroy_inode(struct inode *inode)
@@ -613,6 +623,7 @@ static void init_once(void *foo)
 #ifdef CONFIG_SYSFS
 	INIT_LIST_HEAD(&bdev->bd_holder_disks);
 #endif
+	INIT_WORK(&bdev->bd_release, bdev_release);
 	inode_init_once(&ei->vfs_inode);
 	/* Initialize mutex for freeze. */
 	mutex_init(&bdev->bd_fsfreeze_mutex);
@@ -1268,6 +1279,8 @@ static int __blkdev_get(struct block_device *bdev, fmode_t mode, int for_part)
 	mutex_lock_nested(&bdev->bd_mutex, for_part);
 	if (!bdev->bd_openers) {
 		bdev->bd_disk = disk;
+		if (!blk_get_queue(disk->queue))
+			goto out_clear;
 		bdev->bd_queue = disk->queue;
 		bdev->bd_contains = bdev;
 
@@ -1288,7 +1301,6 @@ static int __blkdev_get(struct block_device *bdev, fmode_t mode, int for_part)
 					disk_put_part(bdev->bd_part);
 					bdev->bd_part = NULL;
 					bdev->bd_disk = NULL;
-					bdev->bd_queue = NULL;
 					mutex_unlock(&bdev->bd_mutex);
 					disk_unblock_events(disk);
 					put_disk(disk);
@@ -1364,7 +1376,6 @@ static int __blkdev_get(struct block_device *bdev, fmode_t mode, int for_part)
 	disk_put_part(bdev->bd_part);
 	bdev->bd_disk = NULL;
 	bdev->bd_part = NULL;
-	bdev->bd_queue = NULL;
 	if (bdev != bdev->bd_contains)
 		__blkdev_put(bdev->bd_contains, mode, 1);
 	bdev->bd_contains = NULL;
diff --git a/include/linux/fs.h b/include/linux/fs.h
index dc0478c07b2a..f084e4a2198b 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -488,6 +488,7 @@ struct block_device {
 	int			bd_fsfreeze_count;
 	/* Mutex for freeze */
 	struct mutex		bd_fsfreeze_mutex;
+	struct work_struct	bd_release;
 };
 
 /*

  reply	other threads:[~2017-01-06 17:45 UTC|newest]

Thread overview: 7+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2017-01-06  1:17 [PATCH] block: fix blk_get_backing_dev_info() crash, use bdev->bd_queue Dan Williams
2017-01-06 10:23 ` Jan Kara
2017-01-06 17:45   ` Dan Williams [this message]
2017-01-08 19:46     ` Jan Kara
2017-01-08 20:50       ` Dan Williams
2017-01-10  1:59         ` Dan Williams
2017-01-10 15:57   ` Christoph Hellwig

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=CAPcyv4gmvGkAbse619WOBsEd8DDn8Fe4kVJ7r+iYug8DEg+cEg@mail.gmail.com \
    --to=dan.j.williams@intel.com \
    --cc=ak@linux.intel.com \
    --cc=axboe@fb.com \
    --cc=fangwei1@huawei.com \
    --cc=hch@lst.de \
    --cc=jack@suse.cz \
    --cc=jmoyer@redhat.com \
    --cc=linux-block@vger.kernel.org \
    --cc=linux-kernel@vger.kernel.org \
    --cc=rabinv@axis.com \
    --cc=stable@vger.kernel.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).