linux-kernel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* [PATCH] block: fix blk_get_backing_dev_info() crash, use bdev->bd_queue
@ 2017-01-06  1:17 Dan Williams
  2017-01-06 10:23 ` Jan Kara
  0 siblings, 1 reply; 7+ messages in thread
From: Dan Williams @ 2017-01-06  1:17 UTC (permalink / raw)
  To: axboe
  Cc: Andi Kleen, Jan Kara, Rabin Vincent, linux-kernel, stable,
	linux-block, Jeff Moyer, Wei Fang, Christoph Hellwig

The ->bd_queue member of struct block_device was added in commit
87192a2a49c4 ("vfs: cache request_queue in struct block_device") in
v3.3. However, blk_get_backing_dev_info() has been using
bdev_get_queue() and grabbing the request_queue through the gendisk
since before the git era.

At final __blkdev_put() time ->bd_disk is cleared while ->bd_queue is
not. The queue remains valid until the final put of the parent disk.

The following crash signature results from blk_get_backing_dev_info()
trying to lookup the queue through ->bd_disk after the final put of the
block device. Simply switch bdev_get_queue() to use ->bd_queue directly
which is guaranteed to still be valid at invalidate_partition() time.

 BUG: unable to handle kernel NULL pointer dereference at 0000000000000568
 IP: blk_get_backing_dev_info+0x10/0x20
 [..]
 Call Trace:
  __inode_attach_wb+0x3a7/0x5d0
  __filemap_fdatawrite_range+0xf8/0x100
  filemap_write_and_wait+0x40/0x90
  fsync_bdev+0x54/0x60
  ? bdget_disk+0x30/0x40
  invalidate_partition+0x24/0x50
  del_gendisk+0xfa/0x230

Cc: <stable@vger.kernel.org>
Cc: Jan Kara <jack@suse.cz>
Cc: Jens Axboe <axboe@fb.com>
Cc: Jeff Moyer <jmoyer@redhat.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Wei Fang <fangwei1@huawei.com>
Cc: Rabin Vincent <rabinv@axis.com>
Cc: Andi Kleen <ak@linux.intel.com>
Fixes: 87192a2a49c4 ("vfs: cache request_queue in struct block_device")
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
---
 block/blk-core.c       |    4 ++--
 include/linux/blkdev.h |    2 +-
 2 files changed, 3 insertions(+), 3 deletions(-)

diff --git a/block/blk-core.c b/block/blk-core.c
index 61ba08c58b64..04737548e1e1 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -109,8 +109,8 @@ void blk_queue_congestion_threshold(struct request_queue *q)
  * @bdev:	device
  *
  * Locates the passed device's request queue and returns the address of its
- * backing_dev_info.  This function can only be called if @bdev is opened
- * and the return value is never NULL.
+ * backing_dev_info.  This function can only be called while there is an
+ * active reference against the parent gendisk.
  */
 struct backing_dev_info *blk_get_backing_dev_info(struct block_device *bdev)
 {
diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
index 83695641bd5e..27779709f821 100644
--- a/include/linux/blkdev.h
+++ b/include/linux/blkdev.h
@@ -962,7 +962,7 @@ bool blk_mq_poll(struct request_queue *q, blk_qc_t cookie);
 
 static inline struct request_queue *bdev_get_queue(struct block_device *bdev)
 {
-	return bdev->bd_disk->queue;	/* this is never NULL */
+	return bdev->bd_queue;	/* this is never NULL */
 }
 
 /*

^ permalink raw reply related	[flat|nested] 7+ messages in thread

* Re: [PATCH] block: fix blk_get_backing_dev_info() crash, use bdev->bd_queue
  2017-01-06  1:17 [PATCH] block: fix blk_get_backing_dev_info() crash, use bdev->bd_queue Dan Williams
@ 2017-01-06 10:23 ` Jan Kara
  2017-01-06 17:45   ` Dan Williams
  2017-01-10 15:57   ` Christoph Hellwig
  0 siblings, 2 replies; 7+ messages in thread
From: Jan Kara @ 2017-01-06 10:23 UTC (permalink / raw)
  To: Dan Williams
  Cc: axboe, Andi Kleen, Jan Kara, Rabin Vincent, linux-kernel, stable,
	linux-block, Jeff Moyer, Wei Fang, Christoph Hellwig

On Thu 05-01-17 17:17:55, Dan Williams wrote:
> The ->bd_queue member of struct block_device was added in commit
> 87192a2a49c4 ("vfs: cache request_queue in struct block_device") in
> v3.3. However, blk_get_backing_dev_info() has been using
> bdev_get_queue() and grabbing the request_queue through the gendisk
> since before the git era.
> 
> At final __blkdev_put() time ->bd_disk is cleared while ->bd_queue is
> not. The queue remains valid until the final put of the parent disk.
> 
> The following crash signature results from blk_get_backing_dev_info()
> trying to lookup the queue through ->bd_disk after the final put of the
> block device. Simply switch bdev_get_queue() to use ->bd_queue directly
> which is guaranteed to still be valid at invalidate_partition() time.
> 
>  BUG: unable to handle kernel NULL pointer dereference at 0000000000000568
>  IP: blk_get_backing_dev_info+0x10/0x20
>  [..]
>  Call Trace:
>   __inode_attach_wb+0x3a7/0x5d0
>   __filemap_fdatawrite_range+0xf8/0x100
>   filemap_write_and_wait+0x40/0x90
>   fsync_bdev+0x54/0x60
>   ? bdget_disk+0x30/0x40
>   invalidate_partition+0x24/0x50
>   del_gendisk+0xfa/0x230

So we have a similar reports of the same problem. E.g.:

http://www.spinics.net/lists/linux-fsdevel/msg105153.html

However I kind of miss how your patch would fix all those cases. The
principial problem is that inode_to_bdi() called on block device inode
wants to get the backing_dev_info however on last close of a block device
we do put_disk() and thus the request queue containing backing_dev_info
does not have to be around at that time. In your case you are lucky enough
to have the containing disk still around but that's not the case for all
inode_to_bdi() users (see e.g. the report I referenced) and your patch
would change relatively straightforward NULL pointer dereference to rather
subtle use-after-free issue so I disagree with going down this path.

So what I think needs to be done is that we make backing_dev_info
independently allocated structure with different lifetime rules to gendisk
or request_queue - definitely I want it to live as long as block device
inode exists. However it needs more thought what the exact lifetime rules
will be.

								Honza
> 
> Cc: <stable@vger.kernel.org>
> Cc: Jan Kara <jack@suse.cz>
> Cc: Jens Axboe <axboe@fb.com>
> Cc: Jeff Moyer <jmoyer@redhat.com>
> Cc: Christoph Hellwig <hch@lst.de>
> Cc: Wei Fang <fangwei1@huawei.com>
> Cc: Rabin Vincent <rabinv@axis.com>
> Cc: Andi Kleen <ak@linux.intel.com>
> Fixes: 87192a2a49c4 ("vfs: cache request_queue in struct block_device")
> Signed-off-by: Dan Williams <dan.j.williams@intel.com>
> ---
>  block/blk-core.c       |    4 ++--
>  include/linux/blkdev.h |    2 +-
>  2 files changed, 3 insertions(+), 3 deletions(-)
> 
> diff --git a/block/blk-core.c b/block/blk-core.c
> index 61ba08c58b64..04737548e1e1 100644
> --- a/block/blk-core.c
> +++ b/block/blk-core.c
> @@ -109,8 +109,8 @@ void blk_queue_congestion_threshold(struct request_queue *q)
>   * @bdev:	device
>   *
>   * Locates the passed device's request queue and returns the address of its
> - * backing_dev_info.  This function can only be called if @bdev is opened
> - * and the return value is never NULL.
> + * backing_dev_info.  This function can only be called while there is an
> + * active reference against the parent gendisk.
>   */
>  struct backing_dev_info *blk_get_backing_dev_info(struct block_device *bdev)
>  {
> diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
> index 83695641bd5e..27779709f821 100644
> --- a/include/linux/blkdev.h
> +++ b/include/linux/blkdev.h
> @@ -962,7 +962,7 @@ bool blk_mq_poll(struct request_queue *q, blk_qc_t cookie);
>  
>  static inline struct request_queue *bdev_get_queue(struct block_device *bdev)
>  {
> -	return bdev->bd_disk->queue;	/* this is never NULL */
> +	return bdev->bd_queue;	/* this is never NULL */
>  }
>  
>  /*
> 
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: [PATCH] block: fix blk_get_backing_dev_info() crash, use bdev->bd_queue
  2017-01-06 10:23 ` Jan Kara
@ 2017-01-06 17:45   ` Dan Williams
  2017-01-08 19:46     ` Jan Kara
  2017-01-10 15:57   ` Christoph Hellwig
  1 sibling, 1 reply; 7+ messages in thread
From: Dan Williams @ 2017-01-06 17:45 UTC (permalink / raw)
  To: Jan Kara
  Cc: Jens Axboe, Andi Kleen, Rabin Vincent, linux-kernel, stable,
	linux-block, Jeff Moyer, Wei Fang, Christoph Hellwig

[-- Attachment #1: Type: text/plain, Size: 6118 bytes --]

On Fri, Jan 6, 2017 at 2:23 AM, Jan Kara <jack@suse.cz> wrote:
> On Thu 05-01-17 17:17:55, Dan Williams wrote:
>> The ->bd_queue member of struct block_device was added in commit
>> 87192a2a49c4 ("vfs: cache request_queue in struct block_device") in
>> v3.3. However, blk_get_backing_dev_info() has been using
>> bdev_get_queue() and grabbing the request_queue through the gendisk
>> since before the git era.
>>
>> At final __blkdev_put() time ->bd_disk is cleared while ->bd_queue is
>> not. The queue remains valid until the final put of the parent disk.
>>
>> The following crash signature results from blk_get_backing_dev_info()
>> trying to lookup the queue through ->bd_disk after the final put of the
>> block device. Simply switch bdev_get_queue() to use ->bd_queue directly
>> which is guaranteed to still be valid at invalidate_partition() time.
>>
>>  BUG: unable to handle kernel NULL pointer dereference at 0000000000000568
>>  IP: blk_get_backing_dev_info+0x10/0x20
>>  [..]
>>  Call Trace:
>>   __inode_attach_wb+0x3a7/0x5d0
>>   __filemap_fdatawrite_range+0xf8/0x100
>>   filemap_write_and_wait+0x40/0x90
>>   fsync_bdev+0x54/0x60
>>   ? bdget_disk+0x30/0x40
>>   invalidate_partition+0x24/0x50
>>   del_gendisk+0xfa/0x230
>
> So we have a similar reports of the same problem. E.g.:
>
> http://www.spinics.net/lists/linux-fsdevel/msg105153.html
>
> However I kind of miss how your patch would fix all those cases. The
> principial problem is that inode_to_bdi() called on block device inode
> wants to get the backing_dev_info however on last close of a block device
> we do put_disk() and thus the request queue containing backing_dev_info
> does not have to be around at that time. In your case you are lucky enough
> to have the containing disk still around but that's not the case for all
> inode_to_bdi() users (see e.g. the report I referenced) and your patch
> would change relatively straightforward NULL pointer dereference to rather
> subtle use-after-free issue

True. If there are other cases that don't hold their own queue
reference this patch makes things worse.

> so I disagree with going down this path.

I still think this patch is the right thing to do, but it needs to
come after the wider guarantee that having an active bdev reference
guarantees that the queue and backing_dev_info are still allocated.

> So what I think needs to be done is that we make backing_dev_info
> independently allocated structure with different lifetime rules to gendisk
> or request_queue - definitely I want it to live as long as block device
> inode exists. However it needs more thought what the exact lifetime rules
> will be.

Hmm, why does it need to be separately allocated?

Something like this, passes the libnvdimm unit tests: (non-whitespace
damaged version attached)

diff --git a/block/blk-core.c b/block/blk-core.c
index 7c1a24fb09d8..6742c49b38e4 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -378,7 +378,8 @@ EXPORT_SYMBOL(blk_run_queue);

 void blk_put_queue(struct request_queue *q)
 {
-       kobject_put(&q->kobj);
+       if (q)
+               kobject_put(&q->kobj);
 }
 EXPORT_SYMBOL(blk_put_queue);

diff --git a/fs/block_dev.c b/fs/block_dev.c
index 05b553368bb4..7503110e37ff 100644
--- a/fs/block_dev.c
+++ b/fs/block_dev.c
@@ -589,12 +589,22 @@ static struct inode *bdev_alloc_inode(struct
super_block *sb)
        return &ei->vfs_inode;
 }

+static void bdev_release(struct work_struct *w)
+{
+       struct block_device *bdev = container_of(w, typeof(*bdev), bd_release);
+       struct bdev_inode *bdi = container_of(bdev, typeof(*bdi), bdev);
+
+       blk_put_queue(bdev->bd_queue);
+       kmem_cache_free(bdev_cachep, bdi);
+}
+
 static void bdev_i_callback(struct rcu_head *head)
 {
        struct inode *inode = container_of(head, struct inode, i_rcu);
-       struct bdev_inode *bdi = BDEV_I(inode);
+       struct block_device *bdev = &BDEV_I(inode)->bdev;

-       kmem_cache_free(bdev_cachep, bdi);
+       /* blk_put_queue needs process context */
+       schedule_work(&bdev->bd_release);
 }

 static void bdev_destroy_inode(struct inode *inode)
@@ -613,6 +623,7 @@ static void init_once(void *foo)
 #ifdef CONFIG_SYSFS
        INIT_LIST_HEAD(&bdev->bd_holder_disks);
 #endif
+       INIT_WORK(&bdev->bd_release, bdev_release);
        inode_init_once(&ei->vfs_inode);
        /* Initialize mutex for freeze. */
        mutex_init(&bdev->bd_fsfreeze_mutex);
@@ -1268,6 +1279,8 @@ static int __blkdev_get(struct block_device
*bdev, fmode_t mode, int for_part)
        mutex_lock_nested(&bdev->bd_mutex, for_part);
        if (!bdev->bd_openers) {
                bdev->bd_disk = disk;
+               if (!blk_get_queue(disk->queue))
+                       goto out_clear;
                bdev->bd_queue = disk->queue;
                bdev->bd_contains = bdev;

@@ -1288,7 +1301,6 @@ static int __blkdev_get(struct block_device
*bdev, fmode_t mode, int for_part)
                                        disk_put_part(bdev->bd_part);
                                        bdev->bd_part = NULL;
                                        bdev->bd_disk = NULL;
-                                       bdev->bd_queue = NULL;
                                        mutex_unlock(&bdev->bd_mutex);
                                        disk_unblock_events(disk);
                                        put_disk(disk);
@@ -1364,7 +1376,6 @@ static int __blkdev_get(struct block_device
*bdev, fmode_t mode, int for_part)
        disk_put_part(bdev->bd_part);
        bdev->bd_disk = NULL;
        bdev->bd_part = NULL;
-       bdev->bd_queue = NULL;
        if (bdev != bdev->bd_contains)
                __blkdev_put(bdev->bd_contains, mode, 1);
        bdev->bd_contains = NULL;
diff --git a/include/linux/fs.h b/include/linux/fs.h
index dc0478c07b2a..f084e4a2198b 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -488,6 +488,7 @@ struct block_device {
        int                     bd_fsfreeze_count;
        /* Mutex for freeze */
        struct mutex            bd_fsfreeze_mutex;
+       struct work_struct      bd_release;
 };

 /*

[-- Attachment #2: bdev-queue-lifetime.patch --]
[-- Type: text/x-patch, Size: 2777 bytes --]

diff --git a/block/blk-core.c b/block/blk-core.c
index 7c1a24fb09d8..6742c49b38e4 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -378,7 +378,8 @@ EXPORT_SYMBOL(blk_run_queue);
 
 void blk_put_queue(struct request_queue *q)
 {
-	kobject_put(&q->kobj);
+	if (q)
+		kobject_put(&q->kobj);
 }
 EXPORT_SYMBOL(blk_put_queue);
 
diff --git a/fs/block_dev.c b/fs/block_dev.c
index 05b553368bb4..7503110e37ff 100644
--- a/fs/block_dev.c
+++ b/fs/block_dev.c
@@ -589,12 +589,22 @@ static struct inode *bdev_alloc_inode(struct super_block *sb)
 	return &ei->vfs_inode;
 }
 
+static void bdev_release(struct work_struct *w)
+{
+	struct block_device *bdev = container_of(w, typeof(*bdev), bd_release);
+	struct bdev_inode *bdi = container_of(bdev, typeof(*bdi), bdev);
+
+	blk_put_queue(bdev->bd_queue);
+	kmem_cache_free(bdev_cachep, bdi);
+}
+
 static void bdev_i_callback(struct rcu_head *head)
 {
 	struct inode *inode = container_of(head, struct inode, i_rcu);
-	struct bdev_inode *bdi = BDEV_I(inode);
+	struct block_device *bdev = &BDEV_I(inode)->bdev;
 
-	kmem_cache_free(bdev_cachep, bdi);
+	/* blk_put_queue needs process context */
+	schedule_work(&bdev->bd_release);
 }
 
 static void bdev_destroy_inode(struct inode *inode)
@@ -613,6 +623,7 @@ static void init_once(void *foo)
 #ifdef CONFIG_SYSFS
 	INIT_LIST_HEAD(&bdev->bd_holder_disks);
 #endif
+	INIT_WORK(&bdev->bd_release, bdev_release);
 	inode_init_once(&ei->vfs_inode);
 	/* Initialize mutex for freeze. */
 	mutex_init(&bdev->bd_fsfreeze_mutex);
@@ -1268,6 +1279,8 @@ static int __blkdev_get(struct block_device *bdev, fmode_t mode, int for_part)
 	mutex_lock_nested(&bdev->bd_mutex, for_part);
 	if (!bdev->bd_openers) {
 		bdev->bd_disk = disk;
+		if (!blk_get_queue(disk->queue))
+			goto out_clear;
 		bdev->bd_queue = disk->queue;
 		bdev->bd_contains = bdev;
 
@@ -1288,7 +1301,6 @@ static int __blkdev_get(struct block_device *bdev, fmode_t mode, int for_part)
 					disk_put_part(bdev->bd_part);
 					bdev->bd_part = NULL;
 					bdev->bd_disk = NULL;
-					bdev->bd_queue = NULL;
 					mutex_unlock(&bdev->bd_mutex);
 					disk_unblock_events(disk);
 					put_disk(disk);
@@ -1364,7 +1376,6 @@ static int __blkdev_get(struct block_device *bdev, fmode_t mode, int for_part)
 	disk_put_part(bdev->bd_part);
 	bdev->bd_disk = NULL;
 	bdev->bd_part = NULL;
-	bdev->bd_queue = NULL;
 	if (bdev != bdev->bd_contains)
 		__blkdev_put(bdev->bd_contains, mode, 1);
 	bdev->bd_contains = NULL;
diff --git a/include/linux/fs.h b/include/linux/fs.h
index dc0478c07b2a..f084e4a2198b 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -488,6 +488,7 @@ struct block_device {
 	int			bd_fsfreeze_count;
 	/* Mutex for freeze */
 	struct mutex		bd_fsfreeze_mutex;
+	struct work_struct	bd_release;
 };
 
 /*

^ permalink raw reply related	[flat|nested] 7+ messages in thread

* Re: [PATCH] block: fix blk_get_backing_dev_info() crash, use bdev->bd_queue
  2017-01-06 17:45   ` Dan Williams
@ 2017-01-08 19:46     ` Jan Kara
  2017-01-08 20:50       ` Dan Williams
  0 siblings, 1 reply; 7+ messages in thread
From: Jan Kara @ 2017-01-08 19:46 UTC (permalink / raw)
  To: Dan Williams
  Cc: Jan Kara, Jens Axboe, Andi Kleen, Rabin Vincent, linux-kernel,
	stable, linux-block, Jeff Moyer, Wei Fang, Christoph Hellwig

On Fri 06-01-17 09:45:45, Dan Williams wrote:
> On Fri, Jan 6, 2017 at 2:23 AM, Jan Kara <jack@suse.cz> wrote:
> > On Thu 05-01-17 17:17:55, Dan Williams wrote:
> >> The ->bd_queue member of struct block_device was added in commit
> >> 87192a2a49c4 ("vfs: cache request_queue in struct block_device") in
> >> v3.3. However, blk_get_backing_dev_info() has been using
> >> bdev_get_queue() and grabbing the request_queue through the gendisk
> >> since before the git era.
> >>
> >> At final __blkdev_put() time ->bd_disk is cleared while ->bd_queue is
> >> not. The queue remains valid until the final put of the parent disk.
> >>
> >> The following crash signature results from blk_get_backing_dev_info()
> >> trying to lookup the queue through ->bd_disk after the final put of the
> >> block device. Simply switch bdev_get_queue() to use ->bd_queue directly
> >> which is guaranteed to still be valid at invalidate_partition() time.
> >>
> >>  BUG: unable to handle kernel NULL pointer dereference at 0000000000000568
> >>  IP: blk_get_backing_dev_info+0x10/0x20
> >>  [..]
> >>  Call Trace:
> >>   __inode_attach_wb+0x3a7/0x5d0
> >>   __filemap_fdatawrite_range+0xf8/0x100
> >>   filemap_write_and_wait+0x40/0x90
> >>   fsync_bdev+0x54/0x60
> >>   ? bdget_disk+0x30/0x40
> >>   invalidate_partition+0x24/0x50
> >>   del_gendisk+0xfa/0x230
> >
> > So we have a similar reports of the same problem. E.g.:
> >
> > http://www.spinics.net/lists/linux-fsdevel/msg105153.html
> >
> > However I kind of miss how your patch would fix all those cases. The
> > principial problem is that inode_to_bdi() called on block device inode
> > wants to get the backing_dev_info however on last close of a block device
> > we do put_disk() and thus the request queue containing backing_dev_info
> > does not have to be around at that time. In your case you are lucky enough
> > to have the containing disk still around but that's not the case for all
> > inode_to_bdi() users (see e.g. the report I referenced) and your patch
> > would change relatively straightforward NULL pointer dereference to rather
> > subtle use-after-free issue
> 
> True. If there are other cases that don't hold their own queue
> reference this patch makes things worse.
> 
> > so I disagree with going down this path.
> 
> I still think this patch is the right thing to do, but it needs to
> come after the wider guarantee that having an active bdev reference
> guarantees that the queue and backing_dev_info are still allocated.
> 
> > So what I think needs to be done is that we make backing_dev_info
> > independently allocated structure with different lifetime rules to gendisk
> > or request_queue - definitely I want it to live as long as block device
> > inode exists. However it needs more thought what the exact lifetime rules
> > will be.
> 
> Hmm, why does it need to be separately allocated?
> 
> Something like this, passes the libnvdimm unit tests: (non-whitespace
> damaged version attached)

So the problem with this approach is that request queue will be pinned while
bdev inode exists. And how long that is is impossible to predict or influence
from userspace so e.g. you cannot remove device driver from memory and even
unplugging USB after it has been unmounted would suddently go via a path of
"device removed while it is used" which can have unexpected consequences. I
guess Jens or Christoph will know more about the details...

I have prototyped patches which split backing_dev_info from request_queue
and it was not even that difficult in the end. I'll give those patches some
testing and post them for comments...

								Honza

> 
> diff --git a/block/blk-core.c b/block/blk-core.c
> index 7c1a24fb09d8..6742c49b38e4 100644
> --- a/block/blk-core.c
> +++ b/block/blk-core.c
> @@ -378,7 +378,8 @@ EXPORT_SYMBOL(blk_run_queue);
> 
>  void blk_put_queue(struct request_queue *q)
>  {
> -       kobject_put(&q->kobj);
> +       if (q)
> +               kobject_put(&q->kobj);
>  }
>  EXPORT_SYMBOL(blk_put_queue);
> 
> diff --git a/fs/block_dev.c b/fs/block_dev.c
> index 05b553368bb4..7503110e37ff 100644
> --- a/fs/block_dev.c
> +++ b/fs/block_dev.c
> @@ -589,12 +589,22 @@ static struct inode *bdev_alloc_inode(struct
> super_block *sb)
>         return &ei->vfs_inode;
>  }
> 
> +static void bdev_release(struct work_struct *w)
> +{
> +       struct block_device *bdev = container_of(w, typeof(*bdev), bd_release);
> +       struct bdev_inode *bdi = container_of(bdev, typeof(*bdi), bdev);
> +
> +       blk_put_queue(bdev->bd_queue);
> +       kmem_cache_free(bdev_cachep, bdi);
> +}
> +
>  static void bdev_i_callback(struct rcu_head *head)
>  {
>         struct inode *inode = container_of(head, struct inode, i_rcu);
> -       struct bdev_inode *bdi = BDEV_I(inode);
> +       struct block_device *bdev = &BDEV_I(inode)->bdev;
> 
> -       kmem_cache_free(bdev_cachep, bdi);
> +       /* blk_put_queue needs process context */
> +       schedule_work(&bdev->bd_release);
>  }
> 
>  static void bdev_destroy_inode(struct inode *inode)
> @@ -613,6 +623,7 @@ static void init_once(void *foo)
>  #ifdef CONFIG_SYSFS
>         INIT_LIST_HEAD(&bdev->bd_holder_disks);
>  #endif
> +       INIT_WORK(&bdev->bd_release, bdev_release);
>         inode_init_once(&ei->vfs_inode);
>         /* Initialize mutex for freeze. */
>         mutex_init(&bdev->bd_fsfreeze_mutex);
> @@ -1268,6 +1279,8 @@ static int __blkdev_get(struct block_device
> *bdev, fmode_t mode, int for_part)
>         mutex_lock_nested(&bdev->bd_mutex, for_part);
>         if (!bdev->bd_openers) {
>                 bdev->bd_disk = disk;
> +               if (!blk_get_queue(disk->queue))
> +                       goto out_clear;
>                 bdev->bd_queue = disk->queue;
>                 bdev->bd_contains = bdev;
> 
> @@ -1288,7 +1301,6 @@ static int __blkdev_get(struct block_device
> *bdev, fmode_t mode, int for_part)
>                                         disk_put_part(bdev->bd_part);
>                                         bdev->bd_part = NULL;
>                                         bdev->bd_disk = NULL;
> -                                       bdev->bd_queue = NULL;
>                                         mutex_unlock(&bdev->bd_mutex);
>                                         disk_unblock_events(disk);
>                                         put_disk(disk);
> @@ -1364,7 +1376,6 @@ static int __blkdev_get(struct block_device
> *bdev, fmode_t mode, int for_part)
>         disk_put_part(bdev->bd_part);
>         bdev->bd_disk = NULL;
>         bdev->bd_part = NULL;
> -       bdev->bd_queue = NULL;
>         if (bdev != bdev->bd_contains)
>                 __blkdev_put(bdev->bd_contains, mode, 1);
>         bdev->bd_contains = NULL;
> diff --git a/include/linux/fs.h b/include/linux/fs.h
> index dc0478c07b2a..f084e4a2198b 100644
> --- a/include/linux/fs.h
> +++ b/include/linux/fs.h
> @@ -488,6 +488,7 @@ struct block_device {
>         int                     bd_fsfreeze_count;
>         /* Mutex for freeze */
>         struct mutex            bd_fsfreeze_mutex;
> +       struct work_struct      bd_release;
>  };
> 
>  /*

> diff --git a/block/blk-core.c b/block/blk-core.c
> index 7c1a24fb09d8..6742c49b38e4 100644
> --- a/block/blk-core.c
> +++ b/block/blk-core.c
> @@ -378,7 +378,8 @@ EXPORT_SYMBOL(blk_run_queue);
>  
>  void blk_put_queue(struct request_queue *q)
>  {
> -	kobject_put(&q->kobj);
> +	if (q)
> +		kobject_put(&q->kobj);
>  }
>  EXPORT_SYMBOL(blk_put_queue);
>  
> diff --git a/fs/block_dev.c b/fs/block_dev.c
> index 05b553368bb4..7503110e37ff 100644
> --- a/fs/block_dev.c
> +++ b/fs/block_dev.c
> @@ -589,12 +589,22 @@ static struct inode *bdev_alloc_inode(struct super_block *sb)
>  	return &ei->vfs_inode;
>  }
>  
> +static void bdev_release(struct work_struct *w)
> +{
> +	struct block_device *bdev = container_of(w, typeof(*bdev), bd_release);
> +	struct bdev_inode *bdi = container_of(bdev, typeof(*bdi), bdev);
> +
> +	blk_put_queue(bdev->bd_queue);
> +	kmem_cache_free(bdev_cachep, bdi);
> +}
> +
>  static void bdev_i_callback(struct rcu_head *head)
>  {
>  	struct inode *inode = container_of(head, struct inode, i_rcu);
> -	struct bdev_inode *bdi = BDEV_I(inode);
> +	struct block_device *bdev = &BDEV_I(inode)->bdev;
>  
> -	kmem_cache_free(bdev_cachep, bdi);
> +	/* blk_put_queue needs process context */
> +	schedule_work(&bdev->bd_release);
>  }
>  
>  static void bdev_destroy_inode(struct inode *inode)
> @@ -613,6 +623,7 @@ static void init_once(void *foo)
>  #ifdef CONFIG_SYSFS
>  	INIT_LIST_HEAD(&bdev->bd_holder_disks);
>  #endif
> +	INIT_WORK(&bdev->bd_release, bdev_release);
>  	inode_init_once(&ei->vfs_inode);
>  	/* Initialize mutex for freeze. */
>  	mutex_init(&bdev->bd_fsfreeze_mutex);
> @@ -1268,6 +1279,8 @@ static int __blkdev_get(struct block_device *bdev, fmode_t mode, int for_part)
>  	mutex_lock_nested(&bdev->bd_mutex, for_part);
>  	if (!bdev->bd_openers) {
>  		bdev->bd_disk = disk;
> +		if (!blk_get_queue(disk->queue))
> +			goto out_clear;
>  		bdev->bd_queue = disk->queue;
>  		bdev->bd_contains = bdev;
>  
> @@ -1288,7 +1301,6 @@ static int __blkdev_get(struct block_device *bdev, fmode_t mode, int for_part)
>  					disk_put_part(bdev->bd_part);
>  					bdev->bd_part = NULL;
>  					bdev->bd_disk = NULL;
> -					bdev->bd_queue = NULL;
>  					mutex_unlock(&bdev->bd_mutex);
>  					disk_unblock_events(disk);
>  					put_disk(disk);
> @@ -1364,7 +1376,6 @@ static int __blkdev_get(struct block_device *bdev, fmode_t mode, int for_part)
>  	disk_put_part(bdev->bd_part);
>  	bdev->bd_disk = NULL;
>  	bdev->bd_part = NULL;
> -	bdev->bd_queue = NULL;
>  	if (bdev != bdev->bd_contains)
>  		__blkdev_put(bdev->bd_contains, mode, 1);
>  	bdev->bd_contains = NULL;
> diff --git a/include/linux/fs.h b/include/linux/fs.h
> index dc0478c07b2a..f084e4a2198b 100644
> --- a/include/linux/fs.h
> +++ b/include/linux/fs.h
> @@ -488,6 +488,7 @@ struct block_device {
>  	int			bd_fsfreeze_count;
>  	/* Mutex for freeze */
>  	struct mutex		bd_fsfreeze_mutex;
> +	struct work_struct	bd_release;
>  };
>  
>  /*

-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: [PATCH] block: fix blk_get_backing_dev_info() crash, use bdev->bd_queue
  2017-01-08 19:46     ` Jan Kara
@ 2017-01-08 20:50       ` Dan Williams
  2017-01-10  1:59         ` Dan Williams
  0 siblings, 1 reply; 7+ messages in thread
From: Dan Williams @ 2017-01-08 20:50 UTC (permalink / raw)
  To: Jan Kara
  Cc: Jens Axboe, Andi Kleen, Rabin Vincent, linux-kernel, stable,
	linux-block, Jeff Moyer, Wei Fang, Christoph Hellwig

On Sun, Jan 8, 2017 at 11:46 AM, Jan Kara <jack@suse.cz> wrote:
> On Fri 06-01-17 09:45:45, Dan Williams wrote:
>> On Fri, Jan 6, 2017 at 2:23 AM, Jan Kara <jack@suse.cz> wrote:
>> > On Thu 05-01-17 17:17:55, Dan Williams wrote:
>> >> The ->bd_queue member of struct block_device was added in commit
>> >> 87192a2a49c4 ("vfs: cache request_queue in struct block_device") in
>> >> v3.3. However, blk_get_backing_dev_info() has been using
>> >> bdev_get_queue() and grabbing the request_queue through the gendisk
>> >> since before the git era.
>> >>
>> >> At final __blkdev_put() time ->bd_disk is cleared while ->bd_queue is
>> >> not. The queue remains valid until the final put of the parent disk.
>> >>
>> >> The following crash signature results from blk_get_backing_dev_info()
>> >> trying to lookup the queue through ->bd_disk after the final put of the
>> >> block device. Simply switch bdev_get_queue() to use ->bd_queue directly
>> >> which is guaranteed to still be valid at invalidate_partition() time.
>> >>
>> >>  BUG: unable to handle kernel NULL pointer dereference at 0000000000000568
>> >>  IP: blk_get_backing_dev_info+0x10/0x20
>> >>  [..]
>> >>  Call Trace:
>> >>   __inode_attach_wb+0x3a7/0x5d0
>> >>   __filemap_fdatawrite_range+0xf8/0x100
>> >>   filemap_write_and_wait+0x40/0x90
>> >>   fsync_bdev+0x54/0x60
>> >>   ? bdget_disk+0x30/0x40
>> >>   invalidate_partition+0x24/0x50
>> >>   del_gendisk+0xfa/0x230
>> >
>> > So we have a similar reports of the same problem. E.g.:
>> >
>> > http://www.spinics.net/lists/linux-fsdevel/msg105153.html
>> >
>> > However I kind of miss how your patch would fix all those cases. The
>> > principial problem is that inode_to_bdi() called on block device inode
>> > wants to get the backing_dev_info however on last close of a block device
>> > we do put_disk() and thus the request queue containing backing_dev_info
>> > does not have to be around at that time. In your case you are lucky enough
>> > to have the containing disk still around but that's not the case for all
>> > inode_to_bdi() users (see e.g. the report I referenced) and your patch
>> > would change relatively straightforward NULL pointer dereference to rather
>> > subtle use-after-free issue
>>
>> True. If there are other cases that don't hold their own queue
>> reference this patch makes things worse.
>>
>> > so I disagree with going down this path.
>>
>> I still think this patch is the right thing to do, but it needs to
>> come after the wider guarantee that having an active bdev reference
>> guarantees that the queue and backing_dev_info are still allocated.
>>
>> > So what I think needs to be done is that we make backing_dev_info
>> > independently allocated structure with different lifetime rules to gendisk
>> > or request_queue - definitely I want it to live as long as block device
>> > inode exists. However it needs more thought what the exact lifetime rules
>> > will be.
>>
>> Hmm, why does it need to be separately allocated?
>>
>> Something like this, passes the libnvdimm unit tests: (non-whitespace
>> damaged version attached)
>
> So the problem with this approach is that request queue will be pinned while
> bdev inode exists. And how long that is is impossible to predict or influence
> from userspace so e.g. you cannot remove device driver from memory and even
> unplugging USB after it has been unmounted would suddently go via a path of
> "device removed while it is used" which can have unexpected consequences. I
> guess Jens or Christoph will know more about the details...

We do have the "block, fs: reliably communicate bdev end-of-life"
effort that I need to revisit:

   http://www.spinics.net/lists/linux-fsdevel/msg93312.html

...but I don't immediately see how keeping the request_queue around
longer makes the situation worse?

> I have prototyped patches which split backing_dev_info from request_queue
> and it was not even that difficult in the end. I'll give those patches some
> testing and post them for comments...

As long as the crashes are addressed I don't much care which solution
goes upstream.

That said I think it is unfortunate that we introduced ->bd_queue in
v3.3, but most users are still pointer chasing through
->bd_disk->queue. I'll see if the 0day robot can find any performance
benefit to this change outside of trying to fix a bdev-unplug crash.

^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: [PATCH] block: fix blk_get_backing_dev_info() crash, use bdev->bd_queue
  2017-01-08 20:50       ` Dan Williams
@ 2017-01-10  1:59         ` Dan Williams
  0 siblings, 0 replies; 7+ messages in thread
From: Dan Williams @ 2017-01-10  1:59 UTC (permalink / raw)
  To: Jan Kara
  Cc: Jens Axboe, Andi Kleen, Rabin Vincent, linux-kernel, stable,
	linux-block, Jeff Moyer, Wei Fang, Christoph Hellwig

On Sun, Jan 8, 2017 at 12:50 PM, Dan Williams <dan.j.williams@intel.com> wrote:
> On Sun, Jan 8, 2017 at 11:46 AM, Jan Kara <jack@suse.cz> wrote:
>> On Fri 06-01-17 09:45:45, Dan Williams wrote:
>>> On Fri, Jan 6, 2017 at 2:23 AM, Jan Kara <jack@suse.cz> wrote:
>>> > On Thu 05-01-17 17:17:55, Dan Williams wrote:
>>> >> The ->bd_queue member of struct block_device was added in commit
>>> >> 87192a2a49c4 ("vfs: cache request_queue in struct block_device") in
>>> >> v3.3. However, blk_get_backing_dev_info() has been using
>>> >> bdev_get_queue() and grabbing the request_queue through the gendisk
>>> >> since before the git era.
>>> >>
>>> >> At final __blkdev_put() time ->bd_disk is cleared while ->bd_queue is
>>> >> not. The queue remains valid until the final put of the parent disk.
>>> >>
>>> >> The following crash signature results from blk_get_backing_dev_info()
>>> >> trying to lookup the queue through ->bd_disk after the final put of the
>>> >> block device. Simply switch bdev_get_queue() to use ->bd_queue directly
>>> >> which is guaranteed to still be valid at invalidate_partition() time.
>>> >>
>>> >>  BUG: unable to handle kernel NULL pointer dereference at 0000000000000568
>>> >>  IP: blk_get_backing_dev_info+0x10/0x20
>>> >>  [..]
>>> >>  Call Trace:
>>> >>   __inode_attach_wb+0x3a7/0x5d0
>>> >>   __filemap_fdatawrite_range+0xf8/0x100
>>> >>   filemap_write_and_wait+0x40/0x90
>>> >>   fsync_bdev+0x54/0x60
>>> >>   ? bdget_disk+0x30/0x40
>>> >>   invalidate_partition+0x24/0x50
>>> >>   del_gendisk+0xfa/0x230
>>> >
>>> > So we have a similar reports of the same problem. E.g.:
>>> >
>>> > http://www.spinics.net/lists/linux-fsdevel/msg105153.html
>>> >
>>> > However I kind of miss how your patch would fix all those cases. The
>>> > principial problem is that inode_to_bdi() called on block device inode
>>> > wants to get the backing_dev_info however on last close of a block device
>>> > we do put_disk() and thus the request queue containing backing_dev_info
>>> > does not have to be around at that time. In your case you are lucky enough
>>> > to have the containing disk still around but that's not the case for all
>>> > inode_to_bdi() users (see e.g. the report I referenced) and your patch
>>> > would change relatively straightforward NULL pointer dereference to rather
>>> > subtle use-after-free issue
>>>
>>> True. If there are other cases that don't hold their own queue
>>> reference this patch makes things worse.
>>>
>>> > so I disagree with going down this path.
>>>
>>> I still think this patch is the right thing to do, but it needs to
>>> come after the wider guarantee that having an active bdev reference
>>> guarantees that the queue and backing_dev_info are still allocated.
>>>
>>> > So what I think needs to be done is that we make backing_dev_info
>>> > independently allocated structure with different lifetime rules to gendisk
>>> > or request_queue - definitely I want it to live as long as block device
>>> > inode exists. However it needs more thought what the exact lifetime rules
>>> > will be.
>>>
>>> Hmm, why does it need to be separately allocated?
>>>
>>> Something like this, passes the libnvdimm unit tests: (non-whitespace
>>> damaged version attached)
>>
>> So the problem with this approach is that request queue will be pinned while
>> bdev inode exists. And how long that is is impossible to predict or influence
>> from userspace so e.g. you cannot remove device driver from memory and even
>> unplugging USB after it has been unmounted would suddently go via a path of
>> "device removed while it is used" which can have unexpected consequences. I
>> guess Jens or Christoph will know more about the details...
>
> We do have the "block, fs: reliably communicate bdev end-of-life"
> effort that I need to revisit:
>
>    http://www.spinics.net/lists/linux-fsdevel/msg93312.html
>
> ...but I don't immediately see how keeping the request_queue around
> longer makes the situation worse?
>

Well the 0day kbuild robot and further testing with the libnvdimm unit
tests shows that bdi code is not ready for this release timing change.
So I retract it, sorry for the noise.

^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: [PATCH] block: fix blk_get_backing_dev_info() crash, use bdev->bd_queue
  2017-01-06 10:23 ` Jan Kara
  2017-01-06 17:45   ` Dan Williams
@ 2017-01-10 15:57   ` Christoph Hellwig
  1 sibling, 0 replies; 7+ messages in thread
From: Christoph Hellwig @ 2017-01-10 15:57 UTC (permalink / raw)
  To: Jan Kara
  Cc: Dan Williams, axboe, Andi Kleen, Rabin Vincent, linux-kernel,
	stable, linux-block, Jeff Moyer, Wei Fang, Christoph Hellwig

On Fri, Jan 06, 2017 at 11:23:30AM +0100, Jan Kara wrote:
> So what I think needs to be done is that we make backing_dev_info
> independently allocated structure with different lifetime rules to gendisk
> or request_queue - definitely I want it to live as long as block device
> inode exists. However it needs more thought what the exact lifetime rules
> will be.

Yes, moving the backing dev out (or keeping the request_queue memoery alive)
is something we need to do.  And we need to look into the other half
of the backing_dev life time isssues - we need to make sure the name
and sysfs resources are released as soon as the devices goes away so that
we aren't going to run into reuse races.

^ permalink raw reply	[flat|nested] 7+ messages in thread

end of thread, other threads:[~2017-01-10 15:57 UTC | newest]

Thread overview: 7+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2017-01-06  1:17 [PATCH] block: fix blk_get_backing_dev_info() crash, use bdev->bd_queue Dan Williams
2017-01-06 10:23 ` Jan Kara
2017-01-06 17:45   ` Dan Williams
2017-01-08 19:46     ` Jan Kara
2017-01-08 20:50       ` Dan Williams
2017-01-10  1:59         ` Dan Williams
2017-01-10 15:57   ` Christoph Hellwig

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).