All of lore.kernel.org
 help / color / mirror / Atom feed
* [PATCH] block: detach bdev inode from its wb in __blkdev_put()
@ 2015-11-20 21:22 Ilya Dryomov
  2015-11-20 21:23 ` Tejun Heo
                   ` (2 more replies)
  0 siblings, 3 replies; 12+ messages in thread
From: Ilya Dryomov @ 2015-11-20 21:22 UTC (permalink / raw)
  To: Alexander Viro; +Cc: Tejun Heo, Christoph Hellwig, linux-fsdevel

Since 52ebea749aae ("writeback: make backing_dev_info host
cgroup-specific bdi_writebacks") inode, at some point in its lifetime,
gets attached to a wb (struct bdi_writeback).  Detaching happens on
evict, in inode_detach_wb() called from __destroy_inode(), and involves
updating wb.

However, detaching an internal bdev inode from its wb in
__destroy_inode() is too late.  Its bdi and by extension root wb are
embedded into struct request_queue, which has different lifetime rules
and can be freed long before the final bdput() is called (can be from
__fput() of a corresponding /dev inode, through dput() - evict() -
bd_forget().  bdevs hold onto the underlying disk/queue pair only while
opened; as soon as bdev is closed all bets are off.  In fact,
disk/queue can be gone before __blkdev_put() even returns:

1499 static void __blkdev_put(struct block_device *bdev, fmode_t mode, int for_part)
1500 {
...
1518         if (bdev->bd_contains == bdev) {
1519                 if (disk->fops->release)
1520                         disk->fops->release(disk, mode);

[ Driver puts its references to disk/queue ]

1521         }
1522         if (!bdev->bd_openers) {
1523                 struct module *owner = disk->fops->owner;
1524
1525                 disk_put_part(bdev->bd_part);
1526                 bdev->bd_part = NULL;
1527                 bdev->bd_disk = NULL;
1528                 if (bdev != bdev->bd_contains)
1529                         victim = bdev->bd_contains;
1530                 bdev->bd_contains = NULL;
1531
1532                 put_disk(disk);

[ We put ours, the queue is gone
  The last bdput() would result in a write to invalid memory ]

1533                 module_put(owner);
...
1539 }

Since bdev inodes are special anyway, detach them in __blkdev_put()
after clearing inode's dirty bits, turning the problematic
inode_detach_wb() in __destroy_inode() into a noop.

add_disk() grabs its disk->queue since 523e1d399ce0 ("block: make
gendisk hold a reference to its queue"), so the old ->release comment
is removed in favor of the new inode_detach_wb() comment.

Cc: stable@vger.kernel.org # 4.2+, needs backporting
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
---
 fs/block_dev.c | 9 ++++++---
 1 file changed, 6 insertions(+), 3 deletions(-)

diff --git a/fs/block_dev.c b/fs/block_dev.c
index bb0dfb1c7af1..6f8dd407c533 100644
--- a/fs/block_dev.c
+++ b/fs/block_dev.c
@@ -1509,11 +1509,14 @@ static void __blkdev_put(struct block_device *bdev, fmode_t mode, int for_part)
 		WARN_ON_ONCE(bdev->bd_holders);
 		sync_blockdev(bdev);
 		kill_bdev(bdev);
+
+		bdev_write_inode(bdev);
 		/*
-		 * ->release can cause the queue to disappear, so flush all
-		 * dirty data before.
+		 * Detaching bdev inode from its wb in __destroy_inode()
+		 * is too late: the queue which embeds its bdi (along with
+		 * root wb) can be gone as soon as we put_disk() below.
 		 */
-		bdev_write_inode(bdev);
+		inode_detach_wb(bdev->bd_inode);
 	}
 	if (bdev->bd_contains == bdev) {
 		if (disk->fops->release)
-- 
2.4.3


^ permalink raw reply related	[flat|nested] 12+ messages in thread

* Re: [PATCH] block: detach bdev inode from its wb in __blkdev_put()
  2015-11-20 21:22 [PATCH] block: detach bdev inode from its wb in __blkdev_put() Ilya Dryomov
@ 2015-11-20 21:23 ` Tejun Heo
  2015-11-22 15:02 ` Christoph Hellwig
  2015-11-30  7:20 ` Raghavendra K T
  2 siblings, 0 replies; 12+ messages in thread
From: Tejun Heo @ 2015-11-20 21:23 UTC (permalink / raw)
  To: Ilya Dryomov; +Cc: Alexander Viro, Christoph Hellwig, linux-fsdevel

On Fri, Nov 20, 2015 at 10:22:34PM +0100, Ilya Dryomov wrote:
> Since 52ebea749aae ("writeback: make backing_dev_info host
> cgroup-specific bdi_writebacks") inode, at some point in its lifetime,
> gets attached to a wb (struct bdi_writeback).  Detaching happens on
> evict, in inode_detach_wb() called from __destroy_inode(), and involves
> updating wb.
> 
> However, detaching an internal bdev inode from its wb in
> __destroy_inode() is too late.  Its bdi and by extension root wb are
> embedded into struct request_queue, which has different lifetime rules
> and can be freed long before the final bdput() is called (can be from
> __fput() of a corresponding /dev inode, through dput() - evict() -
> bd_forget().  bdevs hold onto the underlying disk/queue pair only while
> opened; as soon as bdev is closed all bets are off.  In fact,
> disk/queue can be gone before __blkdev_put() even returns:
> 
> 1499 static void __blkdev_put(struct block_device *bdev, fmode_t mode, int for_part)
> 1500 {
> ...
> 1518         if (bdev->bd_contains == bdev) {
> 1519                 if (disk->fops->release)
> 1520                         disk->fops->release(disk, mode);
> 
> [ Driver puts its references to disk/queue ]
> 
> 1521         }
> 1522         if (!bdev->bd_openers) {
> 1523                 struct module *owner = disk->fops->owner;
> 1524
> 1525                 disk_put_part(bdev->bd_part);
> 1526                 bdev->bd_part = NULL;
> 1527                 bdev->bd_disk = NULL;
> 1528                 if (bdev != bdev->bd_contains)
> 1529                         victim = bdev->bd_contains;
> 1530                 bdev->bd_contains = NULL;
> 1531
> 1532                 put_disk(disk);
> 
> [ We put ours, the queue is gone
>   The last bdput() would result in a write to invalid memory ]
> 
> 1533                 module_put(owner);
> ...
> 1539 }
> 
> Since bdev inodes are special anyway, detach them in __blkdev_put()
> after clearing inode's dirty bits, turning the problematic
> inode_detach_wb() in __destroy_inode() into a noop.
> 
> add_disk() grabs its disk->queue since 523e1d399ce0 ("block: make
> gendisk hold a reference to its queue"), so the old ->release comment
> is removed in favor of the new inode_detach_wb() comment.
> 
> Cc: stable@vger.kernel.org # 4.2+, needs backporting
> Signed-off-by: Ilya Dryomov <idryomov@gmail.com>

Acked-by: Tejun Heo <tj@kernel.org>

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [PATCH] block: detach bdev inode from its wb in __blkdev_put()
  2015-11-20 21:22 [PATCH] block: detach bdev inode from its wb in __blkdev_put() Ilya Dryomov
  2015-11-20 21:23 ` Tejun Heo
@ 2015-11-22 15:02 ` Christoph Hellwig
  2015-11-23 15:35   ` Tejun Heo
  2015-11-30  7:20 ` Raghavendra K T
  2 siblings, 1 reply; 12+ messages in thread
From: Christoph Hellwig @ 2015-11-22 15:02 UTC (permalink / raw)
  To: Ilya Dryomov; +Cc: Alexander Viro, Tejun Heo, Christoph Hellwig, linux-fsdevel

The right fix is to never point from the inode to a BDI.  I fixed this
with a lot of effort, and the BDI writeback series put it back a little
later.  We need to revert that damage (not neseccarily literally, but the
effect).

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [PATCH] block: detach bdev inode from its wb in __blkdev_put()
  2015-11-22 15:02 ` Christoph Hellwig
@ 2015-11-23 15:35   ` Tejun Heo
  2015-11-24  7:54     ` Christoph Hellwig
  0 siblings, 1 reply; 12+ messages in thread
From: Tejun Heo @ 2015-11-23 15:35 UTC (permalink / raw)
  To: Christoph Hellwig; +Cc: Ilya Dryomov, Alexander Viro, linux-fsdevel

Hello, Christoph.

On Sun, Nov 22, 2015 at 04:02:16PM +0100, Christoph Hellwig wrote:
> The right fix is to never point from the inode to a BDI.  I fixed this
> with a lot of effort, and the BDI writeback series put it back a little
> later.  We need to revert that damage (not neseccarily literally, but the
> effect).

Can you please explain a bit more why inode pointing to its associated
bdi_writeback is a bad idea?  The only alternative would be recording
a key and looking up each time.  We sure can do that but I don't get
why we would want to.

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [PATCH] block: detach bdev inode from its wb in __blkdev_put()
  2015-11-23 15:35   ` Tejun Heo
@ 2015-11-24  7:54     ` Christoph Hellwig
  2015-11-24 14:00       ` Tejun Heo
  0 siblings, 1 reply; 12+ messages in thread
From: Christoph Hellwig @ 2015-11-24  7:54 UTC (permalink / raw)
  To: Tejun Heo; +Cc: Christoph Hellwig, Ilya Dryomov, Alexander Viro, linux-fsdevel

On Mon, Nov 23, 2015 at 10:35:52AM -0500, Tejun Heo wrote:
> Hello, Christoph.
> 
> On Sun, Nov 22, 2015 at 04:02:16PM +0100, Christoph Hellwig wrote:
> > The right fix is to never point from the inode to a BDI.  I fixed this
> > with a lot of effort, and the BDI writeback series put it back a little
> > later.  We need to revert that damage (not neseccarily literally, but the
> > effect).
> 
> Can you please explain a bit more why inode pointing to its associated
> bdi_writeback is a bad idea?  The only alternative would be recording
> a key and looking up each time.  We sure can do that but I don't get
> why we would want to.

Because the writeback context really is a per-super block concept
(except for the magic block device inodes).  Having another pointer
pointer in the inode (or address_space back then) lead to all kinds
of confusing bugs and life time issues, nevermind massively bloating the
inode for no reason.  But then again bloating the inode has been rather
en vogue lately anyway.

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [PATCH] block: detach bdev inode from its wb in __blkdev_put()
  2015-11-24  7:54     ` Christoph Hellwig
@ 2015-11-24 14:00       ` Tejun Heo
  0 siblings, 0 replies; 12+ messages in thread
From: Tejun Heo @ 2015-11-24 14:00 UTC (permalink / raw)
  To: Christoph Hellwig; +Cc: Ilya Dryomov, Alexander Viro, linux-fsdevel

Hello, Christoph.

On Tue, Nov 24, 2015 at 08:54:57AM +0100, Christoph Hellwig wrote:
> > Can you please explain a bit more why inode pointing to its associated
> > bdi_writeback is a bad idea?  The only alternative would be recording
> > a key and looking up each time.  We sure can do that but I don't get
> > why we would want to.
> 
> Because the writeback context really is a per-super block concept
> (except for the magic block device inodes).  Having another pointer

With cgroup writeback, inodes belonging to the same superblock can
belong to different writeback domains, so it no longer is strictly
per-super block concept.

> pointer in the inode (or address_space back then) lead to all kinds
> of confusing bugs and life time issues, nevermind massively bloating the

The bdev case is different because a bdev's lifetime doesn't coincide
with the actual object it's representing but the root cause there is
lower layer objects being destroyed while higher layer objects
depending on them are still around.  Severing inode -> lower layer
object dependency may be able to solve some issues but those
approaches are usually pretty brittle.  The right thing to do is
letting objects getting torn down in order by maintaining proper refs.

> inode for no reason.  But then again bloating the inode has been rather
> en vogue lately anyway.

How is it for no reason when there are clearly purposes for them?  If
the extra pointer is problematic, we can try to split sb's so that we
can encode writeback domain information by pointing to
per-writeback-domain split sb's but I'm not sure the added complexity
would be justifiable.

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [PATCH] block: detach bdev inode from its wb in __blkdev_put()
  2015-11-20 21:22 [PATCH] block: detach bdev inode from its wb in __blkdev_put() Ilya Dryomov
  2015-11-20 21:23 ` Tejun Heo
  2015-11-22 15:02 ` Christoph Hellwig
@ 2015-11-30  7:20 ` Raghavendra K T
  2015-11-30 10:54   ` Ilya Dryomov
  2 siblings, 1 reply; 12+ messages in thread
From: Raghavendra K T @ 2015-11-30  7:20 UTC (permalink / raw)
  To: Ilya Dryomov; +Cc: Alexander Viro, Tejun Heo, Christoph Hellwig, linux-fsdevel

* Ilya Dryomov <idryomov@gmail.com> [2015-11-20 22:22:34]:

> Since 52ebea749aae ("writeback: make backing_dev_info host
> cgroup-specific bdi_writebacks") inode, at some point in its lifetime,
> gets attached to a wb (struct bdi_writeback).  Detaching happens on
> evict, in inode_detach_wb() called from __destroy_inode(), and involves
> updating wb.
> 
> However, detaching an internal bdev inode from its wb in
> __destroy_inode() is too late.  Its bdi and by extension root wb are
> embedded into struct request_queue, which has different lifetime rules
> and can be freed long before the final bdput() is called (can be from
> __fput() of a corresponding /dev inode, through dput() - evict() -
> bd_forget().  bdevs hold onto the underlying disk/queue pair only while
> opened; as soon as bdev is closed all bets are off.  In fact,
> disk/queue can be gone before __blkdev_put() even returns:
> 
> 1499 static void __blkdev_put(struct block_device *bdev, fmode_t mode, int for_part)
> 1500 {
> ...
> 1518         if (bdev->bd_contains == bdev) {
> 1519                 if (disk->fops->release)
> 1520                         disk->fops->release(disk, mode);
> 
> [ Driver puts its references to disk/queue ]
> 
> 1521         }
> 1522         if (!bdev->bd_openers) {
> 1523                 struct module *owner = disk->fops->owner;
> 1524
> 1525                 disk_put_part(bdev->bd_part);
> 1526                 bdev->bd_part = NULL;
> 1527                 bdev->bd_disk = NULL;
> 1528                 if (bdev != bdev->bd_contains)
> 1529                         victim = bdev->bd_contains;
> 1530                 bdev->bd_contains = NULL;
> 1531
> 1532                 put_disk(disk);
> 
> [ We put ours, the queue is gone
>   The last bdput() would result in a write to invalid memory ]
> 
> 1533                 module_put(owner);
> ...
> 1539 }
> 
> Since bdev inodes are special anyway, detach them in __blkdev_put()
> after clearing inode's dirty bits, turning the problematic
> inode_detach_wb() in __destroy_inode() into a noop.
> 
> add_disk() grabs its disk->queue since 523e1d399ce0 ("block: make
> gendisk hold a reference to its queue"), so the old ->release comment
> is removed in favor of the new inode_detach_wb() comment.
> 
> Cc: stable@vger.kernel.org # 4.2+, needs backporting
> Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
> ---
Feel free to add 
Tested-by: Raghavendra K T <raghavendra.kt@linux.vnet.ibm.com>

I was facing bad memory access problem while creating thousands of containers.
With this patch I am able to create more than 10k containers without hitting
the problem.
I had reported the problem here: https://lkml.org/lkml/2015/11/19/149


^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [PATCH] block: detach bdev inode from its wb in __blkdev_put()
  2015-11-30  7:20 ` Raghavendra K T
@ 2015-11-30 10:54   ` Ilya Dryomov
  2015-12-04 14:07     ` Ilya Dryomov
  0 siblings, 1 reply; 12+ messages in thread
From: Ilya Dryomov @ 2015-11-30 10:54 UTC (permalink / raw)
  To: Raghavendra K T, Jens Axboe, Alexander Viro
  Cc: Tejun Heo, Christoph Hellwig, linux-fsdevel

On Mon, Nov 30, 2015 at 8:20 AM, Raghavendra K T
<raghavendra.kt@linux.vnet.ibm.com> wrote:
> * Ilya Dryomov <idryomov@gmail.com> [2015-11-20 22:22:34]:
>
>> Since 52ebea749aae ("writeback: make backing_dev_info host
>> cgroup-specific bdi_writebacks") inode, at some point in its lifetime,
>> gets attached to a wb (struct bdi_writeback).  Detaching happens on
>> evict, in inode_detach_wb() called from __destroy_inode(), and involves
>> updating wb.
>>
>> However, detaching an internal bdev inode from its wb in
>> __destroy_inode() is too late.  Its bdi and by extension root wb are
>> embedded into struct request_queue, which has different lifetime rules
>> and can be freed long before the final bdput() is called (can be from
>> __fput() of a corresponding /dev inode, through dput() - evict() -
>> bd_forget().  bdevs hold onto the underlying disk/queue pair only while
>> opened; as soon as bdev is closed all bets are off.  In fact,
>> disk/queue can be gone before __blkdev_put() even returns:
>>
>> 1499 static void __blkdev_put(struct block_device *bdev, fmode_t mode, int for_part)
>> 1500 {
>> ...
>> 1518         if (bdev->bd_contains == bdev) {
>> 1519                 if (disk->fops->release)
>> 1520                         disk->fops->release(disk, mode);
>>
>> [ Driver puts its references to disk/queue ]
>>
>> 1521         }
>> 1522         if (!bdev->bd_openers) {
>> 1523                 struct module *owner = disk->fops->owner;
>> 1524
>> 1525                 disk_put_part(bdev->bd_part);
>> 1526                 bdev->bd_part = NULL;
>> 1527                 bdev->bd_disk = NULL;
>> 1528                 if (bdev != bdev->bd_contains)
>> 1529                         victim = bdev->bd_contains;
>> 1530                 bdev->bd_contains = NULL;
>> 1531
>> 1532                 put_disk(disk);
>>
>> [ We put ours, the queue is gone
>>   The last bdput() would result in a write to invalid memory ]
>>
>> 1533                 module_put(owner);
>> ...
>> 1539 }
>>
>> Since bdev inodes are special anyway, detach them in __blkdev_put()
>> after clearing inode's dirty bits, turning the problematic
>> inode_detach_wb() in __destroy_inode() into a noop.
>>
>> add_disk() grabs its disk->queue since 523e1d399ce0 ("block: make
>> gendisk hold a reference to its queue"), so the old ->release comment
>> is removed in favor of the new inode_detach_wb() comment.
>>
>> Cc: stable@vger.kernel.org # 4.2+, needs backporting
>> Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
>> ---
> Feel free to add
> Tested-by: Raghavendra K T <raghavendra.kt@linux.vnet.ibm.com>
>
> I was facing bad memory access problem while creating thousands of containers.
> With this patch I am able to create more than 10k containers without hitting
> the problem.
> I had reported the problem here: https://lkml.org/lkml/2015/11/19/149

Great!  Christoph's concern is with ->i_wb as a whole, not this
particular patch - Al, this one is marked for stable, can we get it
merged into -rc4?  Or should it go through Jens' tree, as cgroup
writeback patches did?

Thanks,

                Ilya

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [PATCH] block: detach bdev inode from its wb in __blkdev_put()
  2015-11-30 10:54   ` Ilya Dryomov
@ 2015-12-04 14:07     ` Ilya Dryomov
  2015-12-04 17:44       ` Jens Axboe
  0 siblings, 1 reply; 12+ messages in thread
From: Ilya Dryomov @ 2015-12-04 14:07 UTC (permalink / raw)
  To: Jens Axboe, Alexander Viro
  Cc: Tejun Heo, Christoph Hellwig, linux-fsdevel, Raghavendra K T

On Mon, Nov 30, 2015 at 11:54 AM, Ilya Dryomov <idryomov@gmail.com> wrote:
> On Mon, Nov 30, 2015 at 8:20 AM, Raghavendra K T
> <raghavendra.kt@linux.vnet.ibm.com> wrote:
>> * Ilya Dryomov <idryomov@gmail.com> [2015-11-20 22:22:34]:
>>
>>> Since 52ebea749aae ("writeback: make backing_dev_info host
>>> cgroup-specific bdi_writebacks") inode, at some point in its lifetime,
>>> gets attached to a wb (struct bdi_writeback).  Detaching happens on
>>> evict, in inode_detach_wb() called from __destroy_inode(), and involves
>>> updating wb.
>>>
>>> However, detaching an internal bdev inode from its wb in
>>> __destroy_inode() is too late.  Its bdi and by extension root wb are
>>> embedded into struct request_queue, which has different lifetime rules
>>> and can be freed long before the final bdput() is called (can be from
>>> __fput() of a corresponding /dev inode, through dput() - evict() -
>>> bd_forget().  bdevs hold onto the underlying disk/queue pair only while
>>> opened; as soon as bdev is closed all bets are off.  In fact,
>>> disk/queue can be gone before __blkdev_put() even returns:
>>>
>>> 1499 static void __blkdev_put(struct block_device *bdev, fmode_t mode, int for_part)
>>> 1500 {
>>> ...
>>> 1518         if (bdev->bd_contains == bdev) {
>>> 1519                 if (disk->fops->release)
>>> 1520                         disk->fops->release(disk, mode);
>>>
>>> [ Driver puts its references to disk/queue ]
>>>
>>> 1521         }
>>> 1522         if (!bdev->bd_openers) {
>>> 1523                 struct module *owner = disk->fops->owner;
>>> 1524
>>> 1525                 disk_put_part(bdev->bd_part);
>>> 1526                 bdev->bd_part = NULL;
>>> 1527                 bdev->bd_disk = NULL;
>>> 1528                 if (bdev != bdev->bd_contains)
>>> 1529                         victim = bdev->bd_contains;
>>> 1530                 bdev->bd_contains = NULL;
>>> 1531
>>> 1532                 put_disk(disk);
>>>
>>> [ We put ours, the queue is gone
>>>   The last bdput() would result in a write to invalid memory ]
>>>
>>> 1533                 module_put(owner);
>>> ...
>>> 1539 }
>>>
>>> Since bdev inodes are special anyway, detach them in __blkdev_put()
>>> after clearing inode's dirty bits, turning the problematic
>>> inode_detach_wb() in __destroy_inode() into a noop.
>>>
>>> add_disk() grabs its disk->queue since 523e1d399ce0 ("block: make
>>> gendisk hold a reference to its queue"), so the old ->release comment
>>> is removed in favor of the new inode_detach_wb() comment.
>>>
>>> Cc: stable@vger.kernel.org # 4.2+, needs backporting
>>> Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
>>> ---
>> Feel free to add
>> Tested-by: Raghavendra K T <raghavendra.kt@linux.vnet.ibm.com>
>>
>> I was facing bad memory access problem while creating thousands of containers.
>> With this patch I am able to create more than 10k containers without hitting
>> the problem.
>> I had reported the problem here: https://lkml.org/lkml/2015/11/19/149
>
> Great!  Christoph's concern is with ->i_wb as a whole, not this
> particular patch - Al, this one is marked for stable, can we get it
> merged into -rc4?  Or should it go through Jens' tree, as cgroup
> writeback patches did?

Ping?

Thanks,

                Ilya

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [PATCH] block: detach bdev inode from its wb in __blkdev_put()
  2015-12-04 14:07     ` Ilya Dryomov
@ 2015-12-04 17:44       ` Jens Axboe
  2015-12-04 17:55         ` Ilya Dryomov
  0 siblings, 1 reply; 12+ messages in thread
From: Jens Axboe @ 2015-12-04 17:44 UTC (permalink / raw)
  To: Ilya Dryomov, Alexander Viro
  Cc: Tejun Heo, Christoph Hellwig, linux-fsdevel, Raghavendra K T

On 12/04/2015 07:07 AM, Ilya Dryomov wrote:
> On Mon, Nov 30, 2015 at 11:54 AM, Ilya Dryomov <idryomov@gmail.com> wrote:
>> On Mon, Nov 30, 2015 at 8:20 AM, Raghavendra K T
>> <raghavendra.kt@linux.vnet.ibm.com> wrote:
>>> * Ilya Dryomov <idryomov@gmail.com> [2015-11-20 22:22:34]:
>>>
>>>> Since 52ebea749aae ("writeback: make backing_dev_info host
>>>> cgroup-specific bdi_writebacks") inode, at some point in its lifetime,
>>>> gets attached to a wb (struct bdi_writeback).  Detaching happens on
>>>> evict, in inode_detach_wb() called from __destroy_inode(), and involves
>>>> updating wb.
>>>>
>>>> However, detaching an internal bdev inode from its wb in
>>>> __destroy_inode() is too late.  Its bdi and by extension root wb are
>>>> embedded into struct request_queue, which has different lifetime rules
>>>> and can be freed long before the final bdput() is called (can be from
>>>> __fput() of a corresponding /dev inode, through dput() - evict() -
>>>> bd_forget().  bdevs hold onto the underlying disk/queue pair only while
>>>> opened; as soon as bdev is closed all bets are off.  In fact,
>>>> disk/queue can be gone before __blkdev_put() even returns:
>>>>
>>>> 1499 static void __blkdev_put(struct block_device *bdev, fmode_t mode, int for_part)
>>>> 1500 {
>>>> ...
>>>> 1518         if (bdev->bd_contains == bdev) {
>>>> 1519                 if (disk->fops->release)
>>>> 1520                         disk->fops->release(disk, mode);
>>>>
>>>> [ Driver puts its references to disk/queue ]
>>>>
>>>> 1521         }
>>>> 1522         if (!bdev->bd_openers) {
>>>> 1523                 struct module *owner = disk->fops->owner;
>>>> 1524
>>>> 1525                 disk_put_part(bdev->bd_part);
>>>> 1526                 bdev->bd_part = NULL;
>>>> 1527                 bdev->bd_disk = NULL;
>>>> 1528                 if (bdev != bdev->bd_contains)
>>>> 1529                         victim = bdev->bd_contains;
>>>> 1530                 bdev->bd_contains = NULL;
>>>> 1531
>>>> 1532                 put_disk(disk);
>>>>
>>>> [ We put ours, the queue is gone
>>>>    The last bdput() would result in a write to invalid memory ]
>>>>
>>>> 1533                 module_put(owner);
>>>> ...
>>>> 1539 }
>>>>
>>>> Since bdev inodes are special anyway, detach them in __blkdev_put()
>>>> after clearing inode's dirty bits, turning the problematic
>>>> inode_detach_wb() in __destroy_inode() into a noop.
>>>>
>>>> add_disk() grabs its disk->queue since 523e1d399ce0 ("block: make
>>>> gendisk hold a reference to its queue"), so the old ->release comment
>>>> is removed in favor of the new inode_detach_wb() comment.
>>>>
>>>> Cc: stable@vger.kernel.org # 4.2+, needs backporting
>>>> Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
>>>> ---
>>> Feel free to add
>>> Tested-by: Raghavendra K T <raghavendra.kt@linux.vnet.ibm.com>
>>>
>>> I was facing bad memory access problem while creating thousands of containers.
>>> With this patch I am able to create more than 10k containers without hitting
>>> the problem.
>>> I had reported the problem here: https://lkml.org/lkml/2015/11/19/149
>>
>> Great!  Christoph's concern is with ->i_wb as a whole, not this
>> particular patch - Al, this one is marked for stable, can we get it
>> merged into -rc4?  Or should it go through Jens' tree, as cgroup
>> writeback patches did?
>
> Ping?

Was holding off, but I think we should get the simple fix in for 4.4. 
I've applied it for this series.

-- 
Jens Axboe


^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [PATCH] block: detach bdev inode from its wb in __blkdev_put()
  2015-12-04 17:44       ` Jens Axboe
@ 2015-12-04 17:55         ` Ilya Dryomov
  2015-12-04 18:02           ` Jens Axboe
  0 siblings, 1 reply; 12+ messages in thread
From: Ilya Dryomov @ 2015-12-04 17:55 UTC (permalink / raw)
  To: Jens Axboe
  Cc: Alexander Viro, Tejun Heo, Christoph Hellwig, linux-fsdevel,
	Raghavendra K T

On Fri, Dec 4, 2015 at 6:44 PM, Jens Axboe <axboe@kernel.dk> wrote:
> On 12/04/2015 07:07 AM, Ilya Dryomov wrote:
>>
>> On Mon, Nov 30, 2015 at 11:54 AM, Ilya Dryomov <idryomov@gmail.com> wrote:
>>>
>>> On Mon, Nov 30, 2015 at 8:20 AM, Raghavendra K T
>>> <raghavendra.kt@linux.vnet.ibm.com> wrote:
>>>>
>>>> * Ilya Dryomov <idryomov@gmail.com> [2015-11-20 22:22:34]:
>>>>
>>>>> Since 52ebea749aae ("writeback: make backing_dev_info host
>>>>> cgroup-specific bdi_writebacks") inode, at some point in its lifetime,
>>>>> gets attached to a wb (struct bdi_writeback).  Detaching happens on
>>>>> evict, in inode_detach_wb() called from __destroy_inode(), and involves
>>>>> updating wb.
>>>>>
>>>>> However, detaching an internal bdev inode from its wb in
>>>>> __destroy_inode() is too late.  Its bdi and by extension root wb are
>>>>> embedded into struct request_queue, which has different lifetime rules
>>>>> and can be freed long before the final bdput() is called (can be from
>>>>> __fput() of a corresponding /dev inode, through dput() - evict() -
>>>>> bd_forget().  bdevs hold onto the underlying disk/queue pair only while
>>>>> opened; as soon as bdev is closed all bets are off.  In fact,
>>>>> disk/queue can be gone before __blkdev_put() even returns:
>>>>>
>>>>> 1499 static void __blkdev_put(struct block_device *bdev, fmode_t mode,
>>>>> int for_part)
>>>>> 1500 {
>>>>> ...
>>>>> 1518         if (bdev->bd_contains == bdev) {
>>>>> 1519                 if (disk->fops->release)
>>>>> 1520                         disk->fops->release(disk, mode);
>>>>>
>>>>> [ Driver puts its references to disk/queue ]
>>>>>
>>>>> 1521         }
>>>>> 1522         if (!bdev->bd_openers) {
>>>>> 1523                 struct module *owner = disk->fops->owner;
>>>>> 1524
>>>>> 1525                 disk_put_part(bdev->bd_part);
>>>>> 1526                 bdev->bd_part = NULL;
>>>>> 1527                 bdev->bd_disk = NULL;
>>>>> 1528                 if (bdev != bdev->bd_contains)
>>>>> 1529                         victim = bdev->bd_contains;
>>>>> 1530                 bdev->bd_contains = NULL;
>>>>> 1531
>>>>> 1532                 put_disk(disk);
>>>>>
>>>>> [ We put ours, the queue is gone
>>>>>    The last bdput() would result in a write to invalid memory ]
>>>>>
>>>>> 1533                 module_put(owner);
>>>>> ...
>>>>> 1539 }
>>>>>
>>>>> Since bdev inodes are special anyway, detach them in __blkdev_put()
>>>>> after clearing inode's dirty bits, turning the problematic
>>>>> inode_detach_wb() in __destroy_inode() into a noop.
>>>>>
>>>>> add_disk() grabs its disk->queue since 523e1d399ce0 ("block: make
>>>>> gendisk hold a reference to its queue"), so the old ->release comment
>>>>> is removed in favor of the new inode_detach_wb() comment.
>>>>>
>>>>> Cc: stable@vger.kernel.org # 4.2+, needs backporting
>>>>> Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
>>>>> ---
>>>>
>>>> Feel free to add
>>>> Tested-by: Raghavendra K T <raghavendra.kt@linux.vnet.ibm.com>
>>>>
>>>> I was facing bad memory access problem while creating thousands of
>>>> containers.
>>>> With this patch I am able to create more than 10k containers without
>>>> hitting
>>>> the problem.
>>>> I had reported the problem here: https://lkml.org/lkml/2015/11/19/149
>>>
>>>
>>> Great!  Christoph's concern is with ->i_wb as a whole, not this
>>> particular patch - Al, this one is marked for stable, can we get it
>>> merged into -rc4?  Or should it go through Jens' tree, as cgroup
>>> writeback patches did?
>>
>>
>> Ping?
>
>
> Was holding off, but I think we should get the simple fix in for 4.4. I've
> applied it for this series.

I think you missed Raghavendra's Tested-by (I'm looking at
https://git.kernel.org/cgit/linux/kernel/git/axboe/linux-block.git/commit/?h=for-linus&id=f2148d49930497c66d38b13f137c30d78c60da3d).

Thanks,

                Ilya

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [PATCH] block: detach bdev inode from its wb in __blkdev_put()
  2015-12-04 17:55         ` Ilya Dryomov
@ 2015-12-04 18:02           ` Jens Axboe
  0 siblings, 0 replies; 12+ messages in thread
From: Jens Axboe @ 2015-12-04 18:02 UTC (permalink / raw)
  To: Ilya Dryomov
  Cc: Alexander Viro, Tejun Heo, Christoph Hellwig, linux-fsdevel,
	Raghavendra K T

On 12/04/2015 10:55 AM, Ilya Dryomov wrote:
> On Fri, Dec 4, 2015 at 6:44 PM, Jens Axboe <axboe@kernel.dk> wrote:
>> On 12/04/2015 07:07 AM, Ilya Dryomov wrote:
>>>
>>> On Mon, Nov 30, 2015 at 11:54 AM, Ilya Dryomov <idryomov@gmail.com> wrote:
>>>>
>>>> On Mon, Nov 30, 2015 at 8:20 AM, Raghavendra K T
>>>> <raghavendra.kt@linux.vnet.ibm.com> wrote:
>>>>>
>>>>> * Ilya Dryomov <idryomov@gmail.com> [2015-11-20 22:22:34]:
>>>>>
>>>>>> Since 52ebea749aae ("writeback: make backing_dev_info host
>>>>>> cgroup-specific bdi_writebacks") inode, at some point in its lifetime,
>>>>>> gets attached to a wb (struct bdi_writeback).  Detaching happens on
>>>>>> evict, in inode_detach_wb() called from __destroy_inode(), and involves
>>>>>> updating wb.
>>>>>>
>>>>>> However, detaching an internal bdev inode from its wb in
>>>>>> __destroy_inode() is too late.  Its bdi and by extension root wb are
>>>>>> embedded into struct request_queue, which has different lifetime rules
>>>>>> and can be freed long before the final bdput() is called (can be from
>>>>>> __fput() of a corresponding /dev inode, through dput() - evict() -
>>>>>> bd_forget().  bdevs hold onto the underlying disk/queue pair only while
>>>>>> opened; as soon as bdev is closed all bets are off.  In fact,
>>>>>> disk/queue can be gone before __blkdev_put() even returns:
>>>>>>
>>>>>> 1499 static void __blkdev_put(struct block_device *bdev, fmode_t mode,
>>>>>> int for_part)
>>>>>> 1500 {
>>>>>> ...
>>>>>> 1518         if (bdev->bd_contains == bdev) {
>>>>>> 1519                 if (disk->fops->release)
>>>>>> 1520                         disk->fops->release(disk, mode);
>>>>>>
>>>>>> [ Driver puts its references to disk/queue ]
>>>>>>
>>>>>> 1521         }
>>>>>> 1522         if (!bdev->bd_openers) {
>>>>>> 1523                 struct module *owner = disk->fops->owner;
>>>>>> 1524
>>>>>> 1525                 disk_put_part(bdev->bd_part);
>>>>>> 1526                 bdev->bd_part = NULL;
>>>>>> 1527                 bdev->bd_disk = NULL;
>>>>>> 1528                 if (bdev != bdev->bd_contains)
>>>>>> 1529                         victim = bdev->bd_contains;
>>>>>> 1530                 bdev->bd_contains = NULL;
>>>>>> 1531
>>>>>> 1532                 put_disk(disk);
>>>>>>
>>>>>> [ We put ours, the queue is gone
>>>>>>     The last bdput() would result in a write to invalid memory ]
>>>>>>
>>>>>> 1533                 module_put(owner);
>>>>>> ...
>>>>>> 1539 }
>>>>>>
>>>>>> Since bdev inodes are special anyway, detach them in __blkdev_put()
>>>>>> after clearing inode's dirty bits, turning the problematic
>>>>>> inode_detach_wb() in __destroy_inode() into a noop.
>>>>>>
>>>>>> add_disk() grabs its disk->queue since 523e1d399ce0 ("block: make
>>>>>> gendisk hold a reference to its queue"), so the old ->release comment
>>>>>> is removed in favor of the new inode_detach_wb() comment.
>>>>>>
>>>>>> Cc: stable@vger.kernel.org # 4.2+, needs backporting
>>>>>> Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
>>>>>> ---
>>>>>
>>>>> Feel free to add
>>>>> Tested-by: Raghavendra K T <raghavendra.kt@linux.vnet.ibm.com>
>>>>>
>>>>> I was facing bad memory access problem while creating thousands of
>>>>> containers.
>>>>> With this patch I am able to create more than 10k containers without
>>>>> hitting
>>>>> the problem.
>>>>> I had reported the problem here: https://lkml.org/lkml/2015/11/19/149
>>>>
>>>>
>>>> Great!  Christoph's concern is with ->i_wb as a whole, not this
>>>> particular patch - Al, this one is marked for stable, can we get it
>>>> merged into -rc4?  Or should it go through Jens' tree, as cgroup
>>>> writeback patches did?
>>>
>>>
>>> Ping?
>>
>>
>> Was holding off, but I think we should get the simple fix in for 4.4. I've
>> applied it for this series.
>
> I think you missed Raghavendra's Tested-by (I'm looking at
> https://git.kernel.org/cgit/linux/kernel/git/axboe/linux-block.git/commit/?h=for-linus&id=f2148d49930497c66d38b13f137c30d78c60da3d).

I did, fixed.

-- 
Jens Axboe


^ permalink raw reply	[flat|nested] 12+ messages in thread

end of thread, other threads:[~2015-12-04 18:02 UTC | newest]

Thread overview: 12+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2015-11-20 21:22 [PATCH] block: detach bdev inode from its wb in __blkdev_put() Ilya Dryomov
2015-11-20 21:23 ` Tejun Heo
2015-11-22 15:02 ` Christoph Hellwig
2015-11-23 15:35   ` Tejun Heo
2015-11-24  7:54     ` Christoph Hellwig
2015-11-24 14:00       ` Tejun Heo
2015-11-30  7:20 ` Raghavendra K T
2015-11-30 10:54   ` Ilya Dryomov
2015-12-04 14:07     ` Ilya Dryomov
2015-12-04 17:44       ` Jens Axboe
2015-12-04 17:55         ` Ilya Dryomov
2015-12-04 18:02           ` Jens Axboe

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.