[PATCH] mm: Fix a NULL dereference crash while accessing bdev->bd

All of lore.kernel.org
 help / color / mirror / Atom feed

* [PATCH] mm: Fix a NULL dereference crash while accessing bdev->bd_disk
@ 2016-11-26  2:06 ` Wei Fang
  0 siblings, 0 replies; 17+ messages in thread
From: Wei Fang @ 2016-11-26  2:06 UTC (permalink / raw)
  To: akpm; +Cc: jack, hannes, hch, linux-mm, Wei Fang, stable

->bd_disk is assigned to NULL in __blkdev_put() when no one is holding
the bdev. After that, ->bd_inode still can be touched in the
blockdev_superblock->s_inodes list before the final iput. So iterate_bdevs()
can still get this inode, and start writeback on mapping dirty pages.
->bd_disk will be dereferenced in mapping_cap_writeback_dirty() in this
case, and a NULL dereference crash will be triggered:

Unable to handle kernel NULL pointer dereference at virtual address 00000388
...
[<ffff8000004cb1e4>] blk_get_backing_dev_info+0x1c/0x28
[<ffff8000001c879c>] __filemap_fdatawrite_range+0x54/0x98
[<ffff8000001c8804>] filemap_fdatawrite+0x24/0x2c
[<ffff80000027e7a4>] fdatawrite_one_bdev+0x20/0x28
[<ffff800000288b44>] iterate_bdevs+0xec/0x144
[<ffff80000027eb50>] sys_sync+0x84/0xd0

Since mapping_cap_writeback_dirty() is always return true about
block device inodes, no need to check it if the inode is a block
device inode.

Cc: stable@vger.kernel.org
Signed-off-by: Wei Fang <fangwei1@huawei.com>
---
 mm/filemap.c | 5 +++--
 1 file changed, 3 insertions(+), 2 deletions(-)

diff --git a/mm/filemap.c b/mm/filemap.c
index 235021e..d607677 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -334,8 +334,9 @@ int __filemap_fdatawrite_range(struct address_space *mapping, loff_t start,
 		.range_end = end,
 	};
 
-	if (!mapping_cap_writeback_dirty(mapping))
-		return 0;
+	if (!sb_is_blkdev_sb(mapping->host->i_sb))
+		if (!mapping_cap_writeback_dirty(mapping))
+			return 0;
 
 	wbc_attach_fdatawrite_inode(&wbc, mapping->host);
 	ret = do_writepages(mapping, &wbc);
-- 
2.4.11


^ permalink raw reply related	[flat|nested] 17+ messages in thread

* [PATCH] mm: Fix a NULL dereference crash while accessing bdev->bd_disk
@ 2016-11-26  2:06 ` Wei Fang
  0 siblings, 0 replies; 17+ messages in thread
From: Wei Fang @ 2016-11-26  2:06 UTC (permalink / raw)
  To: akpm; +Cc: jack, hannes, hch, linux-mm, Wei Fang, stable

->bd_disk is assigned to NULL in __blkdev_put() when no one is holding
the bdev. After that, ->bd_inode still can be touched in the
blockdev_superblock->s_inodes list before the final iput. So iterate_bdevs()
can still get this inode, and start writeback on mapping dirty pages.
->bd_disk will be dereferenced in mapping_cap_writeback_dirty() in this
case, and a NULL dereference crash will be triggered:

Unable to handle kernel NULL pointer dereference at virtual address 00000388
...
[<ffff8000004cb1e4>] blk_get_backing_dev_info+0x1c/0x28
[<ffff8000001c879c>] __filemap_fdatawrite_range+0x54/0x98
[<ffff8000001c8804>] filemap_fdatawrite+0x24/0x2c
[<ffff80000027e7a4>] fdatawrite_one_bdev+0x20/0x28
[<ffff800000288b44>] iterate_bdevs+0xec/0x144
[<ffff80000027eb50>] sys_sync+0x84/0xd0

Since mapping_cap_writeback_dirty() is always return true about
block device inodes, no need to check it if the inode is a block
device inode.

Cc: stable@vger.kernel.org
Signed-off-by: Wei Fang <fangwei1@huawei.com>
---
 mm/filemap.c | 5 +++--
 1 file changed, 3 insertions(+), 2 deletions(-)

diff --git a/mm/filemap.c b/mm/filemap.c
index 235021e..d607677 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -334,8 +334,9 @@ int __filemap_fdatawrite_range(struct address_space *mapping, loff_t start,
 		.range_end = end,
 	};
 
-	if (!mapping_cap_writeback_dirty(mapping))
-		return 0;
+	if (!sb_is_blkdev_sb(mapping->host->i_sb))
+		if (!mapping_cap_writeback_dirty(mapping))
+			return 0;
 
 	wbc_attach_fdatawrite_inode(&wbc, mapping->host);
 	ret = do_writepages(mapping, &wbc);
-- 
2.4.11

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply related	[flat|nested] 17+ messages in thread

* Re: [PATCH] mm: Fix a NULL dereference crash while accessing bdev->bd_disk
  2016-11-26  2:06 ` Wei Fang
  (?)
@ 2016-11-28 10:07 ` Jan Kara
  2016-11-28 15:57   ` Tejun Heo
  2016-11-29  1:58     ` Wei Fang
  -1 siblings, 2 replies; 17+ messages in thread
From: Jan Kara @ 2016-11-28 10:07 UTC (permalink / raw)
  To: Wei Fang; +Cc: akpm, jack, hannes, hch, linux-mm, stable, Jens Axboe, Tejun Heo

On Sat 26-11-16 10:06:22, Wei Fang wrote:
> ->bd_disk is assigned to NULL in __blkdev_put() when no one is holding
> the bdev. After that, ->bd_inode still can be touched in the
> blockdev_superblock->s_inodes list before the final iput. So iterate_bdevs()
> can still get this inode, and start writeback on mapping dirty pages.
> ->bd_disk will be dereferenced in mapping_cap_writeback_dirty() in this
> case, and a NULL dereference crash will be triggered:
> 
> Unable to handle kernel NULL pointer dereference at virtual address 00000388
> ...
> [<ffff8000004cb1e4>] blk_get_backing_dev_info+0x1c/0x28
> [<ffff8000001c879c>] __filemap_fdatawrite_range+0x54/0x98
> [<ffff8000001c8804>] filemap_fdatawrite+0x24/0x2c
> [<ffff80000027e7a4>] fdatawrite_one_bdev+0x20/0x28
> [<ffff800000288b44>] iterate_bdevs+0xec/0x144
> [<ffff80000027eb50>] sys_sync+0x84/0xd0
> 
> Since mapping_cap_writeback_dirty() is always return true about
> block device inodes, no need to check it if the inode is a block
> device inode.
> 
> Cc: stable@vger.kernel.org
> Signed-off-by: Wei Fang <fangwei1@huawei.com>

Good catch but I don't like sprinkling checks like this into the writeback
code and furthermore we don't want to call into writeback code when block
device is in the process of being destroyed which is what would happen with
your patch. That is a bug waiting to happen...

As I'm looking into the code, we need a serialization between bdev writeback
and blkdev_put(). That should be doable if we use writeback_single_inode()
for writing bdev inode instead of simple filemap_fdatawrite() and then use
inode_wait_for_writeback() in blkdev_put() but it needs some careful
thought.

Frankly that whole idea of tearing block devices down on last close is a
major headache and keeps biting us. I'm wondering whether it is still worth
it these days...

								Honza

> ---
>  mm/filemap.c | 5 +++--
>  1 file changed, 3 insertions(+), 2 deletions(-)
> 
> diff --git a/mm/filemap.c b/mm/filemap.c
> index 235021e..d607677 100644
> --- a/mm/filemap.c
> +++ b/mm/filemap.c
> @@ -334,8 +334,9 @@ int __filemap_fdatawrite_range(struct address_space *mapping, loff_t start,
>  		.range_end = end,
>  	};
>  
> -	if (!mapping_cap_writeback_dirty(mapping))
> -		return 0;
> +	if (!sb_is_blkdev_sb(mapping->host->i_sb))
> +		if (!mapping_cap_writeback_dirty(mapping))
> +			return 0;
>  
>  	wbc_attach_fdatawrite_inode(&wbc, mapping->host);
>  	ret = do_writepages(mapping, &wbc);
> -- 
> 2.4.11
> 
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH] mm: Fix a NULL dereference crash while accessing bdev->bd_disk
  2016-11-28 10:07 ` Jan Kara
@ 2016-11-28 15:57   ` Tejun Heo
  2016-11-29  9:30     ` Jan Kara
  2016-11-29  1:58     ` Wei Fang
  1 sibling, 1 reply; 17+ messages in thread
From: Tejun Heo @ 2016-11-28 15:57 UTC (permalink / raw)
  To: Jan Kara; +Cc: Wei Fang, akpm, hannes, hch, linux-mm, stable, Jens Axboe

Hello, Jan.

On Mon, Nov 28, 2016 at 11:07:18AM +0100, Jan Kara wrote:
> As I'm looking into the code, we need a serialization between bdev writeback
> and blkdev_put(). That should be doable if we use writeback_single_inode()
> for writing bdev inode instead of simple filemap_fdatawrite() and then use
> inode_wait_for_writeback() in blkdev_put() but it needs some careful
> thought.

It's kinda weird that sync() is ends up accessing bdev's without any
synchronization.  Can't we just make iterate_bdevs() grab bd_mutex and
verify bd_disk isn't NULL before calling into the callback?

> Frankly that whole idea of tearing block devices down on last close is a
> major headache and keeps biting us. I'm wondering whether it is still worth
> it these days...

Yeah, it'd be great if we can follow a more conventional lifetime
pattern here.

Thanks.

-- 
tejun

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH] mm: Fix a NULL dereference crash while accessing bdev->bd_disk
  2016-11-28 10:07 ` Jan Kara
@ 2016-11-29  1:58     ` Wei Fang
  2016-11-29  1:58     ` Wei Fang
  1 sibling, 0 replies; 17+ messages in thread
From: Wei Fang @ 2016-11-29  1:58 UTC (permalink / raw)
  To: Jan Kara; +Cc: akpm, hannes, hch, linux-mm, stable, Jens Axboe, Tejun Heo

Hi, Jan,

On 2016/11/28 18:07, Jan Kara wrote:
> Good catch but I don't like sprinkling checks like this into the writeback
> code and furthermore we don't want to call into writeback code when block
> device is in the process of being destroyed which is what would happen with
> your patch. That is a bug waiting to happen...

Agreed. Need another way to fix this problem. I looked through the
writeback cgroup code in __filemap_fdatawrite_range(), found if we
turn on CONFIG_CGROUP_WRITEBACK, a new crash will happen.

Thanks,
Wei

> As I'm looking into the code, we need a serialization between bdev writeback
> and blkdev_put(). That should be doable if we use writeback_single_inode()
> for writing bdev inode instead of simple filemap_fdatawrite() and then use
> inode_wait_for_writeback() in blkdev_put() but it needs some careful
> thought.
> 
> Frankly that whole idea of tearing block devices down on last close is a
> major headache and keeps biting us. I'm wondering whether it is still worth
> it these days...
> 
> 								Honza
> 
>> ---
>>  mm/filemap.c | 5 +++--
>>  1 file changed, 3 insertions(+), 2 deletions(-)
>>
>> diff --git a/mm/filemap.c b/mm/filemap.c
>> index 235021e..d607677 100644
>> --- a/mm/filemap.c
>> +++ b/mm/filemap.c
>> @@ -334,8 +334,9 @@ int __filemap_fdatawrite_range(struct address_space *mapping, loff_t start,
>>  		.range_end = end,
>>  	};
>>  
>> -	if (!mapping_cap_writeback_dirty(mapping))
>> -		return 0;
>> +	if (!sb_is_blkdev_sb(mapping->host->i_sb))
>> +		if (!mapping_cap_writeback_dirty(mapping))
>> +			return 0;
>>  
>>  	wbc_attach_fdatawrite_inode(&wbc, mapping->host);
>>  	ret = do_writepages(mapping, &wbc);
>> -- 
>> 2.4.11
>>

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH] mm: Fix a NULL dereference crash while accessing bdev->bd_disk
@ 2016-11-29  1:58     ` Wei Fang
  0 siblings, 0 replies; 17+ messages in thread
From: Wei Fang @ 2016-11-29  1:58 UTC (permalink / raw)
  To: Jan Kara; +Cc: akpm, hannes, hch, linux-mm, stable, Jens Axboe, Tejun Heo

Hi, Jan,

On 2016/11/28 18:07, Jan Kara wrote:
> Good catch but I don't like sprinkling checks like this into the writeback
> code and furthermore we don't want to call into writeback code when block
> device is in the process of being destroyed which is what would happen with
> your patch. That is a bug waiting to happen...

Agreed. Need another way to fix this problem. I looked through the
writeback cgroup code in __filemap_fdatawrite_range(), found if we
turn on CONFIG_CGROUP_WRITEBACK, a new crash will happen.

Thanks,
Wei

> As I'm looking into the code, we need a serialization between bdev writeback
> and blkdev_put(). That should be doable if we use writeback_single_inode()
> for writing bdev inode instead of simple filemap_fdatawrite() and then use
> inode_wait_for_writeback() in blkdev_put() but it needs some careful
> thought.
> 
> Frankly that whole idea of tearing block devices down on last close is a
> major headache and keeps biting us. I'm wondering whether it is still worth
> it these days...
> 
> 								Honza
> 
>> ---
>>  mm/filemap.c | 5 +++--
>>  1 file changed, 3 insertions(+), 2 deletions(-)
>>
>> diff --git a/mm/filemap.c b/mm/filemap.c
>> index 235021e..d607677 100644
>> --- a/mm/filemap.c
>> +++ b/mm/filemap.c
>> @@ -334,8 +334,9 @@ int __filemap_fdatawrite_range(struct address_space *mapping, loff_t start,
>>  		.range_end = end,
>>  	};
>>  
>> -	if (!mapping_cap_writeback_dirty(mapping))
>> -		return 0;
>> +	if (!sb_is_blkdev_sb(mapping->host->i_sb))
>> +		if (!mapping_cap_writeback_dirty(mapping))
>> +			return 0;
>>  
>>  	wbc_attach_fdatawrite_inode(&wbc, mapping->host);
>>  	ret = do_writepages(mapping, &wbc);
>> -- 
>> 2.4.11
>>

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH] mm: Fix a NULL dereference crash while accessing bdev->bd_disk
  2016-11-28 15:57   ` Tejun Heo
@ 2016-11-29  9:30     ` Jan Kara
  2016-11-29 16:43       ` Tejun Heo
  0 siblings, 1 reply; 17+ messages in thread
From: Jan Kara @ 2016-11-29  9:30 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Jan Kara, Wei Fang, akpm, hannes, hch, linux-mm, stable, Jens Axboe

On Mon 28-11-16 10:57:18, Tejun Heo wrote:
> Hello, Jan.
> 
> On Mon, Nov 28, 2016 at 11:07:18AM +0100, Jan Kara wrote:
> > As I'm looking into the code, we need a serialization between bdev writeback
> > and blkdev_put(). That should be doable if we use writeback_single_inode()
> > for writing bdev inode instead of simple filemap_fdatawrite() and then use
> > inode_wait_for_writeback() in blkdev_put() but it needs some careful
> > thought.
> 
> It's kinda weird that sync() is ends up accessing bdev's without any
> synchronization.  Can't we just make iterate_bdevs() grab bd_mutex and
> verify bd_disk isn't NULL before calling into the callback?

This reminded me I've already seen something like this and indeed I've
already had a very similar discussion in March -
https://patchwork.kernel.org/patch/8556941/

Holding bd_mutex in iterate_devs() works but still nothing protects from
flusher thread just walking across the block device inode and trying to
write it which would result in the very same oops.

								Honza

-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH] mm: Fix a NULL dereference crash while accessing bdev->bd_disk
  2016-11-29  9:30     ` Jan Kara
@ 2016-11-29 16:43       ` Tejun Heo
  2016-11-30  9:50         ` Jan Kara
  0 siblings, 1 reply; 17+ messages in thread
From: Tejun Heo @ 2016-11-29 16:43 UTC (permalink / raw)
  To: Jan Kara; +Cc: Wei Fang, akpm, hannes, hch, linux-mm, stable, Jens Axboe

Hello, Jan.

On Tue, Nov 29, 2016 at 10:30:35AM +0100, Jan Kara wrote:
> > It's kinda weird that sync() is ends up accessing bdev's without any
> > synchronization.  Can't we just make iterate_bdevs() grab bd_mutex and
> > verify bd_disk isn't NULL before calling into the callback?
> 
> This reminded me I've already seen something like this and indeed I've
> already had a very similar discussion in March -
> https://patchwork.kernel.org/patch/8556941/

lol

> Holding bd_mutex in iterate_devs() works but still nothing protects from
> flusher thread just walking across the block device inode and trying to
> write it which would result in the very same oops.

Ah, right.  We aren't implementing either sever or refcnted draining
semantics properly.  I wonder whether we'd be able to retire the inode
synchronously during blkdev_put.

Thanks.

-- 
tejun

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH] mm: Fix a NULL dereference crash while accessing bdev->bd_disk
  2016-11-26  2:06 ` Wei Fang
@ 2016-11-29 23:08   ` Andrew Morton
  -1 siblings, 0 replies; 17+ messages in thread
From: Andrew Morton @ 2016-11-29 23:08 UTC (permalink / raw)
  To: Wei Fang; +Cc: jack, hannes, hch, linux-mm, stable, Al Viro

On Sat, 26 Nov 2016 10:06:22 +0800 Wei Fang <fangwei1@huawei.com> wrote:

> ->bd_disk is assigned to NULL in __blkdev_put() when no one is holding
> the bdev. After that, ->bd_inode still can be touched in the
> blockdev_superblock->s_inodes list before the final iput. So iterate_bdevs()
> can still get this inode, and start writeback on mapping dirty pages.
> ->bd_disk will be dereferenced in mapping_cap_writeback_dirty() in this
> case, and a NULL dereference crash will be triggered:
> 
> Unable to handle kernel NULL pointer dereference at virtual address 00000388
> ...
> [<ffff8000004cb1e4>] blk_get_backing_dev_info+0x1c/0x28
> [<ffff8000001c879c>] __filemap_fdatawrite_range+0x54/0x98
> [<ffff8000001c8804>] filemap_fdatawrite+0x24/0x2c
> [<ffff80000027e7a4>] fdatawrite_one_bdev+0x20/0x28
> [<ffff800000288b44>] iterate_bdevs+0xec/0x144
> [<ffff80000027eb50>] sys_sync+0x84/0xd0
> 
> Since mapping_cap_writeback_dirty() is always return true about
> block device inodes, no need to check it if the inode is a block
> device inode.
> 
> ...
>
> --- a/mm/filemap.c
> +++ b/mm/filemap.c
> @@ -334,8 +334,9 @@ int __filemap_fdatawrite_range(struct address_space *mapping, loff_t start,
>  		.range_end = end,
>  	};
>  
> -	if (!mapping_cap_writeback_dirty(mapping))
> -		return 0;
> +	if (!sb_is_blkdev_sb(mapping->host->i_sb))
> +		if (!mapping_cap_writeback_dirty(mapping))
> +			return 0;
>  
>  	wbc_attach_fdatawrite_inode(&wbc, mapping->host);
>  	ret = do_writepages(mapping, &wbc);

This seems wrong to me.  If __blkdev_put() has got so deep into the
release process as to be zeroing out ->bd_disk then the blockdev's
inode shouldn't be visible to iterate_bdevs()?

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH] mm: Fix a NULL dereference crash while accessing bdev->bd_disk
@ 2016-11-29 23:08   ` Andrew Morton
  0 siblings, 0 replies; 17+ messages in thread
From: Andrew Morton @ 2016-11-29 23:08 UTC (permalink / raw)
  To: Wei Fang; +Cc: jack, hannes, hch, linux-mm, stable, Al Viro

On Sat, 26 Nov 2016 10:06:22 +0800 Wei Fang <fangwei1@huawei.com> wrote:

> ->bd_disk is assigned to NULL in __blkdev_put() when no one is holding
> the bdev. After that, ->bd_inode still can be touched in the
> blockdev_superblock->s_inodes list before the final iput. So iterate_bdevs()
> can still get this inode, and start writeback on mapping dirty pages.
> ->bd_disk will be dereferenced in mapping_cap_writeback_dirty() in this
> case, and a NULL dereference crash will be triggered:
> 
> Unable to handle kernel NULL pointer dereference at virtual address 00000388
> ...
> [<ffff8000004cb1e4>] blk_get_backing_dev_info+0x1c/0x28
> [<ffff8000001c879c>] __filemap_fdatawrite_range+0x54/0x98
> [<ffff8000001c8804>] filemap_fdatawrite+0x24/0x2c
> [<ffff80000027e7a4>] fdatawrite_one_bdev+0x20/0x28
> [<ffff800000288b44>] iterate_bdevs+0xec/0x144
> [<ffff80000027eb50>] sys_sync+0x84/0xd0
> 
> Since mapping_cap_writeback_dirty() is always return true about
> block device inodes, no need to check it if the inode is a block
> device inode.
> 
> ...
>
> --- a/mm/filemap.c
> +++ b/mm/filemap.c
> @@ -334,8 +334,9 @@ int __filemap_fdatawrite_range(struct address_space *mapping, loff_t start,
>  		.range_end = end,
>  	};
>  
> -	if (!mapping_cap_writeback_dirty(mapping))
> -		return 0;
> +	if (!sb_is_blkdev_sb(mapping->host->i_sb))
> +		if (!mapping_cap_writeback_dirty(mapping))
> +			return 0;
>  
>  	wbc_attach_fdatawrite_inode(&wbc, mapping->host);
>  	ret = do_writepages(mapping, &wbc);

This seems wrong to me.  If __blkdev_put() has got so deep into the
release process as to be zeroing out ->bd_disk then the blockdev's
inode shouldn't be visible to iterate_bdevs()?

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH] mm: Fix a NULL dereference crash while accessing bdev->bd_disk
  2016-11-29 23:08   ` Andrew Morton
  (?)
@ 2016-11-30  7:30   ` Jan Kara
  -1 siblings, 0 replies; 17+ messages in thread
From: Jan Kara @ 2016-11-30  7:30 UTC (permalink / raw)
  To: Andrew Morton; +Cc: Wei Fang, jack, hannes, hch, linux-mm, stable, Al Viro

On Tue 29-11-16 15:08:28, Andrew Morton wrote:
> On Sat, 26 Nov 2016 10:06:22 +0800 Wei Fang <fangwei1@huawei.com> wrote:
> 
> > ->bd_disk is assigned to NULL in __blkdev_put() when no one is holding
> > the bdev. After that, ->bd_inode still can be touched in the
> > blockdev_superblock->s_inodes list before the final iput. So iterate_bdevs()
> > can still get this inode, and start writeback on mapping dirty pages.
> > ->bd_disk will be dereferenced in mapping_cap_writeback_dirty() in this
> > case, and a NULL dereference crash will be triggered:
> > 
> > Unable to handle kernel NULL pointer dereference at virtual address 00000388
> > ...
> > [<ffff8000004cb1e4>] blk_get_backing_dev_info+0x1c/0x28
> > [<ffff8000001c879c>] __filemap_fdatawrite_range+0x54/0x98
> > [<ffff8000001c8804>] filemap_fdatawrite+0x24/0x2c
> > [<ffff80000027e7a4>] fdatawrite_one_bdev+0x20/0x28
> > [<ffff800000288b44>] iterate_bdevs+0xec/0x144
> > [<ffff80000027eb50>] sys_sync+0x84/0xd0
> > 
> > Since mapping_cap_writeback_dirty() is always return true about
> > block device inodes, no need to check it if the inode is a block
> > device inode.
> > 
> > ...
> >
> > --- a/mm/filemap.c
> > +++ b/mm/filemap.c
> > @@ -334,8 +334,9 @@ int __filemap_fdatawrite_range(struct address_space *mapping, loff_t start,
> >  		.range_end = end,
> >  	};
> >  
> > -	if (!mapping_cap_writeback_dirty(mapping))
> > -		return 0;
> > +	if (!sb_is_blkdev_sb(mapping->host->i_sb))
> > +		if (!mapping_cap_writeback_dirty(mapping))
> > +			return 0;
> >  
> >  	wbc_attach_fdatawrite_inode(&wbc, mapping->host);
> >  	ret = do_writepages(mapping, &wbc);
> 
> This seems wrong to me.  If __blkdev_put() has got so deep into the
> release process as to be zeroing out ->bd_disk then the blockdev's
> inode shouldn't be visible to iterate_bdevs()?

That's the trouble with how block devices currently work. On last close of
the block device, the block device inode is detached from bd_disk and thus
from request_queue & bdi. bd_disk & company gets freed, inode stays (bdev
inode is referenced by inodes representing block device in the filesystem
which are referenced by dentries). This happens asynchronously wrt
iterate_bdevs() and inode_to_bdi() calls in general - any inode_to_bdi()
call on block device inode can oops if it happens to race with
__blkdev_put().  The use of inode_to_bdi() in mapping_cap_writeback_dirty()
from iterate_bdevs() is one such possibility - that is relatively easy to
fix by modifying iterate_bdevs() however it is not so easy to protect in
this way inode_to_bdi() calls in writeback happening periodically from the
flusher work.
								Honza

-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH] mm: Fix a NULL dereference crash while accessing bdev->bd_disk
  2016-11-29 16:43       ` Tejun Heo
@ 2016-11-30  9:50         ` Jan Kara
  0 siblings, 0 replies; 17+ messages in thread
From: Jan Kara @ 2016-11-30  9:50 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Jan Kara, Wei Fang, akpm, hannes, hch, linux-mm, stable, Jens Axboe

On Tue 29-11-16 11:43:50, Tejun Heo wrote:
> Hello, Jan.
> 
> On Tue, Nov 29, 2016 at 10:30:35AM +0100, Jan Kara wrote:
> > > It's kinda weird that sync() is ends up accessing bdev's without any
> > > synchronization.  Can't we just make iterate_bdevs() grab bd_mutex and
> > > verify bd_disk isn't NULL before calling into the callback?
> > 
> > This reminded me I've already seen something like this and indeed I've
> > already had a very similar discussion in March -
> > https://patchwork.kernel.org/patch/8556941/
> 
> lol
> 
> > Holding bd_mutex in iterate_devs() works but still nothing protects from
> > flusher thread just walking across the block device inode and trying to
> > write it which would result in the very same oops.
> 
> Ah, right.  We aren't implementing either sever or refcnted draining
> semantics properly.  I wonder whether we'd be able to retire the inode
> synchronously during blkdev_put.

Yeah, I've realized flusher thread is mostly taken care of by the fact that
__blkdev_put() does bdev_write_inode() which waits for I_SYNC to get
cleared and then the inode is clean so writeback code mostly ignores it. It
is fragile but likely it works. So for now I've decided to just push the
patch mentioned above to get at least obvious breakage fixed as playing
with bdev lifetime rules definitely won't be a stable material anyway.

I was also thinking about completely tearing down bdev inode in
__blkdev_put() and it could be doable although hunting all inodes
referencing bdev inode through i_bdev will be tricky. Probably we'll need
i_devices things for block devices back...

								Honza
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH] mm: Fix a NULL dereference crash while accessing bdev->bd_disk
  2016-11-29  1:58     ` Wei Fang
@ 2016-11-30  9:51       ` Jan Kara
  -1 siblings, 0 replies; 17+ messages in thread
From: Jan Kara @ 2016-11-30  9:51 UTC (permalink / raw)
  To: Wei Fang
  Cc: Jan Kara, akpm, hannes, hch, linux-mm, stable, Jens Axboe, Tejun Heo

[-- Attachment #1: Type: text/plain, Size: 696 bytes --]

On Tue 29-11-16 09:58:31, Wei Fang wrote:
> Hi, Jan,
> 
> On 2016/11/28 18:07, Jan Kara wrote:
> > Good catch but I don't like sprinkling checks like this into the writeback
> > code and furthermore we don't want to call into writeback code when block
> > device is in the process of being destroyed which is what would happen with
> > your patch. That is a bug waiting to happen...
> 
> Agreed. Need another way to fix this problem. I looked through the
> writeback cgroup code in __filemap_fdatawrite_range(), found if we
> turn on CONFIG_CGROUP_WRITEBACK, a new crash will happen.

OK, can you test with attached patch please? Thanks!

								Honza
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

[-- Attachment #2: 0001-block-protect-iterate_bdevs-against-concurrent-close.patch --]
[-- Type: text/x-patch, Size: 3901 bytes --]

>From ef10e4f52d2d05982fbeba09e48a4253b5fd1119 Mon Sep 17 00:00:00 2001
From: Rabin Vincent <rabinv@axis.com>
Date: Thu, 10 Mar 2016 13:26:03 +0100
Subject: [PATCH] block: protect iterate_bdevs() against concurrent close

If a block device is closed while iterate_bdevs() is handling it, the
following NULL pointer dereference occurs because bdev->b_disk is NULL
in bdev_get_queue(), which is called from blk_get_backing_dev_info() (in
turn called by the mapping_cap_writeback_dirty() call in
__filemap_fdatawrite_range()):

 BUG: unable to handle kernel NULL pointer dereference at 0000000000000508
 IP: [<ffffffff81314790>] blk_get_backing_dev_info+0x10/0x20
 PGD 9e62067 PUD 9ee8067 PMD 0
 Oops: 0000 [#1] PREEMPT SMP DEBUG_PAGEALLOC
 Modules linked in:
 CPU: 1 PID: 2422 Comm: sync Not tainted 4.5.0-rc7+ #400
 Hardware name: QEMU Standard PC (i440FX + PIIX, 1996)
 task: ffff880009f4d700 ti: ffff880009f5c000 task.ti: ffff880009f5c000
 RIP: 0010:[<ffffffff81314790>]  [<ffffffff81314790>] blk_get_backing_dev_info+0x10/0x20
 RSP: 0018:ffff880009f5fe68  EFLAGS: 00010246
 RAX: 0000000000000000 RBX: ffff88000ec17a38 RCX: ffffffff81a4e940
 RDX: 7fffffffffffffff RSI: 0000000000000000 RDI: ffff88000ec176c0
 RBP: ffff880009f5fe68 R08: 0000000000000000 R09: 0000000000000000
 R10: 0000000000000001 R11: 0000000000000000 R12: ffff88000ec17860
 R13: ffffffff811b25c0 R14: ffff88000ec178e0 R15: ffff88000ec17a38
 FS:  00007faee505d700(0000) GS:ffff88000fb00000(0000) knlGS:0000000000000000
 CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
 CR2: 0000000000000508 CR3: 0000000009e8a000 CR4: 00000000000006e0
 Stack:
  ffff880009f5feb8 ffffffff8112e7f5 0000000000000000 7fffffffffffffff
  0000000000000000 0000000000000000 7fffffffffffffff 0000000000000001
  ffff88000ec178e0 ffff88000ec17860 ffff880009f5fec8 ffffffff8112e81f
 Call Trace:
  [<ffffffff8112e7f5>] __filemap_fdatawrite_range+0x85/0x90
  [<ffffffff8112e81f>] filemap_fdatawrite+0x1f/0x30
  [<ffffffff811b25d6>] fdatawrite_one_bdev+0x16/0x20
  [<ffffffff811bc402>] iterate_bdevs+0xf2/0x130
  [<ffffffff811b2763>] sys_sync+0x63/0x90
  [<ffffffff815d4272>] entry_SYSCALL_64_fastpath+0x12/0x76
 Code: 0f 1f 44 00 00 48 8b 87 f0 00 00 00 55 48 89 e5 <48> 8b 80 08 05 00 00 5d
 RIP  [<ffffffff81314790>] blk_get_backing_dev_info+0x10/0x20
  RSP <ffff880009f5fe68>
 CR2: 0000000000000508
 ---[ end trace 2487336ceb3de62d ]---

The crash is easily reproducible by running the following command, if an
msleep(100) is inserted before the call to func() in iterate_devs():

 while :; do head -c1 /dev/nullb0; done > /dev/null & while :; do sync; done

Fix it by holding the bd_mutex across the func() call and only calling
func() if the bdev is opened.

Cc: stable@vger.kernel.org
Fixes: 5c0d6b60a0ba46d45020547eacf7199171920935
Reported-by: Wei Fang <fangwei1@huawei.com>
Signed-off-by: Rabin Vincent <rabinv@axis.com>
Signed-off-by: Jan Kara <jack@suse.cz>
---
 fs/block_dev.c | 7 ++++++-
 1 file changed, 6 insertions(+), 1 deletion(-)

diff --git a/fs/block_dev.c b/fs/block_dev.c
index 05b553368bb4..899fa8ccc347 100644
--- a/fs/block_dev.c
+++ b/fs/block_dev.c
@@ -1950,6 +1950,7 @@ void iterate_bdevs(void (*func)(struct block_device *, void *), void *arg)
 	spin_lock(&blockdev_superblock->s_inode_list_lock);
 	list_for_each_entry(inode, &blockdev_superblock->s_inodes, i_sb_list) {
 		struct address_space *mapping = inode->i_mapping;
+		struct block_device *bdev;
 
 		spin_lock(&inode->i_lock);
 		if (inode->i_state & (I_FREEING|I_WILL_FREE|I_NEW) ||
@@ -1970,8 +1971,12 @@ void iterate_bdevs(void (*func)(struct block_device *, void *), void *arg)
 		 */
 		iput(old_inode);
 		old_inode = inode;
+		bdev = I_BDEV(inode);
 
-		func(I_BDEV(inode), arg);
+		mutex_lock(&bdev->bd_mutex);
+		if (bdev->bd_openers)
+			func(bdev, arg);
+		mutex_unlock(&bdev->bd_mutex);
 
 		spin_lock(&blockdev_superblock->s_inode_list_lock);
 	}
-- 
2.6.6


^ permalink raw reply related	[flat|nested] 17+ messages in thread

* Re: [PATCH] mm: Fix a NULL dereference crash while accessing bdev->bd_disk
@ 2016-11-30  9:51       ` Jan Kara
  0 siblings, 0 replies; 17+ messages in thread
From: Jan Kara @ 2016-11-30  9:51 UTC (permalink / raw)
  To: Wei Fang
  Cc: Jan Kara, akpm, hannes, hch, linux-mm, stable, Jens Axboe, Tejun Heo

[-- Attachment #1: Type: text/plain, Size: 696 bytes --]

On Tue 29-11-16 09:58:31, Wei Fang wrote:
> Hi, Jan,
> 
> On 2016/11/28 18:07, Jan Kara wrote:
> > Good catch but I don't like sprinkling checks like this into the writeback
> > code and furthermore we don't want to call into writeback code when block
> > device is in the process of being destroyed which is what would happen with
> > your patch. That is a bug waiting to happen...
> 
> Agreed. Need another way to fix this problem. I looked through the
> writeback cgroup code in __filemap_fdatawrite_range(), found if we
> turn on CONFIG_CGROUP_WRITEBACK, a new crash will happen.

OK, can you test with attached patch please? Thanks!

								Honza
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

[-- Attachment #2: 0001-block-protect-iterate_bdevs-against-concurrent-close.patch --]
[-- Type: text/x-patch, Size: 0 bytes --]



^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH] mm: Fix a NULL dereference crash while accessing bdev->bd_disk
  2016-11-30  9:51       ` Jan Kara
@ 2016-12-01  2:30         ` Wei Fang
  -1 siblings, 0 replies; 17+ messages in thread
From: Wei Fang @ 2016-12-01  2:30 UTC (permalink / raw)
  To: Jan Kara; +Cc: akpm, hannes, hch, linux-mm, stable, Jens Axboe, Tejun Heo

Hi, Jan,

On 2016/11/30 17:51, Jan Kara wrote:
> On Tue 29-11-16 09:58:31, Wei Fang wrote:
>> Hi, Jan,
>>
>> On 2016/11/28 18:07, Jan Kara wrote:
>>> Good catch but I don't like sprinkling checks like this into the writeback
>>> code and furthermore we don't want to call into writeback code when block
>>> device is in the process of being destroyed which is what would happen with
>>> your patch. That is a bug waiting to happen...
>>
>> Agreed. Need another way to fix this problem. I looked through the
>> writeback cgroup code in __filemap_fdatawrite_range(), found if we
>> turn on CONFIG_CGROUP_WRITEBACK, a new crash will happen.
> 
> OK, can you test with attached patch please? Thanks!

I've tested this patch with linux-next about 2 hours, and all goes well.
Without this patch, kernel crashes in minutes.

Thanks,
Wei

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH] mm: Fix a NULL dereference crash while accessing bdev->bd_disk
@ 2016-12-01  2:30         ` Wei Fang
  0 siblings, 0 replies; 17+ messages in thread
From: Wei Fang @ 2016-12-01  2:30 UTC (permalink / raw)
  To: Jan Kara; +Cc: akpm, hannes, hch, linux-mm, stable, Jens Axboe, Tejun Heo

Hi, Jan,

On 2016/11/30 17:51, Jan Kara wrote:
> On Tue 29-11-16 09:58:31, Wei Fang wrote:
>> Hi, Jan,
>>
>> On 2016/11/28 18:07, Jan Kara wrote:
>>> Good catch but I don't like sprinkling checks like this into the writeback
>>> code and furthermore we don't want to call into writeback code when block
>>> device is in the process of being destroyed which is what would happen with
>>> your patch. That is a bug waiting to happen...
>>
>> Agreed. Need another way to fix this problem. I looked through the
>> writeback cgroup code in __filemap_fdatawrite_range(), found if we
>> turn on CONFIG_CGROUP_WRITEBACK, a new crash will happen.
> 
> OK, can you test with attached patch please? Thanks!

I've tested this patch with linux-next about 2 hours, and all goes well.
Without this patch, kernel crashes in minutes.

Thanks,
Wei

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH] mm: Fix a NULL dereference crash while accessing bdev->bd_disk
  2016-12-01  2:30         ` Wei Fang
  (?)
@ 2016-12-01  8:18         ` Jan Kara
  -1 siblings, 0 replies; 17+ messages in thread
From: Jan Kara @ 2016-12-01  8:18 UTC (permalink / raw)
  To: Wei Fang
  Cc: Jan Kara, akpm, hannes, hch, linux-mm, stable, Jens Axboe, Tejun Heo

On Thu 01-12-16 10:30:05, Wei Fang wrote:
> On 2016/11/30 17:51, Jan Kara wrote:
> > On Tue 29-11-16 09:58:31, Wei Fang wrote:
> >> Hi, Jan,
> >>
> >> On 2016/11/28 18:07, Jan Kara wrote:
> >>> Good catch but I don't like sprinkling checks like this into the writeback
> >>> code and furthermore we don't want to call into writeback code when block
> >>> device is in the process of being destroyed which is what would happen with
> >>> your patch. That is a bug waiting to happen...
> >>
> >> Agreed. Need another way to fix this problem. I looked through the
> >> writeback cgroup code in __filemap_fdatawrite_range(), found if we
> >> turn on CONFIG_CGROUP_WRITEBACK, a new crash will happen.
> > 
> > OK, can you test with attached patch please? Thanks!
> 
> I've tested this patch with linux-next about 2 hours, and all goes well.
> Without this patch, kernel crashes in minutes.

Good. Thanks for testing! I'll send the patch for inclusion.

								hONZA
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 17+ messages in thread

end of thread, other threads:[~2016-12-01  8:18 UTC | newest]

Thread overview: 17+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2016-11-26  2:06 [PATCH] mm: Fix a NULL dereference crash while accessing bdev->bd_disk Wei Fang
2016-11-26  2:06 ` Wei Fang
2016-11-28 10:07 ` Jan Kara
2016-11-28 15:57   ` Tejun Heo
2016-11-29  9:30     ` Jan Kara
2016-11-29 16:43       ` Tejun Heo
2016-11-30  9:50         ` Jan Kara
2016-11-29  1:58   ` Wei Fang
2016-11-29  1:58     ` Wei Fang
2016-11-30  9:51     ` Jan Kara
2016-11-30  9:51       ` Jan Kara
2016-12-01  2:30       ` Wei Fang
2016-12-01  2:30         ` Wei Fang
2016-12-01  8:18         ` Jan Kara
2016-11-29 23:08 ` Andrew Morton
2016-11-29 23:08   ` Andrew Morton
2016-11-30  7:30   ` Jan Kara

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.