All of lore.kernel.org
 help / color / mirror / Atom feed
* [PATCH] btrfs: Correctly free extent buffer in case btree_read_extent_buffer_pages fails
@ 2019-03-12 13:18 Nikolay Borisov
  2019-03-13 18:03 ` David Sterba
  0 siblings, 1 reply; 4+ messages in thread
From: Nikolay Borisov @ 2019-03-12 13:18 UTC (permalink / raw)
  To: linux-btrfs; +Cc: Nikolay Borisov

If a an eb fails to be read for whatever reason - it's corrupted on disk
and parent transid/key validations fail or IO for eb pages fail then
this buffer must be removed from the buffer cache. Currently the code
calls free_extent_buffer if an error occurs. Unfortunately this doesn't
achieve the desired behavior since btrfs_find_create_tree_block returns
with eb->refs == 2. On the other hand free_extent_buffer will only
decrement the refs once leavin it added to the buffer cache radix tree.
This enables later code to look up the buffer from the cache and utilize
it potentially leading to a crash.

The correct way to free the buffer is call free_extent_buffer_stale.
This function will correctly call atomic_dec explicitly for the buffer
and subsequently call release_extent_buffer which will decrement the
final reference thus correctly remove the invalid buffer from buffer
cache. This change affects only newly allocated buffers since they have
eb->refs == 2.

Link: https://bugzilla.kernel.org/show_bug.cgi?id=202755
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
---
This patch was inspired by Qu's "btrfs: Check the first key and level for
cached extent buffer". Though fixing Qu's crash is only a side effect, what  I 
was aiming for is "correct behavior" - in this case immediately remove an eb 
from the eb cache if it's detected broken. This, however, doesn't eliminate 
the need for Qu's patch which adds the parent check in read_block_for_search.
 
I have validated it it using the image from the bugzilla issue and reading 
/MOUNT/foo/bar/stress/f7. Without my patch (or Qu's) it crashes with either of 
them I get: 

cat: scratch/foo/bar/stress/f7: Input/output error

This also survived a full xfstest run with no regressions. 

 fs/btrfs/disk-io.c | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
index f7010312d171..03df73de475c 100644
--- a/fs/btrfs/disk-io.c
+++ b/fs/btrfs/disk-io.c
@@ -1043,12 +1043,12 @@ int reada_tree_block_flagged(struct btrfs_fs_info *fs_info, u64 bytenr,
 	ret = read_extent_buffer_pages(io_tree, buf, WAIT_PAGE_LOCK,
 				       mirror_num);
 	if (ret) {
-		free_extent_buffer(buf);
+		free_extent_buffer_stale(buf);
 		return ret;
 	}
 
 	if (test_bit(EXTENT_BUFFER_CORRUPT, &buf->bflags)) {
-		free_extent_buffer(buf);
+		free_extent_buffer_stale(buf);
 		return -EIO;
 	} else if (extent_buffer_uptodate(buf)) {
 		*eb = buf;
@@ -1102,7 +1102,7 @@ struct extent_buffer *read_tree_block(struct btrfs_fs_info *fs_info, u64 bytenr,
 	ret = btree_read_extent_buffer_pages(fs_info, buf, parent_transid,
 					     level, first_key);
 	if (ret) {
-		free_extent_buffer(buf);
+		free_extent_buffer_stale(buf);
 		return ERR_PTR(ret);
 	}
 	return buf;
-- 
2.17.1


^ permalink raw reply related	[flat|nested] 4+ messages in thread

* Re: [PATCH] btrfs: Correctly free extent buffer in case btree_read_extent_buffer_pages fails
  2019-03-12 13:18 [PATCH] btrfs: Correctly free extent buffer in case btree_read_extent_buffer_pages fails Nikolay Borisov
@ 2019-03-13 18:03 ` David Sterba
  2019-03-14  7:10   ` Nikolay Borisov
  0 siblings, 1 reply; 4+ messages in thread
From: David Sterba @ 2019-03-13 18:03 UTC (permalink / raw)
  To: Nikolay Borisov; +Cc: linux-btrfs

On Tue, Mar 12, 2019 at 03:18:47PM +0200, Nikolay Borisov wrote:
> If a an eb fails to be read for whatever reason - it's corrupted on disk
> and parent transid/key validations fail or IO for eb pages fail then
> this buffer must be removed from the buffer cache. Currently the code
> calls free_extent_buffer if an error occurs. Unfortunately this doesn't
> achieve the desired behavior since btrfs_find_create_tree_block returns
> with eb->refs == 2. On the other hand free_extent_buffer will only
> decrement the refs once leavin it added to the buffer cache radix tree.
> This enables later code to look up the buffer from the cache and utilize
> it potentially leading to a crash.
> 
> The correct way to free the buffer is call free_extent_buffer_stale.
> This function will correctly call atomic_dec explicitly for the buffer
> and subsequently call release_extent_buffer which will decrement the
> final reference thus correctly remove the invalid buffer from buffer
> cache. This change affects only newly allocated buffers since they have
> eb->refs == 2.

Following the same pattern, should every use of
btrfs_find_create_tree_block be followed by free_extent_buffer_stale in
case of errors?

There's almost the same sequence in readahead_tree_block as you update
in reada_tree_block_flagged:

  btrfs_find_create_tree_block
  read_extent_buffer_pages
  free_extent_buffer

so should be the _stale variant used there too?

free_extent_buffer_stale is normally used in other contexts where the
'stale' part means that the eb is known to be disconnected.

I think that btrfs_find_create_tree_block needs a dedicated freeing
function that will make sure all invariants (refs == 2, no other
structures connected) are met and then do effectively the same what
_stale does now.

^ permalink raw reply	[flat|nested] 4+ messages in thread

* Re: [PATCH] btrfs: Correctly free extent buffer in case btree_read_extent_buffer_pages fails
  2019-03-13 18:03 ` David Sterba
@ 2019-03-14  7:10   ` Nikolay Borisov
  2019-03-18 19:01     ` David Sterba
  0 siblings, 1 reply; 4+ messages in thread
From: Nikolay Borisov @ 2019-03-14  7:10 UTC (permalink / raw)
  To: dsterba, linux-btrfs



On 13.03.19 г. 20:03 ч., David Sterba wrote:
> On Tue, Mar 12, 2019 at 03:18:47PM +0200, Nikolay Borisov wrote:
>> If a an eb fails to be read for whatever reason - it's corrupted on disk
>> and parent transid/key validations fail or IO for eb pages fail then
>> this buffer must be removed from the buffer cache. Currently the code
>> calls free_extent_buffer if an error occurs. Unfortunately this doesn't
>> achieve the desired behavior since btrfs_find_create_tree_block returns
>> with eb->refs == 2. On the other hand free_extent_buffer will only
>> decrement the refs once leavin it added to the buffer cache radix tree.
>> This enables later code to look up the buffer from the cache and utilize
>> it potentially leading to a crash.
>>
>> The correct way to free the buffer is call free_extent_buffer_stale.
>> This function will correctly call atomic_dec explicitly for the buffer
>> and subsequently call release_extent_buffer which will decrement the
>> final reference thus correctly remove the invalid buffer from buffer
>> cache. This change affects only newly allocated buffers since they have
>> eb->refs == 2.
> 
> Following the same pattern, should every use of
> btrfs_find_create_tree_block be followed by free_extent_buffer_stale in
> case of errors?
> 
> There's almost the same sequence in readahead_tree_block as you update
> in reada_tree_block_flagged:
> 
>   btrfs_find_create_tree_block
>   read_extent_buffer_pages
>   free_extent_buffer
> 
> so should be the _stale variant used there too?

You are right it should also be using the _stale variant. I will fix it.

Despite my cleanup of the extent buffer freeing code I'm still not
entirely happy with the special cases we have with the stale or unmapped
logic. That's why we need all these special cases ;\

> 
> free_extent_buffer_stale is normally used in other contexts where the
> 'stale' part means that the eb is known to be disconnected.
> 
> I think that btrfs_find_create_tree_block needs a dedicated freeing
> function that will make sure all invariants (refs == 2, no other
> structures connected) are met and then do effectively the same what
> _stale does now.
Sounds good though I'm a bit reluctant to add yet another
free_extent_buffer variant. IMO the logic of the extent buffer radix
tree need to be revised which could lead to simplifications in the whole
ref counting code.

> 

^ permalink raw reply	[flat|nested] 4+ messages in thread

* Re: [PATCH] btrfs: Correctly free extent buffer in case btree_read_extent_buffer_pages fails
  2019-03-14  7:10   ` Nikolay Borisov
@ 2019-03-18 19:01     ` David Sterba
  0 siblings, 0 replies; 4+ messages in thread
From: David Sterba @ 2019-03-18 19:01 UTC (permalink / raw)
  To: Nikolay Borisov; +Cc: dsterba, linux-btrfs

On Thu, Mar 14, 2019 at 09:10:42AM +0200, Nikolay Borisov wrote:
> >> The correct way to free the buffer is call free_extent_buffer_stale.
> >> This function will correctly call atomic_dec explicitly for the buffer
> >> and subsequently call release_extent_buffer which will decrement the
> >> final reference thus correctly remove the invalid buffer from buffer
> >> cache. This change affects only newly allocated buffers since they have
> >> eb->refs == 2.
> > 
> > Following the same pattern, should every use of
> > btrfs_find_create_tree_block be followed by free_extent_buffer_stale in
> > case of errors?
> > 
> > There's almost the same sequence in readahead_tree_block as you update
> > in reada_tree_block_flagged:
> > 
> >   btrfs_find_create_tree_block
> >   read_extent_buffer_pages
> >   free_extent_buffer
> > 
> > so should be the _stale variant used there too?
> 
> You are right it should also be using the _stale variant. I will fix it.
> 
> Despite my cleanup of the extent buffer freeing code I'm still not
> entirely happy with the special cases we have with the stale or unmapped
> logic. That's why we need all these special cases ;\
> 
> > 
> > free_extent_buffer_stale is normally used in other contexts where the
> > 'stale' part means that the eb is known to be disconnected.
> > 
> > I think that btrfs_find_create_tree_block needs a dedicated freeing
> > function that will make sure all invariants (refs == 2, no other
> > structures connected) are met and then do effectively the same what
> > _stale does now.
> Sounds good though I'm a bit reluctant to add yet another
> free_extent_buffer variant. IMO the logic of the extent buffer radix
> tree need to be revised which could lead to simplifications in the whole
> ref counting code.

Revised maybe, but this is the core of the MM <-> FS interaction and
must be correct, even after the simplification. There's RCU in the mix,
non-standard refcounting, page tracking. This won't be easy. Adding a
admitedly why-yet-another special freeing helper would at least stop
abusing the API and annotate the context. That's an intermediate step.

^ permalink raw reply	[flat|nested] 4+ messages in thread

end of thread, other threads:[~2019-03-18 19:00 UTC | newest]

Thread overview: 4+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2019-03-12 13:18 [PATCH] btrfs: Correctly free extent buffer in case btree_read_extent_buffer_pages fails Nikolay Borisov
2019-03-13 18:03 ` David Sterba
2019-03-14  7:10   ` Nikolay Borisov
2019-03-18 19:01     ` David Sterba

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.