linux-kernel.vger.kernel.org archive mirror
* [PATCH btrfs/for-next] btrfs: fix fatal extent_buffer readahead vs releasepage race
@ 2020-06-17 16:27 Boris Burkov
  2020-06-17 17:20 ` Filipe Manana
                   ` (2 more replies)
  0 siblings, 3 replies; 12+ messages in thread
From: Boris Burkov @ 2020-06-17 16:27 UTC (permalink / raw)
  To: Chris Mason, Josef Bacik, David Sterba
  Cc: linux-btrfs, linux-kernel, kernel-team, Boris Burkov

Under somewhat convoluted conditions, it is possible to attempt to
release an extent_buffer that is under io, which triggers a BUG_ON in
btrfs_release_extent_buffer_pages.

This relies on a few different factors. First, extent_buffer reads done
as readahead for searching use WAIT_NONE, so they free the local extent
buffer reference while the io is outstanding. However, they should still
be protected by TREE_REF. If the system is doing significant
reclaim, and simultaneously heavily accessing the extent_buffers, it is
possible for releasepage to race with two concurrent readahead attempts
in a way that leaves TREE_REF unset when the readahead extent buffer is
released.

Essentially, if two tasks race to allocate a new extent_buffer, but the
winner who attempts the first io is rebuffed by a page being locked
(likely by the reclaim itself) then the loser will still go ahead with
issuing the readahead. The loser's call to find_extent_buffer must also
race with the reclaim task reading the extent_buffer's refcount as 1 in
a way that allows the reclaim to re-clear the TREE_REF checked by
find_extent_buffer.

The following represents an example execution demonstrating the race:

            CPU0                                                         CPU1                                           CPU2
reada_for_search                                            reada_for_search
  readahead_tree_block                                        readahead_tree_block
    find_create_tree_block                                      find_create_tree_block
      alloc_extent_buffer                                         alloc_extent_buffer
                                                                  find_extent_buffer // not found
                                                                  allocates eb
                                                                  lock pages
                                                                  associate pages to eb
                                                                  insert eb into radix tree
                                                                  set TREE_REF, refs == 2
                                                                  unlock pages
                                                              read_extent_buffer_pages // WAIT_NONE
                                                                not uptodate (brand new eb)
                                                                                                            lock_page
                                                                if !trylock_page
                                                                  goto unlock_exit // not an error
                                                              free_extent_buffer
                                                                release_extent_buffer
                                                                  atomic_dec_and_test refs to 1
        find_extent_buffer // found
                                                                                                            try_release_extent_buffer
                                                                                                              take refs_lock
                                                                                                              reads refs == 1; no io
          atomic_inc_not_zero refs to 2
          mark_buffer_accessed
            check_buffer_tree_ref
              // not STALE, won't take refs_lock
              refs == 2; TREE_REF set // no action
    read_extent_buffer_pages // WAIT_NONE
                                                                                                              clear TREE_REF
                                                                                                              release_extent_buffer
                                                                                                                atomic_dec_and_test refs to 1
                                                                                                                unlock_page
      still not uptodate (CPU1 read failed on trylock_page)
      locks pages
      set io_pages > 0
      submit io
      return
    release_extent_buffer
      dec refs to 0
      delete from radix tree
      btrfs_release_extent_buffer_pages
        BUG_ON(io_pages > 0)!!!

We observe this at a very low rate in production and were also able to
reproduce it in a test environment by introducing some spurious delays
and probabilistic trylock_page failures.

To fix it, we apply check_buffer_tree_ref at a point where it could not
possibly be unset by a competing task: after io_pages has been set.
There is no race in write_one_eb, that we know of, but for
consistency, apply it there too. All the codepaths that clear TREE_REF
check for io, so they would not be able to clear it after this point.

Signed-off-by: Boris Burkov <boris@bur.io>
---
 fs/btrfs/extent_io.c | 45 ++++++++++++++++++++++++++++----------------
 1 file changed, 29 insertions(+), 16 deletions(-)

diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
index c59e07360083..f6758ebbb6a2 100644
--- a/fs/btrfs/extent_io.c
+++ b/fs/btrfs/extent_io.c
@@ -3927,6 +3927,11 @@ static noinline_for_stack int write_one_eb(struct extent_buffer *eb,
 	clear_bit(EXTENT_BUFFER_WRITE_ERR, &eb->bflags);
 	num_pages = num_extent_pages(eb);
 	atomic_set(&eb->io_pages, num_pages);
+	/*
+	 * It is possible for releasepage to clear the TREE_REF bit before we
+	 * set io_pages. See check_buffer_tree_ref for a more detailed comment.
+	 */
+	check_buffer_tree_ref(eb);
 
 	/* set btree blocks beyond nritems with 0 to avoid stale content. */
 	nritems = btrfs_header_nritems(eb);
@@ -5086,25 +5091,28 @@ struct extent_buffer *alloc_dummy_extent_buffer(struct btrfs_fs_info *fs_info,
 static void check_buffer_tree_ref(struct extent_buffer *eb)
 {
 	int refs;
-	/* the ref bit is tricky.  We have to make sure it is set
-	 * if we have the buffer dirty.   Otherwise the
-	 * code to free a buffer can end up dropping a dirty
-	 * page
+	/*
+	 * The TREE_REF bit is first set when the extent_buffer is added
+	 * to the radix tree. It is also reset, if unset, when a new reference
+	 * is created by find_extent_buffer.
 	 *
-	 * Once the ref bit is set, it won't go away while the
-	 * buffer is dirty or in writeback, and it also won't
-	 * go away while we have the reference count on the
-	 * eb bumped.
+	 * It is only cleared in two cases: freeing the last non-tree
+	 * reference to the extent_buffer when its STALE bit is set or
+	 * calling releasepage when the tree reference is the only reference.
 	 *
-	 * We can't just set the ref bit without bumping the
-	 * ref on the eb because free_extent_buffer might
-	 * see the ref bit and try to clear it.  If this happens
-	 * free_extent_buffer might end up dropping our original
-	 * ref by mistake and freeing the page before we are able
-	 * to add one more ref.
+	 * In both cases, care is taken to ensure that the extent_buffer's
+	 * pages are not under io. However, releasepage can be concurrently
+	 * called with creating new references, which is prone to race
+	 * conditions between the calls to check_buffer_tree_ref in those codepaths
+	 * and clearing TREE_REF in try_release_extent_buffer.
 	 *
-	 * So bump the ref count first, then set the bit.  If someone
-	 * beat us to it, drop the ref we added.
+	 * The actual lifetime of the extent_buffer in the radix tree is
+	 * adequately protected by the refcount, but the TREE_REF bit and
+	 * its corresponding reference are not. To protect against this
+	 * class of races, we call check_buffer_tree_ref from the codepaths
+	 * which trigger io after they set eb->io_pages. Note that once io is
+	 * initiated, TREE_REF can no longer be cleared, so that is the
+	 * moment at which any such race is best fixed.
 	 */
 	refs = atomic_read(&eb->refs);
 	if (refs >= 2 && test_bit(EXTENT_BUFFER_TREE_REF, &eb->bflags))
@@ -5555,6 +5563,11 @@ int read_extent_buffer_pages(struct extent_buffer *eb, int wait, int mirror_num)
 	clear_bit(EXTENT_BUFFER_READ_ERR, &eb->bflags);
 	eb->read_mirror = 0;
 	atomic_set(&eb->io_pages, num_reads);
+	/*
+	 * It is possible for releasepage to clear the TREE_REF bit before we
+	 * set io_pages. See check_buffer_tree_ref for a more detailed comment.
+	 */
+	check_buffer_tree_ref(eb);
 	for (i = 0; i < num_pages; i++) {
 		page = eb->pages[i];
 
-- 
2.24.1


^ permalink raw reply related	[flat|nested] 12+ messages in thread

* Re: [PATCH btrfs/for-next] btrfs: fix fatal extent_buffer readahead vs releasepage race
  2020-06-17 16:27 [PATCH btrfs/for-next] btrfs: fix fatal extent_buffer readahead vs releasepage race Boris Burkov
@ 2020-06-17 17:20 ` Filipe Manana
  2020-06-17 17:36   ` Filipe Manana
  2020-06-17 17:43   ` Chris Mason
  2020-06-17 21:06 ` [PATCH " kernel test robot
  2020-06-17 22:38 ` kernel test robot
  2 siblings, 2 replies; 12+ messages in thread
From: Filipe Manana @ 2020-06-17 17:20 UTC (permalink / raw)
  To: Boris Burkov
  Cc: Chris Mason, Josef Bacik, David Sterba, linux-btrfs,
	Linux Kernel Mailing List, kernel-team

On Wed, Jun 17, 2020 at 5:32 PM Boris Burkov <boris@bur.io> wrote:
>
> Under somewhat convoluted conditions, it is possible to attempt to
> release an extent_buffer that is under io, which triggers a BUG_ON in
> btrfs_release_extent_buffer_pages.
>
> This relies on a few different factors. First, extent_buffer reads done
> as readahead for searching use WAIT_NONE, so they free the local extent
> buffer reference while the io is outstanding. However, they should still
> be protected by TREE_REF. However, if the system is doing significant
> reclaim, and simultaneously heavily accessing the extent_buffers, it is
> possible for releasepage to race with two concurrent readahead attempts
> in a way that leaves TREE_REF unset when the readahead extent buffer is
> released.
>
> Essentially, if two tasks race to allocate a new extent_buffer, but the
> winner who attempts the first io is rebuffed by a page being locked
> (likely by the reclaim itself) then the loser will still go ahead with
> issuing the readahead. The loser's call to find_extent_buffer must also
> race with the reclaim task reading the extent_buffer's refcount as 1 in
> a way that allows the reclaim to re-clear the TREE_REF checked by
> find_extent_buffer.
>
> The following represents an example execution demonstrating the race:
>
>             CPU0                                                         CPU1                                           CPU2
> reada_for_search                                            reada_for_search
>   readahead_tree_block                                        readahead_tree_block
>     find_create_tree_block                                      find_create_tree_block
>       alloc_extent_buffer                                         alloc_extent_buffer
>                                                                   find_extent_buffer // not found
>                                                                   allocates eb
>                                                                   lock pages
>                                                                   associate pages to eb
>                                                                   insert eb into radix tree
>                                                                   set TREE_REF, refs == 2
>                                                                   unlock pages
>                                                               read_extent_buffer_pages // WAIT_NONE
>                                                                 not uptodate (brand new eb)
>                                                                                                             lock_page
>                                                                 if !trylock_page
>                                                                   goto unlock_exit // not an error
>                                                               free_extent_buffer
>                                                                 release_extent_buffer
>                                                                   atomic_dec_and_test refs to 1
>         find_extent_buffer // found
>                                                                                                             try_release_extent_buffer
>                                                                                                               take refs_lock
>                                                                                                               reads refs == 1; no io
>           atomic_inc_not_zero refs to 2
>           mark_buffer_accessed
>             check_buffer_tree_ref
>               // not STALE, won't take refs_lock
>               refs == 2; TREE_REF set // no action
>     read_extent_buffer_pages // WAIT_NONE
>                                                                                                               clear TREE_REF
>                                                                                                               release_extent_buffer
>                                                                                                                 atomic_dec_and_test refs to 1
>                                                                                                                 unlock_page
>       still not uptodate (CPU1 read failed on trylock_page)
>       locks pages
>       set io_pages > 0
>       submit io
>       return
>     release_extent_buffer

Small detail, missing the call to free_extent_buffer() from
readahead_tree_block(), which is what ends up calling
release_extent_buffer().

>       dec refs to 0
>       delete from radix tree
>       btrfs_release_extent_buffer_pages
>         BUG_ON(io_pages > 0)!!!
>
> We observe this at a very low rate in production and were also able to
> reproduce it in a test environment by introducing some spurious delays
> and by introducing probabilistic trylock_page failures.
>
> To fix it, we apply check_buffer_tree_ref at a point where it could not
> possibly be unset by a competing task: after io_pages has been
> incremented. There is no race in write_one_eb, that we know of, but for
> consistency, apply it there too. All the codepaths that clear TREE_REF
> check for io, so they would not be able to clear it after this point.
>
> Signed-off-by: Boris Burkov <boris@bur.io>

Btw, if you have a stack trace, it would be good to include it in the
change log for grep-ability in case someone runs into the same
problem.

> ---
>  fs/btrfs/extent_io.c | 45 ++++++++++++++++++++++++++++----------------
>  1 file changed, 29 insertions(+), 16 deletions(-)
>
> diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
> index c59e07360083..f6758ebbb6a2 100644
> --- a/fs/btrfs/extent_io.c
> +++ b/fs/btrfs/extent_io.c
> @@ -3927,6 +3927,11 @@ static noinline_for_stack int write_one_eb(struct extent_buffer *eb,
>         clear_bit(EXTENT_BUFFER_WRITE_ERR, &eb->bflags);
>         num_pages = num_extent_pages(eb);
>         atomic_set(&eb->io_pages, num_pages);
> +       /*
> +        * It is possible for releasepage to clear the TREE_REF bit before we
> +        * set io_pages. See check_buffer_tree_ref for a more detailed comment.
> +        */
> +       check_buffer_tree_ref(eb);

This is a whole different case from the one described in the
changelog, as this is in the write path.
Why do we need this one?

At this point the eb pages are marked with the dirty bit, and
btree_releasepage() returns 0 if the page is dirty and we don't end up
calling try_release_extent_buffer().
So I don't understand why this hunk is needed, what variant of the
problem it's trying to solve.

>
>         /* set btree blocks beyond nritems with 0 to avoid stale content. */
>         nritems = btrfs_header_nritems(eb);
> @@ -5086,25 +5091,28 @@ struct extent_buffer *alloc_dummy_extent_buffer(struct btrfs_fs_info *fs_info,
>  static void check_buffer_tree_ref(struct extent_buffer *eb)
>  {
>         int refs;
> -       /* the ref bit is tricky.  We have to make sure it is set
> -        * if we have the buffer dirty.   Otherwise the
> -        * code to free a buffer can end up dropping a dirty
> -        * page
> +       /*
> +        * The TREE_REF bit is first set when the extent_buffer is added
> +        * to the radix tree. It is also reset, if unset, when a new reference
> +        * is created by find_extent_buffer.
>          *
> -        * Once the ref bit is set, it won't go away while the
> -        * buffer is dirty or in writeback, and it also won't
> -        * go away while we have the reference count on the
> -        * eb bumped.
> +        * It is only cleared in two cases: freeing the last non-tree
> +        * reference to the extent_buffer when its STALE bit is set or
> +        * calling releasepage when the tree reference is the only reference.
>          *
> -        * We can't just set the ref bit without bumping the
> -        * ref on the eb because free_extent_buffer might
> -        * see the ref bit and try to clear it.  If this happens
> -        * free_extent_buffer might end up dropping our original
> -        * ref by mistake and freeing the page before we are able
> -        * to add one more ref.
> +        * In both cases, care is taken to ensure that the extent_buffer's
> +        * pages are not under io. However, releasepage can be concurrently
> +        * called with creating new references, which is prone to race
> +        * conditions between the calls to check_buffer_tree_ref in those codepaths
> +        * and clearing TREE_REF in try_release_extent_buffer.
>          *
> -        * So bump the ref count first, then set the bit.  If someone
> -        * beat us to it, drop the ref we added.
> +        * The actual lifetime of the extent_buffer in the radix tree is
> +        * adequately protected by the refcount, but the TREE_REF bit and
> +        * its corresponding reference are not. To protect against this
> +        * class of races, we call check_buffer_tree_ref from the codepaths
> +        * which trigger io after they set eb->io_pages. Note that once io is
> +        * initiated, TREE_REF can no longer be cleared, so that is the
> +        * moment at which any such race is best fixed.
>          */
>         refs = atomic_read(&eb->refs);
>         if (refs >= 2 && test_bit(EXTENT_BUFFER_TREE_REF, &eb->bflags))
> @@ -5555,6 +5563,11 @@ int read_extent_buffer_pages(struct extent_buffer *eb, int wait, int mirror_num)
>         clear_bit(EXTENT_BUFFER_READ_ERR, &eb->bflags);
>         eb->read_mirror = 0;
>         atomic_set(&eb->io_pages, num_reads);
> +       /*
> +        * It is possible for releasepage to clear the TREE_REF bit before we
> +        * set io_pages. See check_buffer_tree_ref for a more detailed comment.
> +        */
> +       check_buffer_tree_ref(eb);

This is the hunk that fixes the problem described in the change log,
and it looks good to me.

Thanks.

>         for (i = 0; i < num_pages; i++) {
>                 page = eb->pages[i];
>
> --
> 2.24.1
>


-- 
Filipe David Manana,

“Whether you think you can, or you think you can't — you're right.”


* Re: [PATCH btrfs/for-next] btrfs: fix fatal extent_buffer readahead vs releasepage race
  2020-06-17 17:20 ` Filipe Manana
@ 2020-06-17 17:36   ` Filipe Manana
  2020-06-17 17:43   ` Chris Mason
  1 sibling, 0 replies; 12+ messages in thread
From: Filipe Manana @ 2020-06-17 17:36 UTC (permalink / raw)
  To: Boris Burkov
  Cc: Chris Mason, Josef Bacik, David Sterba, linux-btrfs,
	Linux Kernel Mailing List, kernel-team

On Wed, Jun 17, 2020 at 6:20 PM Filipe Manana <fdmanana@gmail.com> wrote:
>
> On Wed, Jun 17, 2020 at 5:32 PM Boris Burkov <boris@bur.io> wrote:
> >
> > Under somewhat convoluted conditions, it is possible to attempt to
> > release an extent_buffer that is under io, which triggers a BUG_ON in
> > btrfs_release_extent_buffer_pages.
> >
> > This relies on a few different factors. First, extent_buffer reads done
> > as readahead for searching use WAIT_NONE, so they free the local extent
> > buffer reference while the io is outstanding. However, they should still
> > be protected by TREE_REF. However, if the system is doing significant
> > reclaim, and simultaneously heavily accessing the extent_buffers, it is
> > possible for releasepage to race with two concurrent readahead attempts
> > in a way that leaves TREE_REF unset when the readahead extent buffer is
> > released.
> >
> > Essentially, if two tasks race to allocate a new extent_buffer, but the
> > winner who attempts the first io is rebuffed by a page being locked
> > (likely by the reclaim itself) then the loser will still go ahead with
> > issuing the readahead. The loser's call to find_extent_buffer must also
> > race with the reclaim task reading the extent_buffer's refcount as 1 in
> > a way that allows the reclaim to re-clear the TREE_REF checked by
> > find_extent_buffer.
> >
> > The following represents an example execution demonstrating the race:
> >
> >             CPU0                                                         CPU1                                           CPU2
> > reada_for_search                                            reada_for_search
> >   readahead_tree_block                                        readahead_tree_block
> >     find_create_tree_block                                      find_create_tree_block
> >       alloc_extent_buffer                                         alloc_extent_buffer
> >                                                                   find_extent_buffer // not found
> >                                                                   allocates eb
> >                                                                   lock pages
> >                                                                   associate pages to eb
> >                                                                   insert eb into radix tree
> >                                                                   set TREE_REF, refs == 2
> >                                                                   unlock pages
> >                                                               read_extent_buffer_pages // WAIT_NONE
> >                                                                 not uptodate (brand new eb)
> >                                                                                                             lock_page
> >                                                                 if !trylock_page
> >                                                                   goto unlock_exit // not an error
> >                                                               free_extent_buffer
> >                                                                 release_extent_buffer
> >                                                                   atomic_dec_and_test refs to 1
> >         find_extent_buffer // found
> >                                                                                                             try_release_extent_buffer
> >                                                                                                               take refs_lock
> >                                                                                                               reads refs == 1; no io
> >           atomic_inc_not_zero refs to 2
> >           mark_buffer_accessed
> >             check_buffer_tree_ref
> >               // not STALE, won't take refs_lock
> >               refs == 2; TREE_REF set // no action
> >     read_extent_buffer_pages // WAIT_NONE
> >                                                                                                               clear TREE_REF
> >                                                                                                               release_extent_buffer
> >                                                                                                                 atomic_dec_and_test refs to 1
> >                                                                                                                 unlock_page
> >       still not uptodate (CPU1 read failed on trylock_page)
> >       locks pages
> >       set io_pages > 0
> >       submit io
> >       return
> >     release_extent_buffer
>
> Small detail, missing the call to free_extent_buffer() from
> readahead_tree_block(), which is what ends up calling
> release_extent_buffer().
>
> >       dec refs to 0
> >       delete from radix tree
> >       btrfs_release_extent_buffer_pages
> >         BUG_ON(io_pages > 0)!!!
> >
> > We observe this at a very low rate in production and were also able to
> > reproduce it in a test environment by introducing some spurious delays
> > and by introducing probabilistic trylock_page failures.
> >
> > To fix it, we apply check_buffer_tree_ref at a point where it could not
> > possibly be unset by a competing task: after io_pages has been
> > incremented. There is no race in write_one_eb, that we know of, but for
> > consistency, apply it there too. All the codepaths that clear TREE_REF
> > check for io, so they would not be able to clear it after this point.
> >
> > Signed-off-by: Boris Burkov <boris@bur.io>
>
> Btw, if you have a stack trace, it would be good to include it in the
> change log for grep-ability in case someone runs into the same
> problem.
>
> > ---
> >  fs/btrfs/extent_io.c | 45 ++++++++++++++++++++++++++++----------------
> >  1 file changed, 29 insertions(+), 16 deletions(-)
> >
> > diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
> > index c59e07360083..f6758ebbb6a2 100644
> > --- a/fs/btrfs/extent_io.c
> > +++ b/fs/btrfs/extent_io.c
> > @@ -3927,6 +3927,11 @@ static noinline_for_stack int write_one_eb(struct extent_buffer *eb,
> >         clear_bit(EXTENT_BUFFER_WRITE_ERR, &eb->bflags);
> >         num_pages = num_extent_pages(eb);
> >         atomic_set(&eb->io_pages, num_pages);
> > +       /*
> > +        * It is possible for releasepage to clear the TREE_REF bit before we
> > +        * set io_pages. See check_buffer_tree_ref for a more detailed comment.
> > +        */
> > +       check_buffer_tree_ref(eb);
>
> This is a whole different case from the one described in the
> changelog, as this is in the write path.
> Why do we need this one?
>
> At this point the eb pages are marked with the dirty bit, and
> btree_releasepage() returns 0 if the page is dirty and we don't end up
> calling try_release_extent_buffer().

In addition to that, when write_one_eb() is called we have already
incremented the eb's ref count and EXTENT_BUFFER_WRITEBACK is set on
it, so try_release_extent_buffer() should return early and never call
release_extent_buffer(), since the refcount is > 1 and
extent_buffer_under_io() returns true.

> So I don't understand why this hunk is needed, what variant of the
> problem it's trying to solve.
>
> >
> >         /* set btree blocks beyond nritems with 0 to avoid stale content. */
> >         nritems = btrfs_header_nritems(eb);
> > @@ -5086,25 +5091,28 @@ struct extent_buffer *alloc_dummy_extent_buffer(struct btrfs_fs_info *fs_info,
> >  static void check_buffer_tree_ref(struct extent_buffer *eb)
> >  {
> >         int refs;
> > -       /* the ref bit is tricky.  We have to make sure it is set
> > -        * if we have the buffer dirty.   Otherwise the
> > -        * code to free a buffer can end up dropping a dirty
> > -        * page
> > +       /*
> > +        * The TREE_REF bit is first set when the extent_buffer is added
> > +        * to the radix tree. It is also reset, if unset, when a new reference
> > +        * is created by find_extent_buffer.
> >          *
> > -        * Once the ref bit is set, it won't go away while the
> > -        * buffer is dirty or in writeback, and it also won't
> > -        * go away while we have the reference count on the
> > -        * eb bumped.
> > +        * It is only cleared in two cases: freeing the last non-tree
> > +        * reference to the extent_buffer when its STALE bit is set or
> > +        * calling releasepage when the tree reference is the only reference.
> >          *
> > -        * We can't just set the ref bit without bumping the
> > -        * ref on the eb because free_extent_buffer might
> > -        * see the ref bit and try to clear it.  If this happens
> > -        * free_extent_buffer might end up dropping our original
> > -        * ref by mistake and freeing the page before we are able
> > -        * to add one more ref.
> > +        * In both cases, care is taken to ensure that the extent_buffer's
> > +        * pages are not under io. However, releasepage can be concurrently
> > +        * called with creating new references, which is prone to race
> > +        * conditions between the calls to check_buffer_tree_ref in those codepaths
> > +        * and clearing TREE_REF in try_release_extent_buffer.
> >          *
> > -        * So bump the ref count first, then set the bit.  If someone
> > -        * beat us to it, drop the ref we added.
> > +        * The actual lifetime of the extent_buffer in the radix tree is
> > +        * adequately protected by the refcount, but the TREE_REF bit and
> > +        * its corresponding reference are not. To protect against this
> > +        * class of races, we call check_buffer_tree_ref from the codepaths
> > +        * which trigger io after they set eb->io_pages. Note that once io is
> > +        * initiated, TREE_REF can no longer be cleared, so that is the
> > +        * moment at which any such race is best fixed.
> >          */
> >         refs = atomic_read(&eb->refs);
> >         if (refs >= 2 && test_bit(EXTENT_BUFFER_TREE_REF, &eb->bflags))
> > @@ -5555,6 +5563,11 @@ int read_extent_buffer_pages(struct extent_buffer *eb, int wait, int mirror_num)
> >         clear_bit(EXTENT_BUFFER_READ_ERR, &eb->bflags);
> >         eb->read_mirror = 0;
> >         atomic_set(&eb->io_pages, num_reads);
> > +       /*
> > +        * It is possible for releasepage to clear the TREE_REF bit before we
> > +        * set io_pages. See check_buffer_tree_ref for a more detailed comment.
> > +        */
> > +       check_buffer_tree_ref(eb);
>
> This is the hunk that fixes the problem described in the change log,
> and it looks good to me.
>
> Thanks.
>
> >         for (i = 0; i < num_pages; i++) {
> >                 page = eb->pages[i];
> >
> > --
> > 2.24.1
> >
>
>
> --
> Filipe David Manana,
>
> “Whether you think you can, or you think you can't — you're right.”



-- 
Filipe David Manana,

“Whether you think you can, or you think you can't — you're right.”


* Re: [PATCH btrfs/for-next] btrfs: fix fatal extent_buffer readahead vs releasepage race
  2020-06-17 17:20 ` Filipe Manana
  2020-06-17 17:36   ` Filipe Manana
@ 2020-06-17 17:43   ` Chris Mason
  2020-06-17 18:11     ` Filipe Manana
  1 sibling, 1 reply; 12+ messages in thread
From: Chris Mason @ 2020-06-17 17:43 UTC (permalink / raw)
  To: Filipe Manana
  Cc: Boris Burkov, Josef Bacik, David Sterba, linux-btrfs,
	Linux Kernel Mailing List, kernel-team

On 17 Jun 2020, at 13:20, Filipe Manana wrote:

> On Wed, Jun 17, 2020 at 5:32 PM Boris Burkov <boris@bur.io> wrote:
>
>> ---
>>  fs/btrfs/extent_io.c | 45 
>> ++++++++++++++++++++++++++++----------------
>>  1 file changed, 29 insertions(+), 16 deletions(-)
>>
>> diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
>> index c59e07360083..f6758ebbb6a2 100644
>> --- a/fs/btrfs/extent_io.c
>> +++ b/fs/btrfs/extent_io.c
>> @@ -3927,6 +3927,11 @@ static noinline_for_stack int 
>> write_one_eb(struct extent_buffer *eb,
>>         clear_bit(EXTENT_BUFFER_WRITE_ERR, &eb->bflags);
>>         num_pages = num_extent_pages(eb);
>>         atomic_set(&eb->io_pages, num_pages);
>> +       /*
>> +        * It is possible for releasepage to clear the TREE_REF bit 
>> before we
>> +        * set io_pages. See check_buffer_tree_ref for a more 
>> detailed comment.
>> +        */
>> +       check_buffer_tree_ref(eb);
>
> This is a whole different case from the one described in the
> changelog, as this is in the write path.
> Why do we need this one?

This was Josef’s idea, but I really like the symmetry.  You set 
io_pages, you do the tree_ref dance.  Everyone fiddling with the write 
back bit right now correctly clears writeback after doing the atomic_dec 
on io_pages, but the race is tiny and prone to getting exposed again by 
shifting code around.  Tree ref checks around io_pages are the most 
reliable way to prevent this bug from coming back again later.
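Roughly, the pattern boils down to something like this toy userspace sketch (all names here are invented for illustration, not the kernel code): publish io_pages first, then do the tree_ref dance, and a releasepage-style path can no longer clear the bit underneath us:

```c
#include <assert.h>
#include <stdbool.h>

#define TREE_REF 0x1u

/* Toy model of the relevant extent_buffer state; purely illustrative. */
struct eb_model {
	unsigned int bflags;
	int io_pages;
	int refs;
};

/* releasepage-style path: may only clear TREE_REF when no io is in flight */
static bool try_clear_tree_ref(struct eb_model *eb)
{
	if (eb->io_pages > 0)
		return false;               /* io initiated: bit is pinned */
	eb->bflags &= ~TREE_REF;
	return true;
}

/* the ordering enforced in both the read and write io submission paths:
 * publish io_pages, then re-take the tree ref */
static void start_io(struct eb_model *eb, int num_pages)
{
	eb->io_pages = num_pages;       /* set first... */
	if (!(eb->bflags & TREE_REF)) { /* ...then the tree_ref dance */
		eb->refs++;
		eb->bflags |= TREE_REF;
	}
}
```

In the model, even if a racing release already stripped TREE_REF before io started, start_io restores it and the clearing path is refused for as long as io_pages is nonzero.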

-chris


* Re: [PATCH btrfs/for-next] btrfs: fix fatal extent_buffer readahead vs releasepage race
  2020-06-17 17:43   ` Chris Mason
@ 2020-06-17 18:11     ` Filipe Manana
  2020-06-17 18:19       ` Josef Bacik
  0 siblings, 1 reply; 12+ messages in thread
From: Filipe Manana @ 2020-06-17 18:11 UTC (permalink / raw)
  To: Chris Mason
  Cc: Boris Burkov, Josef Bacik, David Sterba, linux-btrfs,
	Linux Kernel Mailing List, kernel-team

On Wed, Jun 17, 2020 at 6:43 PM Chris Mason <clm@fb.com> wrote:
>
> On 17 Jun 2020, at 13:20, Filipe Manana wrote:
>
> > On Wed, Jun 17, 2020 at 5:32 PM Boris Burkov <boris@bur.io> wrote:
> >
> >> ---
> >>  fs/btrfs/extent_io.c | 45
> >> ++++++++++++++++++++++++++++----------------
> >>  1 file changed, 29 insertions(+), 16 deletions(-)
> >>
> >> diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
> >> index c59e07360083..f6758ebbb6a2 100644
> >> --- a/fs/btrfs/extent_io.c
> >> +++ b/fs/btrfs/extent_io.c
> >> @@ -3927,6 +3927,11 @@ static noinline_for_stack int
> >> write_one_eb(struct extent_buffer *eb,
> >>         clear_bit(EXTENT_BUFFER_WRITE_ERR, &eb->bflags);
> >>         num_pages = num_extent_pages(eb);
> >>         atomic_set(&eb->io_pages, num_pages);
> >> +       /*
> >> +        * It is possible for releasepage to clear the TREE_REF bit
> >> before we
> >> +        * set io_pages. See check_buffer_tree_ref for a more
> >> detailed comment.
> >> +        */
> >> +       check_buffer_tree_ref(eb);
> >
> > This is a whole different case from the one described in the
> > changelog, as this is in the write path.
> > Why do we need this one?
>
> This was Josef’s idea, but I really like the symmetry.  You set
> io_pages, you do the tree_ref dance.  Everyone fiddling with the write
> back bit right now correctly clears writeback after doing the atomic_dec
> on io_pages, but the race is tiny and prone to getting exposed again by
> shifting code around.  Tree ref checks around io_pages are the most
> reliable way to prevent this bug from coming back again later.

Ok, but that still doesn't answer my question.
Is there an actual race/problem this hunk solves?

Before calling write_one_eb() we increment the ref on the eb and we
also call lock_extent_buffer_for_io(),
which clears the dirty bit and sets the writeback bit on the eb while
holding its refs_lock spin_lock.

Even if we get to try_release_extent_buffer, it calls
extent_buffer_under_io(eb) while holding the refs_lock spin_lock,
so at any time it should return true, as either the dirty or the
writeback bit is set.
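To make that concrete, the guard can be modeled in a toy userspace sketch (hypothetical names and fields, not the kernel code): with either the dirty or the writeback bit set, the release path always refuses:

```c
#include <assert.h>
#include <stdbool.h>

#define EB_DIRTY     0x1u
#define EB_WRITEBACK 0x2u

/* Illustrative stand-in for the extent_buffer fields involved. */
struct eb_state {
	unsigned int bflags;
	int io_pages;
	int refs;
};

/* mirrors extent_buffer_under_io: dirty, writeback, or pages in flight */
static bool under_io(const struct eb_state *eb)
{
	return (eb->bflags & (EB_DIRTY | EB_WRITEBACK)) || eb->io_pages > 0;
}

/* mirrors the guard in try_release_extent_buffer (checked under refs_lock) */
static bool try_release(struct eb_state *eb)
{
	if (eb->refs > 1 || under_io(eb))
		return false;   /* releasepage must back off */
	return true;            /* last ref and no io: safe to drop pages */
}
```

With the writeback bit set before write_one_eb runs, try_release always refuses, which is why the write-path hunk would be a safety guard rather than a fix for an observed race.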

Is this purely a safety guard that is being introduced?

Can we at least describe in the changelog why we are adding this hunk
in the write path?
All it mentions is a race between reading and releasing pages, there's
nothing mentioned about races with writeback.

Thanks.

>
> -chris



-- 
Filipe David Manana,

“Whether you think you can, or you think you can't — you're right.”


* Re: [PATCH btrfs/for-next] btrfs: fix fatal extent_buffer readahead vs releasepage race
  2020-06-17 18:11     ` Filipe Manana
@ 2020-06-17 18:19       ` Josef Bacik
  2020-06-17 18:26         ` Filipe Manana
  0 siblings, 1 reply; 12+ messages in thread
From: Josef Bacik @ 2020-06-17 18:19 UTC (permalink / raw)
  To: fdmanana, Chris Mason
  Cc: Boris Burkov, David Sterba, linux-btrfs,
	Linux Kernel Mailing List, kernel-team

On 6/17/20 2:11 PM, Filipe Manana wrote:
> On Wed, Jun 17, 2020 at 6:43 PM Chris Mason <clm@fb.com> wrote:
>>
>> On 17 Jun 2020, at 13:20, Filipe Manana wrote:
>>
>>> On Wed, Jun 17, 2020 at 5:32 PM Boris Burkov <boris@bur.io> wrote:
>>>
>>>> ---
>>>>   fs/btrfs/extent_io.c | 45
>>>> ++++++++++++++++++++++++++++----------------
>>>>   1 file changed, 29 insertions(+), 16 deletions(-)
>>>>
>>>> diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
>>>> index c59e07360083..f6758ebbb6a2 100644
>>>> --- a/fs/btrfs/extent_io.c
>>>> +++ b/fs/btrfs/extent_io.c
>>>> @@ -3927,6 +3927,11 @@ static noinline_for_stack int
>>>> write_one_eb(struct extent_buffer *eb,
>>>>          clear_bit(EXTENT_BUFFER_WRITE_ERR, &eb->bflags);
>>>>          num_pages = num_extent_pages(eb);
>>>>          atomic_set(&eb->io_pages, num_pages);
>>>> +       /*
>>>> +        * It is possible for releasepage to clear the TREE_REF bit
>>>> before we
>>>> +        * set io_pages. See check_buffer_tree_ref for a more
>>>> detailed comment.
>>>> +        */
>>>> +       check_buffer_tree_ref(eb);
>>>
>>> This is a whole different case from the one described in the
>>> changelog, as this is in the write path.
>>> Why do we need this one?
>>
>> This was Josef’s idea, but I really like the symmetry.  You set
>> io_pages, you do the tree_ref dance.  Everyone fiddling with the write
>> back bit right now correctly clears writeback after doing the atomic_dec
>> on io_pages, but the race is tiny and prone to getting exposed again by
>> shifting code around.  Tree ref checks around io_pages are the most
>> reliable way to prevent this bug from coming back again later.
> 
> Ok, but that still doesn't answer my question.
> Is there an actual race/problem this hunk solves?
> 
> Before calling write_one_eb() we increment the ref on the eb and we
> also call lock_extent_buffer_for_io(),
> which clears the dirty bit and sets the writeback bit on the eb while
> holding its refs_lock spin_lock.
>
> Even if we get to try_release_extent_buffer, it calls
> extent_buffer_under_io(eb) while holding the refs_lock spin_lock,
> so at any time it should return true, as either the dirty or the
> writeback bit is set.
> 
> Is this purely a safety guard that is being introduced?
> 
> Can we at least describe in the changelog why we are adding this hunk
> in the write path?
> All it mentions is a race between reading and releasing pages, there's
> nothing mentioned about races with writeback.
> 

I think maybe we make that bit a separate patch, and leave the panic fix on its 
own.  I suggested this because I have lofty ideas of killing the refs_lock, and 
this would at least keep us consistent in our usage of TREE_REF to save us from 
freeing stuff that may be currently under IO.

I'm super not happy with our reference handling coupled with releasepage, but 
these are the kind of hoops we're going to have to jump through until we have 
some better infrastructure in place to handle metadata.  Thanks,

Josef


* Re: [PATCH btrfs/for-next] btrfs: fix fatal extent_buffer readahead vs releasepage race
  2020-06-17 18:19       ` Josef Bacik
@ 2020-06-17 18:26         ` Filipe Manana
  2020-06-17 18:35           ` [PATCH v2 " Boris Burkov
  0 siblings, 1 reply; 12+ messages in thread
From: Filipe Manana @ 2020-06-17 18:26 UTC (permalink / raw)
  To: Josef Bacik
  Cc: Chris Mason, Boris Burkov, David Sterba, linux-btrfs,
	Linux Kernel Mailing List, kernel-team

On Wed, Jun 17, 2020 at 7:19 PM Josef Bacik <josef@toxicpanda.com> wrote:
>
> On 6/17/20 2:11 PM, Filipe Manana wrote:
> > On Wed, Jun 17, 2020 at 6:43 PM Chris Mason <clm@fb.com> wrote:
> >>
> >> On 17 Jun 2020, at 13:20, Filipe Manana wrote:
> >>
> >>> On Wed, Jun 17, 2020 at 5:32 PM Boris Burkov <boris@bur.io> wrote:
> >>>
> >>>> ---
> >>>>   fs/btrfs/extent_io.c | 45
> >>>> ++++++++++++++++++++++++++++----------------
> >>>>   1 file changed, 29 insertions(+), 16 deletions(-)
> >>>>
> >>>> diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
> >>>> index c59e07360083..f6758ebbb6a2 100644
> >>>> --- a/fs/btrfs/extent_io.c
> >>>> +++ b/fs/btrfs/extent_io.c
> >>>> @@ -3927,6 +3927,11 @@ static noinline_for_stack int
> >>>> write_one_eb(struct extent_buffer *eb,
> >>>>          clear_bit(EXTENT_BUFFER_WRITE_ERR, &eb->bflags);
> >>>>          num_pages = num_extent_pages(eb);
> >>>>          atomic_set(&eb->io_pages, num_pages);
> >>>> +       /*
> >>>> +        * It is possible for releasepage to clear the TREE_REF bit
> >>>> before we
> >>>> +        * set io_pages. See check_buffer_tree_ref for a more
> >>>> detailed comment.
> >>>> +        */
> >>>> +       check_buffer_tree_ref(eb);
> >>>
> >>> This is a whole different case from the one described in the
> >>> changelog, as this is in the write path.
> >>> Why do we need this one?
> >>
> >> This was Josef’s idea, but I really like the symmetry.  You set
> >> io_pages, you do the tree_ref dance.  Everyone fiddling with the write
> >> back bit right now correctly clears writeback after doing the atomic_dec
> >> on io_pages, but the race is tiny and prone to getting exposed again by
> >> shifting code around.  Tree ref checks around io_pages are the most
> >> reliable way to prevent this bug from coming back again later.
> >
> > Ok, but that still doesn't answer my question.
> > Is there an actual race/problem this hunk solves?
> >
> > Before calling write_one_eb() we increment the ref on the eb and we
> > also call lock_extent_buffer_for_io(),
> > which clears the dirty bit and sets the writeback bit on the eb while
> > holding its refs_lock spin_lock.
> >
> > Even if we get to try_release_extent_buffer, it calls
> > extent_buffer_under_io(eb) while holding the refs_lock spin_lock,
> > so at any time it should return true, as either the dirty or the
> > writeback bit is set.
> >
> > Is this purely a safety guard that is being introduced?
> >
> > Can we at least describe in the changelog why we are adding this hunk
> > in the write path?
> > All it mentions is a race between reading and releasing pages, there's
> > nothing mentioned about races with writeback.
> >
>
> I think maybe we make that bit a separate patch, and leave the panic fix on its
> own.  I suggested this because I have lofty ideas of killing the refs_lock, and
> this would at least keep us consistent in our usage of TREE_REF to save us from
> freeing stuff that may be currently under IO.
>
> I'm super not happy with our reference handling coupled with releasepage, but
> these are the kind of hoops we're going to have to jump through until we have
> some better infrastructure in place to handle metadata.  Thanks,

Ok, so it's just a safety guard then.
Yes, either a separate patch or at the very least mention why we are
adding that in the change log.

Thanks.

>
> Josef



-- 
Filipe David Manana,

“Whether you think you can, or you think you can't — you're right.”


* [PATCH v2 btrfs/for-next] btrfs: fix fatal extent_buffer readahead vs releasepage race
  2020-06-17 18:26         ` Filipe Manana
@ 2020-06-17 18:35           ` Boris Burkov
  2020-06-17 18:51             ` Filipe Manana
  2020-06-18 14:46             ` David Sterba
  0 siblings, 2 replies; 12+ messages in thread
From: Boris Burkov @ 2020-06-17 18:35 UTC (permalink / raw)
  To: Chris Mason, Josef Bacik, David Sterba
  Cc: linux-btrfs, linux-kernel, kernel-team

Under somewhat convoluted conditions, it is possible to attempt to
release an extent_buffer that is under io, which triggers a BUG_ON in
btrfs_release_extent_buffer_pages.

This relies on a few different factors. First, extent_buffer reads done
as readahead for searching use WAIT_NONE, so they free the local extent
buffer reference while the io is outstanding. However, they should still
be protected by TREE_REF. If the system is doing significant reclaim
while simultaneously heavily accessing the extent_buffers, it is
possible for releasepage to race with two concurrent readahead attempts
in a way that leaves TREE_REF unset when the readahead extent buffer is
released.

Essentially, if two tasks race to allocate a new extent_buffer, but the
winner that attempts the first io is rebuffed by a page being locked
(likely by the reclaim itself), then the loser will still go ahead with
issuing the readahead. The loser's call to find_extent_buffer must also
race with the reclaim task reading the extent_buffer's refcount as 1 in
a way that allows the reclaim to re-clear the TREE_REF checked by
find_extent_buffer.

The following represents an example execution demonstrating the race:

            CPU0                                                         CPU1                                           CPU2
reada_for_search                                            reada_for_search
  readahead_tree_block                                        readahead_tree_block
    find_create_tree_block                                      find_create_tree_block
      alloc_extent_buffer                                         alloc_extent_buffer
                                                                  find_extent_buffer // not found
                                                                  allocates eb
                                                                  lock pages
                                                                  associate pages to eb
                                                                  insert eb into radix tree
                                                                  set TREE_REF, refs == 2
                                                                  unlock pages
                                                              read_extent_buffer_pages // WAIT_NONE
                                                                not uptodate (brand new eb)
                                                                                                            lock_page
                                                                if !trylock_page
                                                                  goto unlock_exit // not an error
                                                              free_extent_buffer
                                                                release_extent_buffer
                                                                  atomic_dec_and_test refs to 1
        find_extent_buffer // found
                                                                                                            try_release_extent_buffer
                                                                                                              take refs_lock
                                                                                                              reads refs == 1; no io
          atomic_inc_not_zero refs to 2
          mark_buffer_accessed
            check_buffer_tree_ref
              // not STALE, won't take refs_lock
              refs == 2; TREE_REF set // no action
    read_extent_buffer_pages // WAIT_NONE
                                                                                                              clear TREE_REF
                                                                                                              release_extent_buffer
                                                                                                                atomic_dec_and_test refs to 1
                                                                                                                unlock_page
      still not uptodate (CPU1 read failed on trylock_page)
      locks pages
      set io_pages > 0
      submit io
      return
    free_extent_buffer
      release_extent_buffer
        dec refs to 0
        delete from radix tree
        btrfs_release_extent_buffer_pages
          BUG_ON(io_pages > 0)!!!

We observe this at a very low rate in production and were also able to
reproduce it in a test environment by introducing spurious delays and
probabilistic trylock_page failures.

To fix it, we apply check_buffer_tree_ref at a point where it could not
possibly be unset by a competing task: after io_pages has been
incremented. All the codepaths that clear TREE_REF check for io, so they
would not be able to clear it after this point until the io is done.
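The refcounting pattern the fix relies on can be sketched with C11 atomics in a userspace model (simplified and hypothetical — the real check_buffer_tree_ref also handles the STALE bit under refs_lock): bump the refcount before setting the bit, and drop the extra ref if another task beat us to it:

```c
#include <assert.h>
#include <stdatomic.h>

#define TREE_REF 0x1u

/* Minimal stand-in for the eb fields involved; not kernel code. */
struct eb_model {
	atomic_int refs;
	atomic_uint bflags;
};

static void check_tree_ref_model(struct eb_model *eb)
{
	int refs = atomic_load(&eb->refs);

	/* fast path: ref already held and bit already set, nothing to do */
	if (refs >= 2 && (atomic_load(&eb->bflags) & TREE_REF))
		return;

	/* bump the ref first, then set the bit; if someone else set the
	 * bit meanwhile, drop the ref we just added */
	atomic_fetch_add(&eb->refs, 1);
	if (atomic_fetch_or(&eb->bflags, TREE_REF) & TREE_REF)
		atomic_fetch_sub(&eb->refs, 1);
}
```

Calling this after io_pages is raised models the fix: at that point no clearing path can run, so the ref/bit pair stays intact for the duration of the io.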

stack trace, for reference:
[1417839.424739] ------------[ cut here ]------------
[1417839.435328] kernel BUG at fs/btrfs/extent_io.c:4841!
[1417839.447024] invalid opcode: 0000 [#1] SMP
[1417839.502972] RIP: 0010:btrfs_release_extent_buffer_pages+0x20/0x1f0
[1417839.517008] Code: ed e9 ...
[1417839.558895] RSP: 0018:ffffc90020bcf798 EFLAGS: 00010202
[1417839.570816] RAX: 0000000000000002 RBX: ffff888102d6def0 RCX: 0000000000000028
[1417839.586962] RDX: 0000000000000002 RSI: ffff8887f0296482 RDI: ffff888102d6def0
[1417839.603108] RBP: ffff88885664a000 R08: 0000000000000046 R09: 0000000000000238
[1417839.619255] R10: 0000000000000028 R11: ffff88885664af68 R12: 0000000000000000
[1417839.635402] R13: 0000000000000000 R14: ffff88875f573ad0 R15: ffff888797aafd90
[1417839.651549] FS:  00007f5a844fa700(0000) GS:ffff88885f680000(0000) knlGS:0000000000000000
[1417839.669810] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[1417839.682887] CR2: 00007f7884541fe0 CR3: 000000049f609002 CR4: 00000000003606e0
[1417839.699037] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[1417839.715187] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[1417839.731320] Call Trace:
[1417839.737103]  release_extent_buffer+0x39/0x90
[1417839.746913]  read_block_for_search.isra.38+0x2a3/0x370
[1417839.758645]  btrfs_search_slot+0x260/0x9b0
[1417839.768054]  btrfs_lookup_file_extent+0x4a/0x70
[1417839.778427]  btrfs_get_extent+0x15f/0x830
[1417839.787665]  ? submit_extent_page+0xc4/0x1c0
[1417839.797474]  ? __do_readpage+0x299/0x7a0
[1417839.806515]  __do_readpage+0x33b/0x7a0
[1417839.815171]  ? btrfs_releasepage+0x70/0x70
[1417839.824597]  extent_readpages+0x28f/0x400
[1417839.833836]  read_pages+0x6a/0x1c0
[1417839.841729]  ? startup_64+0x2/0x30
[1417839.849624]  __do_page_cache_readahead+0x13c/0x1a0
[1417839.860590]  filemap_fault+0x6c7/0x990
[1417839.869252]  ? xas_load+0x8/0x80
[1417839.876756]  ? xas_find+0x150/0x190
[1417839.884839]  ? filemap_map_pages+0x295/0x3b0
[1417839.894652]  __do_fault+0x32/0x110
[1417839.902540]  __handle_mm_fault+0xacd/0x1000
[1417839.912156]  handle_mm_fault+0xaa/0x1c0
[1417839.921004]  __do_page_fault+0x242/0x4b0
[1417839.930044]  ? page_fault+0x8/0x30
[1417839.937933]  page_fault+0x1e/0x30
[1417839.945631] RIP: 0033:0x33c4bae
[1417839.952927] Code: Bad RIP value.
[1417839.960411] RSP: 002b:00007f5a844f7350 EFLAGS: 00010206
[1417839.972331] RAX: 000000000000006e RBX: 1614b3ff6a50398a RCX: 0000000000000000
[1417839.988477] RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000000000002
[1417840.004626] RBP: 00007f5a844f7420 R08: 000000000000006e R09: 00007f5a94aeccb8
[1417840.020784] R10: 00007f5a844f7350 R11: 0000000000000000 R12: 00007f5a94aecc79
[1417840.036932] R13: 00007f5a94aecc78 R14: 00007f5a94aecc90 R15: 00007f5a94aecc40

Signed-off-by: Boris Burkov <boris@bur.io>
---
 fs/btrfs/extent_io.c | 40 ++++++++++++++++++++++++----------------
 1 file changed, 24 insertions(+), 16 deletions(-)

diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
index c59e07360083..95313bb7fe40 100644
--- a/fs/btrfs/extent_io.c
+++ b/fs/btrfs/extent_io.c
@@ -5086,25 +5086,28 @@ struct extent_buffer *alloc_dummy_extent_buffer(struct btrfs_fs_info *fs_info,
 static void check_buffer_tree_ref(struct extent_buffer *eb)
 {
 	int refs;
-	/* the ref bit is tricky.  We have to make sure it is set
-	 * if we have the buffer dirty.   Otherwise the
-	 * code to free a buffer can end up dropping a dirty
-	 * page
+	/*
+	 * The TREE_REF bit is first set when the extent_buffer is added
+	 * to the radix tree. It is also reset, if unset, when a new reference
+	 * is created by find_extent_buffer.
 	 *
-	 * Once the ref bit is set, it won't go away while the
-	 * buffer is dirty or in writeback, and it also won't
-	 * go away while we have the reference count on the
-	 * eb bumped.
+	 * It is only cleared in two cases: freeing the last non-tree
+	 * reference to the extent_buffer when its STALE bit is set or
+	 * calling releasepage when the tree reference is the only reference.
 	 *
-	 * We can't just set the ref bit without bumping the
-	 * ref on the eb because free_extent_buffer might
-	 * see the ref bit and try to clear it.  If this happens
-	 * free_extent_buffer might end up dropping our original
-	 * ref by mistake and freeing the page before we are able
-	 * to add one more ref.
+	 * In both cases, care is taken to ensure that the extent_buffer's
+	 * pages are not under io. However, releasepage can be concurrently
+	 * called with creating new references, which is prone to race
+	 * conditions between the calls to check_buffer_tree_ref in those
+	 * codepaths and clearing TREE_REF in try_release_extent_buffer.
 	 *
-	 * So bump the ref count first, then set the bit.  If someone
-	 * beat us to it, drop the ref we added.
+	 * The actual lifetime of the extent_buffer in the radix tree is
+	 * adequately protected by the refcount, but the TREE_REF bit and
+	 * its corresponding reference are not. To protect against this
+	 * class of races, we call check_buffer_tree_ref from the codepaths
+	 * which trigger io after they set eb->io_pages. Note that once io is
+	 * initiated, TREE_REF can no longer be cleared, so that is the
+	 * moment at which any such race is best fixed.
 	 */
 	refs = atomic_read(&eb->refs);
 	if (refs >= 2 && test_bit(EXTENT_BUFFER_TREE_REF, &eb->bflags))
@@ -5555,6 +5558,11 @@ int read_extent_buffer_pages(struct extent_buffer *eb, int wait, int mirror_num)
 	clear_bit(EXTENT_BUFFER_READ_ERR, &eb->bflags);
 	eb->read_mirror = 0;
 	atomic_set(&eb->io_pages, num_reads);
+	/*
+	 * It is possible for releasepage to clear the TREE_REF bit before we
+	 * set io_pages. See check_buffer_tree_ref for a more detailed comment.
+	 */
+	check_buffer_tree_ref(eb);
 	for (i = 0; i < num_pages; i++) {
 		page = eb->pages[i];
 
-- 
2.24.1



* Re: [PATCH v2 btrfs/for-next] btrfs: fix fatal extent_buffer readahead vs releasepage race
  2020-06-17 18:35           ` [PATCH v2 " Boris Burkov
@ 2020-06-17 18:51             ` Filipe Manana
  2020-06-18 14:46             ` David Sterba
  1 sibling, 0 replies; 12+ messages in thread
From: Filipe Manana @ 2020-06-17 18:51 UTC (permalink / raw)
  To: Boris Burkov
  Cc: Chris Mason, Josef Bacik, David Sterba, linux-btrfs,
	Linux Kernel Mailing List, kernel-team

On Wed, Jun 17, 2020 at 7:39 PM Boris Burkov <boris@bur.io> wrote:
>
> Under somewhat convoluted conditions, it is possible to attempt to
> release an extent_buffer that is under io, which triggers a BUG_ON in
> btrfs_release_extent_buffer_pages.
>
> This relies on a few different factors. First, extent_buffer reads done
> as readahead for searching use WAIT_NONE, so they free the local extent
> buffer reference while the io is outstanding. However, they should still
> be protected by TREE_REF. If the system is doing significant reclaim
> while simultaneously heavily accessing the extent_buffers, it is
> possible for releasepage to race with two concurrent readahead attempts
> in a way that leaves TREE_REF unset when the readahead extent buffer is
> released.
>
> Essentially, if two tasks race to allocate a new extent_buffer, but the
> winner that attempts the first io is rebuffed by a page being locked
> (likely by the reclaim itself), then the loser will still go ahead with
> issuing the readahead. The loser's call to find_extent_buffer must also
> race with the reclaim task reading the extent_buffer's refcount as 1 in
> a way that allows the reclaim to re-clear the TREE_REF checked by
> find_extent_buffer.
>
> The following represents an example execution demonstrating the race:
>
>             CPU0                                                         CPU1                                           CPU2
> reada_for_search                                            reada_for_search
>   readahead_tree_block                                        readahead_tree_block
>     find_create_tree_block                                      find_create_tree_block
>       alloc_extent_buffer                                         alloc_extent_buffer
>                                                                   find_extent_buffer // not found
>                                                                   allocates eb
>                                                                   lock pages
>                                                                   associate pages to eb
>                                                                   insert eb into radix tree
>                                                                   set TREE_REF, refs == 2
>                                                                   unlock pages
>                                                               read_extent_buffer_pages // WAIT_NONE
>                                                                 not uptodate (brand new eb)
>                                                                                                             lock_page
>                                                                 if !trylock_page
>                                                                   goto unlock_exit // not an error
>                                                               free_extent_buffer
>                                                                 release_extent_buffer
>                                                                   atomic_dec_and_test refs to 1
>         find_extent_buffer // found
>                                                                                                             try_release_extent_buffer
>                                                                                                               take refs_lock
>                                                                                                               reads refs == 1; no io
>           atomic_inc_not_zero refs to 2
>           mark_buffer_accessed
>             check_buffer_tree_ref
>               // not STALE, won't take refs_lock
>               refs == 2; TREE_REF set // no action
>     read_extent_buffer_pages // WAIT_NONE
>                                                                                                               clear TREE_REF
>                                                                                                               release_extent_buffer
>                                                                                                                 atomic_dec_and_test refs to 1
>                                                                                                                 unlock_page
>       still not uptodate (CPU1 read failed on trylock_page)
>       locks pages
>       set io_pages > 0
>       submit io
>       return
>     free_extent_buffer
>       release_extent_buffer
>         dec refs to 0
>         delete from radix tree
>         btrfs_release_extent_buffer_pages
>           BUG_ON(io_pages > 0)!!!
>
> We observe this at a very low rate in production and were also able to
> reproduce it in a test environment by introducing spurious delays and
> probabilistic trylock_page failures.
>
> To fix it, we apply check_buffer_tree_ref at a point where it could not
> possibly be unset by a competing task: after io_pages has been
> incremented. All the codepaths that clear TREE_REF check for io, so they
> would not be able to clear it after this point until the io is done.
>
> stack trace, for reference:
> [1417839.424739] ------------[ cut here ]------------
> [1417839.435328] kernel BUG at fs/btrfs/extent_io.c:4841!
> [1417839.447024] invalid opcode: 0000 [#1] SMP
> [1417839.502972] RIP: 0010:btrfs_release_extent_buffer_pages+0x20/0x1f0
> [1417839.517008] Code: ed e9 ...
> [1417839.558895] RSP: 0018:ffffc90020bcf798 EFLAGS: 00010202
> [1417839.570816] RAX: 0000000000000002 RBX: ffff888102d6def0 RCX: 0000000000000028
> [1417839.586962] RDX: 0000000000000002 RSI: ffff8887f0296482 RDI: ffff888102d6def0
> [1417839.603108] RBP: ffff88885664a000 R08: 0000000000000046 R09: 0000000000000238
> [1417839.619255] R10: 0000000000000028 R11: ffff88885664af68 R12: 0000000000000000
> [1417839.635402] R13: 0000000000000000 R14: ffff88875f573ad0 R15: ffff888797aafd90
> [1417839.651549] FS:  00007f5a844fa700(0000) GS:ffff88885f680000(0000) knlGS:0000000000000000
> [1417839.669810] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> [1417839.682887] CR2: 00007f7884541fe0 CR3: 000000049f609002 CR4: 00000000003606e0
> [1417839.699037] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
> [1417839.715187] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
> [1417839.731320] Call Trace:
> [1417839.737103]  release_extent_buffer+0x39/0x90
> [1417839.746913]  read_block_for_search.isra.38+0x2a3/0x370
> [1417839.758645]  btrfs_search_slot+0x260/0x9b0
> [1417839.768054]  btrfs_lookup_file_extent+0x4a/0x70
> [1417839.778427]  btrfs_get_extent+0x15f/0x830
> [1417839.787665]  ? submit_extent_page+0xc4/0x1c0
> [1417839.797474]  ? __do_readpage+0x299/0x7a0
> [1417839.806515]  __do_readpage+0x33b/0x7a0
> [1417839.815171]  ? btrfs_releasepage+0x70/0x70
> [1417839.824597]  extent_readpages+0x28f/0x400
> [1417839.833836]  read_pages+0x6a/0x1c0
> [1417839.841729]  ? startup_64+0x2/0x30
> [1417839.849624]  __do_page_cache_readahead+0x13c/0x1a0
> [1417839.860590]  filemap_fault+0x6c7/0x990
> [1417839.869252]  ? xas_load+0x8/0x80
> [1417839.876756]  ? xas_find+0x150/0x190
> [1417839.884839]  ? filemap_map_pages+0x295/0x3b0
> [1417839.894652]  __do_fault+0x32/0x110
> [1417839.902540]  __handle_mm_fault+0xacd/0x1000
> [1417839.912156]  handle_mm_fault+0xaa/0x1c0
> [1417839.921004]  __do_page_fault+0x242/0x4b0
> [1417839.930044]  ? page_fault+0x8/0x30
> [1417839.937933]  page_fault+0x1e/0x30
> [1417839.945631] RIP: 0033:0x33c4bae
> [1417839.952927] Code: Bad RIP value.
> [1417839.960411] RSP: 002b:00007f5a844f7350 EFLAGS: 00010206
> [1417839.972331] RAX: 000000000000006e RBX: 1614b3ff6a50398a RCX: 0000000000000000
> [1417839.988477] RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000000000002
> [1417840.004626] RBP: 00007f5a844f7420 R08: 000000000000006e R09: 00007f5a94aeccb8
> [1417840.020784] R10: 00007f5a844f7350 R11: 0000000000000000 R12: 00007f5a94aecc79
> [1417840.036932] R13: 00007f5a94aecc78 R14: 00007f5a94aecc90 R15: 00007f5a94aecc40
>
> Signed-off-by: Boris Burkov <boris@bur.io>

Reviewed-by: Filipe Manana <fdmanana@suse.com>

Looks good, thanks!

> ---
>  fs/btrfs/extent_io.c | 40 ++++++++++++++++++++++++----------------
>  1 file changed, 24 insertions(+), 16 deletions(-)
>
> diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
> index c59e07360083..95313bb7fe40 100644
> --- a/fs/btrfs/extent_io.c
> +++ b/fs/btrfs/extent_io.c
> @@ -5086,25 +5086,28 @@ struct extent_buffer *alloc_dummy_extent_buffer(struct btrfs_fs_info *fs_info,
>  static void check_buffer_tree_ref(struct extent_buffer *eb)
>  {
>         int refs;
> -       /* the ref bit is tricky.  We have to make sure it is set
> -        * if we have the buffer dirty.   Otherwise the
> -        * code to free a buffer can end up dropping a dirty
> -        * page
> +       /*
> +        * The TREE_REF bit is first set when the extent_buffer is added
> +        * to the radix tree. It is also reset, if unset, when a new reference
> +        * is created by find_extent_buffer.
>          *
> -        * Once the ref bit is set, it won't go away while the
> -        * buffer is dirty or in writeback, and it also won't
> -        * go away while we have the reference count on the
> -        * eb bumped.
> +        * It is only cleared in two cases: freeing the last non-tree
> +        * reference to the extent_buffer when its STALE bit is set or
> +        * calling releasepage when the tree reference is the only reference.
>          *
> -        * We can't just set the ref bit without bumping the
> -        * ref on the eb because free_extent_buffer might
> -        * see the ref bit and try to clear it.  If this happens
> -        * free_extent_buffer might end up dropping our original
> -        * ref by mistake and freeing the page before we are able
> -        * to add one more ref.
> +        * In both cases, care is taken to ensure that the extent_buffer's
> +        * pages are not under io. However, releasepage can be concurrently
> +        * called with creating new references, which is prone to race
> +        * conditions between the calls to check_buffer_tree_ref in those
> +        * codepaths and clearing TREE_REF in try_release_extent_buffer.
>          *
> -        * So bump the ref count first, then set the bit.  If someone
> -        * beat us to it, drop the ref we added.
> +        * The actual lifetime of the extent_buffer in the radix tree is
> +        * adequately protected by the refcount, but the TREE_REF bit and
> +        * its corresponding reference are not. To protect against this
> +        * class of races, we call check_buffer_tree_ref from the codepaths
> +        * which trigger io after they set eb->io_pages. Note that once io is
> +        * initiated, TREE_REF can no longer be cleared, so that is the
> +        * moment at which any such race is best fixed.
>          */
>         refs = atomic_read(&eb->refs);
>         if (refs >= 2 && test_bit(EXTENT_BUFFER_TREE_REF, &eb->bflags))
> @@ -5555,6 +5558,11 @@ int read_extent_buffer_pages(struct extent_buffer *eb, int wait, int mirror_num)
>         clear_bit(EXTENT_BUFFER_READ_ERR, &eb->bflags);
>         eb->read_mirror = 0;
>         atomic_set(&eb->io_pages, num_reads);
> +       /*
> +        * It is possible for releasepage to clear the TREE_REF bit before we
> +        * set io_pages. See check_buffer_tree_ref for a more detailed comment.
> +        */
> +       check_buffer_tree_ref(eb);
>         for (i = 0; i < num_pages; i++) {
>                 page = eb->pages[i];
>
> --
> 2.24.1
>


-- 
Filipe David Manana,

“Whether you think you can, or you think you can't — you're right.”

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [PATCH btrfs/for-next] btrfs: fix fatal extent_buffer readahead vs releasepage race
  2020-06-17 16:27 [PATCH btrfs/for-next] btrfs: fix fatal extent_buffer readahead vs releasepage race Boris Burkov
  2020-06-17 17:20 ` Filipe Manana
@ 2020-06-17 21:06 ` kernel test robot
  2020-06-17 22:38 ` kernel test robot
  2 siblings, 0 replies; 12+ messages in thread
From: kernel test robot @ 2020-06-17 21:06 UTC (permalink / raw)
  To: Boris Burkov, Chris Mason, Josef Bacik, David Sterba
  Cc: kbuild-all, linux-btrfs, linux-kernel, kernel-team, Boris Burkov

[-- Attachment #1: Type: text/plain, Size: 4506 bytes --]

Hi Boris,

Thank you for the patch! Yet something to improve:

[auto build test ERROR on kdave/for-next]
[also build test ERROR on v5.8-rc1 next-20200617]
[if your patch is applied to the wrong git tree, please drop us a note to help
improve the system. BTW, we also suggest to use '--base' option to specify the
base tree in git format-patch, please see https://stackoverflow.com/a/37406982]

url:    https://github.com/0day-ci/linux/commits/Boris-Burkov/btrfs-fix-fatal-extent_buffer-readahead-vs-releasepage-race/20200618-002941
base:   https://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux.git for-next
config: x86_64-rhel (attached as .config)
compiler: gcc-9 (Debian 9.3.0-13) 9.3.0
reproduce (this is a W=1 build):
        # save the attached .config to linux build tree
        make W=1 ARCH=x86_64 

If you fix the issue, kindly add following tag as appropriate
Reported-by: kernel test robot <lkp@intel.com>

All error/warnings (new ones prefixed by >>, old ones prefixed by <<):

In file included from fs/btrfs/extent_io.c:19:
fs/btrfs/ctree.h:2216:8: warning: type qualifiers ignored on function return type [-Wignored-qualifiers]
2216 | size_t __const btrfs_get_num_csums(void);
|        ^~~~~~~
fs/btrfs/extent_io.c: In function 'write_one_eb':
>> fs/btrfs/extent_io.c:3934:2: error: implicit declaration of function 'check_buffer_tree_ref' [-Werror=implicit-function-declaration]
3934 |  check_buffer_tree_ref(eb);
|  ^~~~~~~~~~~~~~~~~~~~~
fs/btrfs/extent_io.c: At top level:
>> fs/btrfs/extent_io.c:5091:13: warning: conflicting types for 'check_buffer_tree_ref'
5091 | static void check_buffer_tree_ref(struct extent_buffer *eb)
|             ^~~~~~~~~~~~~~~~~~~~~
>> fs/btrfs/extent_io.c:5091:13: error: static declaration of 'check_buffer_tree_ref' follows non-static declaration
fs/btrfs/extent_io.c:3934:2: note: previous implicit declaration of 'check_buffer_tree_ref' was here
3934 |  check_buffer_tree_ref(eb);
|  ^~~~~~~~~~~~~~~~~~~~~
cc1: some warnings being treated as errors

vim +/check_buffer_tree_ref +3934 fs/btrfs/extent_io.c

  3915	
  3916	static noinline_for_stack int write_one_eb(struct extent_buffer *eb,
  3917				struct writeback_control *wbc,
  3918				struct extent_page_data *epd)
  3919	{
  3920		u64 offset = eb->start;
  3921		u32 nritems;
  3922		int i, num_pages;
  3923		unsigned long start, end;
  3924		unsigned int write_flags = wbc_to_write_flags(wbc) | REQ_META;
  3925		int ret = 0;
  3926	
  3927		clear_bit(EXTENT_BUFFER_WRITE_ERR, &eb->bflags);
  3928		num_pages = num_extent_pages(eb);
  3929		atomic_set(&eb->io_pages, num_pages);
  3930		/*
  3931		 * It is possible for releasepage to clear the TREE_REF bit before we
  3932		 * set io_pages. See check_buffer_tree_ref for a more detailed comment.
  3933		 */
> 3934		check_buffer_tree_ref(eb);
  3935	
  3936		/* set btree blocks beyond nritems with 0 to avoid stale content. */
  3937		nritems = btrfs_header_nritems(eb);
  3938		if (btrfs_header_level(eb) > 0) {
  3939			end = btrfs_node_key_ptr_offset(nritems);
  3940	
  3941			memzero_extent_buffer(eb, end, eb->len - end);
  3942		} else {
  3943			/*
  3944			 * leaf:
  3945			 * header 0 1 2 .. N ... data_N .. data_2 data_1 data_0
  3946			 */
  3947			start = btrfs_item_nr_offset(nritems);
  3948			end = BTRFS_LEAF_DATA_OFFSET + leaf_data_end(eb);
  3949			memzero_extent_buffer(eb, start, end - start);
  3950		}
  3951	
  3952		for (i = 0; i < num_pages; i++) {
  3953			struct page *p = eb->pages[i];
  3954	
  3955			clear_page_dirty_for_io(p);
  3956			set_page_writeback(p);
  3957			ret = submit_extent_page(REQ_OP_WRITE | write_flags, wbc,
  3958						 p, offset, PAGE_SIZE, 0,
  3959						 &epd->bio,
  3960						 end_bio_extent_buffer_writepage,
  3961						 0, 0, 0, false);
  3962			if (ret) {
  3963				set_btree_ioerr(p);
  3964				if (PageWriteback(p))
  3965					end_page_writeback(p);
  3966				if (atomic_sub_and_test(num_pages - i, &eb->io_pages))
  3967					end_extent_buffer_writeback(eb);
  3968				ret = -EIO;
  3969				break;
  3970			}
  3971			offset += PAGE_SIZE;
  3972			update_nr_written(wbc, 1);
  3973			unlock_page(p);
  3974		}
  3975	
  3976		if (unlikely(ret)) {
  3977			for (; i < num_pages; i++) {
  3978				struct page *p = eb->pages[i];
  3979				clear_page_dirty_for_io(p);
  3980				unlock_page(p);
  3981			}
  3982		}
  3983	
  3984		return ret;
  3985	}
  3986	

---
0-DAY CI Kernel Test Service, Intel Corporation
https://lists.01.org/hyperkitty/list/kbuild-all@lists.01.org

[-- Attachment #2: .config.gz --]
[-- Type: application/gzip, Size: 44863 bytes --]


* Re: [PATCH btrfs/for-next] btrfs: fix fatal extent_buffer readahead vs releasepage race
  2020-06-17 16:27 [PATCH btrfs/for-next] btrfs: fix fatal extent_buffer readahead vs releasepage race Boris Burkov
  2020-06-17 17:20 ` Filipe Manana
  2020-06-17 21:06 ` [PATCH " kernel test robot
@ 2020-06-17 22:38 ` kernel test robot
  2 siblings, 0 replies; 12+ messages in thread
From: kernel test robot @ 2020-06-17 22:38 UTC (permalink / raw)
  To: Boris Burkov, Chris Mason, Josef Bacik, David Sterba
  Cc: kbuild-all, clang-built-linux, linux-btrfs, linux-kernel,
	kernel-team, Boris Burkov

[-- Attachment #1: Type: text/plain, Size: 4558 bytes --]

Hi Boris,

Thank you for the patch! Yet something to improve:

[auto build test ERROR on kdave/for-next]
[also build test ERROR on v5.8-rc1 next-20200617]
[if your patch is applied to the wrong git tree, please drop us a note to help
improve the system. BTW, we also suggest to use '--base' option to specify the
base tree in git format-patch, please see https://stackoverflow.com/a/37406982]

url:    https://github.com/0day-ci/linux/commits/Boris-Burkov/btrfs-fix-fatal-extent_buffer-readahead-vs-releasepage-race/20200618-002941
base:   https://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux.git for-next
config: x86_64-allyesconfig (attached as .config)
compiler: clang version 11.0.0 (https://github.com/llvm/llvm-project 487ca07fcc75d52755c9fe2ee05bcb3b6eeeec44)
reproduce (this is a W=1 build):
        wget https://raw.githubusercontent.com/intel/lkp-tests/master/sbin/make.cross -O ~/bin/make.cross
        chmod +x ~/bin/make.cross
        # install x86_64 cross compiling tool for clang build
        # apt-get install binutils-x86-64-linux-gnu
        # save the attached .config to linux build tree
        COMPILER_INSTALL_PATH=$HOME/0day COMPILER=clang make.cross ARCH=x86_64 

If you fix the issue, kindly add following tag as appropriate
Reported-by: kernel test robot <lkp@intel.com>

All errors (new ones prefixed by >>, old ones prefixed by <<):

In file included from fs/btrfs/extent_io.c:19:
fs/btrfs/ctree.h:2216:8: warning: 'const' type qualifier on return type has no effect [-Wignored-qualifiers]
size_t __const btrfs_get_num_csums(void);
^~~~~~~~
>> fs/btrfs/extent_io.c:3934:2: error: implicit declaration of function 'check_buffer_tree_ref' [-Werror,-Wimplicit-function-declaration]
check_buffer_tree_ref(eb);
^
>> fs/btrfs/extent_io.c:5091:13: error: static declaration of 'check_buffer_tree_ref' follows non-static declaration
static void check_buffer_tree_ref(struct extent_buffer *eb)
^
fs/btrfs/extent_io.c:3934:2: note: previous implicit declaration is here
check_buffer_tree_ref(eb);
^
1 warning and 2 errors generated.

vim +/check_buffer_tree_ref +3934 fs/btrfs/extent_io.c

  3915	
  3916	static noinline_for_stack int write_one_eb(struct extent_buffer *eb,
  3917				struct writeback_control *wbc,
  3918				struct extent_page_data *epd)
  3919	{
  3920		u64 offset = eb->start;
  3921		u32 nritems;
  3922		int i, num_pages;
  3923		unsigned long start, end;
  3924		unsigned int write_flags = wbc_to_write_flags(wbc) | REQ_META;
  3925		int ret = 0;
  3926	
  3927		clear_bit(EXTENT_BUFFER_WRITE_ERR, &eb->bflags);
  3928		num_pages = num_extent_pages(eb);
  3929		atomic_set(&eb->io_pages, num_pages);
  3930		/*
  3931		 * It is possible for releasepage to clear the TREE_REF bit before we
  3932		 * set io_pages. See check_buffer_tree_ref for a more detailed comment.
  3933		 */
> 3934		check_buffer_tree_ref(eb);
  3935	
  3936		/* set btree blocks beyond nritems with 0 to avoid stale content. */
  3937		nritems = btrfs_header_nritems(eb);
  3938		if (btrfs_header_level(eb) > 0) {
  3939			end = btrfs_node_key_ptr_offset(nritems);
  3940	
  3941			memzero_extent_buffer(eb, end, eb->len - end);
  3942		} else {
  3943			/*
  3944			 * leaf:
  3945			 * header 0 1 2 .. N ... data_N .. data_2 data_1 data_0
  3946			 */
  3947			start = btrfs_item_nr_offset(nritems);
  3948			end = BTRFS_LEAF_DATA_OFFSET + leaf_data_end(eb);
  3949			memzero_extent_buffer(eb, start, end - start);
  3950		}
  3951	
  3952		for (i = 0; i < num_pages; i++) {
  3953			struct page *p = eb->pages[i];
  3954	
  3955			clear_page_dirty_for_io(p);
  3956			set_page_writeback(p);
  3957			ret = submit_extent_page(REQ_OP_WRITE | write_flags, wbc,
  3958						 p, offset, PAGE_SIZE, 0,
  3959						 &epd->bio,
  3960						 end_bio_extent_buffer_writepage,
  3961						 0, 0, 0, false);
  3962			if (ret) {
  3963				set_btree_ioerr(p);
  3964				if (PageWriteback(p))
  3965					end_page_writeback(p);
  3966				if (atomic_sub_and_test(num_pages - i, &eb->io_pages))
  3967					end_extent_buffer_writeback(eb);
  3968				ret = -EIO;
  3969				break;
  3970			}
  3971			offset += PAGE_SIZE;
  3972			update_nr_written(wbc, 1);
  3973			unlock_page(p);
  3974		}
  3975	
  3976		if (unlikely(ret)) {
  3977			for (; i < num_pages; i++) {
  3978				struct page *p = eb->pages[i];
  3979				clear_page_dirty_for_io(p);
  3980				unlock_page(p);
  3981			}
  3982		}
  3983	
  3984		return ret;
  3985	}
  3986	

---
0-DAY CI Kernel Test Service, Intel Corporation
https://lists.01.org/hyperkitty/list/kbuild-all@lists.01.org

[-- Attachment #2: .config.gz --]
[-- Type: application/gzip, Size: 73975 bytes --]


* Re: [PATCH v2 btrfs/for-next] btrfs: fix fatal extent_buffer readahead vs releasepage race
  2020-06-17 18:35           ` [PATCH v2 " Boris Burkov
  2020-06-17 18:51             ` Filipe Manana
@ 2020-06-18 14:46             ` David Sterba
  1 sibling, 0 replies; 12+ messages in thread
From: David Sterba @ 2020-06-18 14:46 UTC (permalink / raw)
  To: Boris Burkov
  Cc: Chris Mason, Josef Bacik, David Sterba, linux-btrfs,
	linux-kernel, kernel-team

On Wed, Jun 17, 2020 at 11:35:19AM -0700, Boris Burkov wrote:
[...]

> Signed-off-by: Boris Burkov <boris@bur.io>

Added to misc-next, thanks.


end of thread, other threads:[~2020-06-18 14:46 UTC | newest]

Thread overview: 12+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2020-06-17 16:27 [PATCH btrfs/for-next] btrfs: fix fatal extent_buffer readahead vs releasepage race Boris Burkov
2020-06-17 17:20 ` Filipe Manana
2020-06-17 17:36   ` Filipe Manana
2020-06-17 17:43   ` Chris Mason
2020-06-17 18:11     ` Filipe Manana
2020-06-17 18:19       ` Josef Bacik
2020-06-17 18:26         ` Filipe Manana
2020-06-17 18:35           ` [PATCH v2 " Boris Burkov
2020-06-17 18:51             ` Filipe Manana
2020-06-18 14:46             ` David Sterba
2020-06-17 21:06 ` [PATCH " kernel test robot
2020-06-17 22:38 ` kernel test robot
