All of lore.kernel.org
 help / color / mirror / Atom feed
From: Filipe Manana <fdmanana@gmail.com>
To: Josef Bacik <josef@toxicpanda.com>
Cc: linux-btrfs <linux-btrfs@vger.kernel.org>, kernel-team@fb.com
Subject: Re: [PATCH v3 4/5] btrfs: do not infinite loop in data reclaim if we aborted
Date: Tue, 28 Sep 2021 11:22:28 +0100	[thread overview]
Message-ID: <CAL3q7H5jsdCY51xNP2x6exu_pzDwJxvvBJsAce4ZzemzNzF=Zw@mail.gmail.com> (raw)
In-Reply-To: <03b2f64b3acc918b67c2fef6d4bfce70ab12ce3b.1632765815.git.josef@toxicpanda.com>

On Mon, Sep 27, 2021 at 7:07 PM Josef Bacik <josef@toxicpanda.com> wrote:
>
> Error injection stressing uncovered a busy loop in our data reclaim
> loop.  There are two cases here, one where we loop creating block groups
> until space_info->full is set, or in the main loop we will skip erroring
> out any tickets if space_info->full == 0.  Unfortunately if we aborted
> the transaction then we will never allocate chunks or reclaim any space
> and thus never get ->full, and you'll see stack traces like this
>
> watchdog: BUG: soft lockup - CPU#0 stuck for 26s! [kworker/u4:4:139]
> CPU: 0 PID: 139 Comm: kworker/u4:4 Tainted: G        W         5.13.0-rc1+ #328
> Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.13.0-2.fc32 04/01/2014
> Workqueue: events_unbound btrfs_async_reclaim_data_space
> RIP: 0010:btrfs_join_transaction+0x12/0x20
> RSP: 0018:ffffb2b780b77de0 EFLAGS: 00000246
> RAX: ffffb2b781863d58 RBX: 0000000000000000 RCX: 0000000000000000
> RDX: 0000000000000801 RSI: ffff987952b57400 RDI: ffff987940aa3000
> RBP: ffff987954d55000 R08: 0000000000000001 R09: ffff98795539e8f0
> R10: 000000000000000f R11: 000000000000000f R12: ffffffffffffffff
> R13: ffff987952b574c8 R14: ffff987952b57400 R15: 0000000000000008
> FS:  0000000000000000(0000) GS:ffff9879bbc00000(0000) knlGS:0000000000000000
> CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> CR2: 00007f0703da4000 CR3: 0000000113398004 CR4: 0000000000370ef0
> Call Trace:
>  flush_space+0x4a8/0x660
>  btrfs_async_reclaim_data_space+0x55/0x130
>  process_one_work+0x1e9/0x380
>  worker_thread+0x53/0x3e0
>  ? process_one_work+0x380/0x380
>  kthread+0x118/0x140
>  ? __kthread_bind_mask+0x60/0x60
>  ret_from_fork+0x1f/0x30
>
> Fix this by checking to see if we have a btrfs fs error in either of the
> reclaim loops, and if so fail the tickets and bail.  In addition to
> this, fix maybe_fail_all_tickets() to not try to grant tickets if we've
> aborted, simply fail everything.
>
> Signed-off-by: Josef Bacik <josef@toxicpanda.com>

Reviewed-by: Filipe Manana <fdmanana@suse.com>

Looks good, thanks.

> ---
>  fs/btrfs/space-info.c | 27 +++++++++++++++++++++++----
>  1 file changed, 23 insertions(+), 4 deletions(-)
>
> diff --git a/fs/btrfs/space-info.c b/fs/btrfs/space-info.c
> index aa5be0b24987..63423f9729c5 100644
> --- a/fs/btrfs/space-info.c
> +++ b/fs/btrfs/space-info.c
> @@ -885,6 +885,7 @@ static bool maybe_fail_all_tickets(struct btrfs_fs_info *fs_info,
>  {
>         struct reserve_ticket *ticket;
>         u64 tickets_id = space_info->tickets_id;
> +       const bool aborted = btrfs_has_fs_error(fs_info);
>
>         trace_btrfs_fail_all_tickets(fs_info, space_info);
>
> @@ -898,16 +899,19 @@ static bool maybe_fail_all_tickets(struct btrfs_fs_info *fs_info,
>                 ticket = list_first_entry(&space_info->tickets,
>                                           struct reserve_ticket, list);
>
> -               if (ticket->steal &&
> +               if (!aborted && ticket->steal &&
>                     steal_from_global_rsv(fs_info, space_info, ticket))
>                         return true;
>
> -               if (btrfs_test_opt(fs_info, ENOSPC_DEBUG))
> +               if (!aborted && btrfs_test_opt(fs_info, ENOSPC_DEBUG))
>                         btrfs_info(fs_info, "failing ticket with %llu bytes",
>                                    ticket->bytes);
>
>                 remove_ticket(space_info, ticket);
> -               ticket->error = -ENOSPC;
> +               if (aborted)
> +                       ticket->error = -EIO;
> +               else
> +                       ticket->error = -ENOSPC;
>                 wake_up(&ticket->wait);
>
>                 /*
> @@ -916,7 +920,8 @@ static bool maybe_fail_all_tickets(struct btrfs_fs_info *fs_info,
>                  * here to see if we can make progress with the next ticket in
>                  * the list.
>                  */
> -               btrfs_try_granting_tickets(fs_info, space_info);
> +               if (!aborted)
> +                       btrfs_try_granting_tickets(fs_info, space_info);
>         }
>         return (tickets_id != space_info->tickets_id);
>  }
> @@ -1172,6 +1177,10 @@ static void btrfs_async_reclaim_data_space(struct work_struct *work)
>                         spin_unlock(&space_info->lock);
>                         return;
>                 }
> +
> +               /* Something happened, fail everything and bail. */
> +               if (btrfs_has_fs_error(fs_info))
> +                       goto aborted_fs;
>                 last_tickets_id = space_info->tickets_id;
>                 spin_unlock(&space_info->lock);
>         }
> @@ -1202,9 +1211,19 @@ static void btrfs_async_reclaim_data_space(struct work_struct *work)
>                         } else {
>                                 flush_state = 0;
>                         }
> +
> +                       /* Something happened, fail everything and bail. */
> +                       if (btrfs_has_fs_error(fs_info))
> +                               goto aborted_fs;
> +
>                 }
>                 spin_unlock(&space_info->lock);
>         }
> +       return;
> +aborted_fs:
> +       maybe_fail_all_tickets(fs_info, space_info);
> +       space_info->flush = 0;
> +       spin_unlock(&space_info->lock);
>  }
>
>  void btrfs_init_async_reclaim_work(struct btrfs_fs_info *fs_info)
> --
> 2.26.3
>


-- 
Filipe David Manana,

“Whether you think you can, or you think you can't — you're right.”

  reply	other threads:[~2021-09-28 10:23 UTC|newest]

Thread overview: 13+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2021-09-27 18:05 [PATCH v3 0/5] Miscellaneous error handling patches Josef Bacik
2021-09-27 18:05 ` [PATCH v3 1/5] btrfs: change handle_fs_error in recover_log_trees to aborts Josef Bacik
2021-09-28  5:46   ` Anand Jain
2021-09-27 18:05 ` [PATCH v3 2/5] btrfs: change error handling for btrfs_delete_*_in_log Josef Bacik
2021-09-28  9:53   ` Filipe Manana
2021-09-27 18:05 ` [PATCH v3 3/5] btrfs: add a BTRFS_FS_ERROR helper Josef Bacik
2021-09-28  7:38   ` Anand Jain
2021-09-27 18:05 ` [PATCH v3 4/5] btrfs: do not infinite loop in data reclaim if we aborted Josef Bacik
2021-09-28 10:22   ` Filipe Manana [this message]
2021-10-01 21:48   ` kernel test robot
2021-10-01 21:48     ` kernel test robot
2021-09-27 18:05 ` [PATCH v3 5/5] btrfs: fix abort logic in btrfs_replace_file_extents Josef Bacik
2021-09-28 10:07   ` Filipe Manana

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to='CAL3q7H5jsdCY51xNP2x6exu_pzDwJxvvBJsAce4ZzemzNzF=Zw@mail.gmail.com' \
    --to=fdmanana@gmail.com \
    --cc=josef@toxicpanda.com \
    --cc=kernel-team@fb.com \
    --cc=linux-btrfs@vger.kernel.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.