From: Filipe Manana <fdmanana@kernel.org>
To: Liu Bo <bo.li.liu@oracle.com>
Cc: "linux-btrfs@vger.kernel.org" <linux-btrfs@vger.kernel.org>,
	Zygo Blaxell <ce3g8jdj@umail.furryterror.org>,
	Filipe Manana <fdmanana@suse.com>
Subject: Re: [PATCH] Btrfs: fix hang during inode eviction due to concurrent readahead
Date: Wed, 10 Jun 2015 11:44:17 +0100
Message-ID: <CAL3q7H4Pxx7PWyJZFjYs4hVxep7Df_V=8cgPb9R_D7vmoNiB7w@mail.gmail.com>
In-Reply-To: <20150610104129.GD3561@localhost.localdomain>

On Wed, Jun 10, 2015 at 11:41 AM, Liu Bo <bo.li.liu@oracle.com> wrote:
> On Tue, May 26, 2015 at 12:55:42AM +0100, fdmanana@kernel.org wrote:
>> From: Filipe Manana <fdmanana@suse.com>
>>
>> Zygo Blaxell and other users have reported occasional hangs while an
>> inode is being evicted, leading to traces like the following:
>>
>> [ 5281.972322] INFO: task rm:20488 blocked for more than 120 seconds.
>> [ 5281.973836]       Not tainted 4.0.0-rc5-btrfs-next-9+ #2
>> [ 5281.974818] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
>> [ 5281.976364] rm              D ffff8800724cfc38     0 20488   7747 0x00000000
>> [ 5281.977506]  ffff8800724cfc38 ffff8800724cfc38 ffff880065da5c50 0000000000000001
>> [ 5281.978461]  ffff8800724cffd8 ffff8801540a5f50 0000000000000008 ffff8801540a5f78
>> [ 5281.979541]  ffff8801540a5f50 ffff8800724cfc58 ffffffff8143107e 0000000000000123
>> [ 5281.981396] Call Trace:
>> [ 5281.982066]  [<ffffffff8143107e>] schedule+0x74/0x83
>> [ 5281.983341]  [<ffffffffa03b33cf>] wait_on_state+0xac/0xcd [btrfs]
>> [ 5281.985127]  [<ffffffff81075cd6>] ? signal_pending_state+0x31/0x31
>> [ 5281.986715]  [<ffffffffa03b4b71>] wait_extent_bit.constprop.32+0x7c/0xde [btrfs]
>> [ 5281.988680]  [<ffffffffa03b540b>] lock_extent_bits+0x5d/0x88 [btrfs]
>> [ 5281.990200]  [<ffffffffa03a621d>] btrfs_evict_inode+0x24e/0x5be [btrfs]
>> [ 5281.991781]  [<ffffffff8116964d>] evict+0xa0/0x148
>> [ 5281.992735]  [<ffffffff8116a43d>] iput+0x18f/0x1e5
>> [ 5281.993796]  [<ffffffff81160d4a>] do_unlinkat+0x15b/0x1fa
>> [ 5281.994806]  [<ffffffff81435b54>] ? ret_from_sys_call+0x1d/0x58
>> [ 5281.996120]  [<ffffffff8107d314>] ? trace_hardirqs_on_caller+0x18f/0x1ab
>> [ 5281.997562]  [<ffffffff8123960b>] ? trace_hardirqs_on_thunk+0x3a/0x3f
>> [ 5281.998815]  [<ffffffff81161a16>] SyS_unlinkat+0x29/0x2b
>> [ 5281.999920]  [<ffffffff81435b32>] system_call_fastpath+0x12/0x17
>> [ 5282.001299] 1 lock held by rm/20488:
>> [ 5282.002066]  #0:  (sb_writers#12){.+.+.+}, at: [<ffffffff8116dd81>] mnt_want_write+0x24/0x4b
>>
>> This happens when readahead, which calls readpages(), runs right
>> before the inode eviction handler is invoked. The sequence of events
>> is essentially:
>>
>> 1) readpages() is called while a reference on the inode is held, so
>>    eviction can not be triggered before readpages() returns. It also
>>    locks one or more ranges in the inode's io_tree (which is done at
>>    extent_io.c:__do_contiguous_readpages());
>>
>> 2) readpages() submits several read bios, all with an end io callback
>>    that runs extent_io.c:end_bio_extent_readpage() and that is executed
>>    by another task when a bio finishes, corresponding to a work queue
>>    (fs_info->end_io_workers) worker kthread. This callback unlocks
>>    the ranges in the inode's io_tree that were previously locked in
>>    step 1;
>>
>> 3) readpages() returns, the reference on the inode is dropped;
>>
>> 4) One or more of the read bios previously submitted are still not
>>    complete (their end io callback was not yet invoked or has not
>>    yet finished execution);
>>
>> 5) Inode eviction is triggered (through an unlink call for example).
>>    The inode reference count was not incremented before submitting
>>    the read bios, so this is possible;
>>
>> 6) The eviction handler starts executing and enters the loop that
>>    iterates over all extent states in the inode's io_tree;
>
> One question here:
>  btrfs_evict_inode()->
>         evict_inode_truncate_pages()->
>                 truncate_inode_pages_final()
>
> truncate_inode_pages_final() finds all the pages in the mapping and
> calls lock_page() on each one; if a readpage is running in another
> task, it should block on lock_page() until end_bio_extent_readpage()
> does an unlock_page().
>
> How does the eviction handler bypass this and enter the
> 'extent_state' loop?
>
> Am I missing something?

Yes, you are missing something:
end_bio_extent_readpage() unlocks the pages first and only afterwards
unlocks the ranges in the inode's io tree.
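
The page is unlocked from inside the bio's end io callback before the
range in the io_tree is. A simplified sketch of that callback (details
and error handling elided, names as in extent_io.c around that time):

    static void end_bio_extent_readpage(struct bio *bio, int err)
    {
            ...
            bio_for_each_segment_all(bvec, bio, i) {
                    struct page *page = bvec->bv_page;
                    ...
                    unlock_page(page);
                    /* truncate_inode_pages_final() can make
                     * progress from this point on. */
            }
            ...
            /* Only here is the corresponding range in the
             * inode's io_tree unlocked. */
            endio_readpage_release_extent(tree, extent_start,
                                          extent_len, uptodate);
    }

So truncate_inode_pages_final() can find every page already unlocked
while one or more ranges in the io_tree are still locked, and that is
the window in which the eviction handler reaches the extent_state
loop.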

thanks


>
> Thanks,
>
> -liubo
>>
>> 7) The loop picks one extent state record and uses its ->start and
>>    ->end fields, after releasing the inode's io_tree spinlock, to
>>    call lock_extent_bits() and clear_extent_bit(). The call to lock
>>    the range [state->start, state->end] blocks because the whole
>>    range or a part of it was locked by the previous call to
>>    readpages() and the corresponding end io callback, which unlocks
>>    the range, has not yet been executed;
>>
>> 8) The end io callback for the read bio is executed and unlocks the
>>    range [state->start, state->end] (or a superset of that range).
>>    And at clear_extent_bit() the extent_state record state is used
>>    as the second argument to split_state(), which sets state->start to
>>    a larger value;
>>
>> 9) The task executing the eviction handler is woken up by the task
>>    executing the bio's end io callback (through clear_state_bit) and
>>    the eviction handler locks the range
>>    [old value for state->start, state->end]. Shortly after, when
>>    calling clear_extent_bit(), it unlocks the range
>>    [new value for state->start, state->end], so it ends up unlocking
>>    only part of the range that it locked, leaving an extent state
>>    record in the io_tree that represents the unlocked subrange;
>>
>> 10) The eviction handler loop, in its next iteration, gets the
>>     extent_state record for the subrange that it did not unlock in the
>>     previous step and then tries to lock it, resulting in a hang.
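
To make steps 7) to 10) concrete, with hypothetical offsets purely for
illustration: suppose the extent_state record spans [0, 262143] and
readahead still holds [131072, 262143] locked:

    eviction handler                     bio end io callback
    ----------------                     -------------------
    lock_extent_bits(io_tree,
        state->start /* 0 */,
        state->end /* 262143 */, ...)
      -> blocks on [131072, 262143]
                                         unlocks [131072, 262143];
                                         inside clear_extent_bit(),
                                         split_state() sets
                                         state->start = 131072
    wakes up holding [0, 262143]
    clear_extent_bit(io_tree,
        state->start /* now 131072 */,
        state->end /* 262143 */, ...)
      -> unlocks only [131072, 262143];
         [0, 131071] remains locked

The next loop iteration finds the extent_state record left behind for
[0, 131071] and blocks forever trying to lock it.
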
>>
>> So fix this by not using the ->start and ->end fields of an existing
>> extent_state record. This is a simple solution, and an alternative
>> could be to bump the inode's reference count before submitting each
>> read bio and having it dropped in the bio's end io callback. But that
>> would be a more invasive/complex change and would not protect against
>> other code paths that access the io_tree without holding a reference
>> on the inode. Something to consider in the future.
>>
>> Many thanks to Zygo Blaxell for reporting the issue on the mailing
>> list, providing a set of scripts to trigger it, and testing this fix.
>>
>> Reported-by: Zygo Blaxell <ce3g8jdj@umail.furryterror.org>
>> Tested-by: Zygo Blaxell <ce3g8jdj@umail.furryterror.org>
>> Signed-off-by: Filipe Manana <fdmanana@suse.com>
>> ---
>>  fs/btrfs/inode.c | 26 +++++++++++++++++++++-----
>>  1 file changed, 21 insertions(+), 5 deletions(-)
>>
>> diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
>> index 0020b56..ef05800 100644
>> --- a/fs/btrfs/inode.c
>> +++ b/fs/btrfs/inode.c
>> @@ -4987,24 +4987,40 @@ static void evict_inode_truncate_pages(struct inode *inode)
>>       }
>>       write_unlock(&map_tree->lock);
>>
>> +     /*
>> +      * Keep looping until we have no more ranges in the io tree.
>> +      * We can have ongoing bios started by readpages (called from readahead)
>> +      * that didn't get their end io callbacks called yet or they are still
>> +      * in progress (extent_io.c:end_bio_extent_readpage()). This means some
>> +      * ranges can still be locked and eviction started because before
>> +      * submitting those bios, which are executed by a separate task (work
>> +      * queue kthread), inode references (inode->i_count) were not taken
>> +      * (which would be dropped in the end io callback of each bio).
>> +      * Therefore here we effectively end up waiting for those bios and
>> +      * anyone else holding locked ranges without having bumped the inode's
>> +      * reference count - if we don't do it, when they access the inode's
>> +      * io_tree to unlock a range, it may be too late, leading to a
>> +      * use-after-free issue.
>> +      */
>>       spin_lock(&io_tree->lock);
>>       while (!RB_EMPTY_ROOT(&io_tree->state)) {
>>               struct extent_state *state;
>>               struct extent_state *cached_state = NULL;
>> +             u64 start;
>> +             u64 end;
>>
>>               node = rb_first(&io_tree->state);
>>               state = rb_entry(node, struct extent_state, rb_node);
>> -             atomic_inc(&state->refs);
>> +             start = state->start;
>> +             end = state->end;
>>               spin_unlock(&io_tree->lock);
>>
>> -             lock_extent_bits(io_tree, state->start, state->end,
>> -                              0, &cached_state);
>> -             clear_extent_bit(io_tree, state->start, state->end,
>> +             lock_extent_bits(io_tree, start, end, 0, &cached_state);
>> +             clear_extent_bit(io_tree, start, end,
>>                                EXTENT_LOCKED | EXTENT_DIRTY |
>>                                EXTENT_DELALLOC | EXTENT_DO_ACCOUNTING |
>>                                EXTENT_DEFRAG, 1, 1,
>>                                &cached_state, GFP_NOFS);
>> -             free_extent_state(state);
>>
>>               cond_resched();
>>               spin_lock(&io_tree->lock);
>> --
>> 2.1.3
>>
