From: Coly Li <colyli@suse.de>
To: Shaohua Li <shli@kernel.org>
Cc: NeilBrown <neilb@suse.com>, NeilBrown <neilb@suse.de>,
	linux-raid@vger.kernel.org, Shaohua Li <shli@fb.com>,
	Johannes Thumshirn <jthumshirn@suse.de>,
	Guoqing Jiang <gqjiang@suse.com>
Subject: Re: [PATCH V3 1/2] RAID1: a new I/O barrier implementation to remove resync window
Date: Mon, 20 Feb 2017 16:07:38 +0800	[thread overview]
Message-ID: <7178ADEA-6263-4FB0-95A5-96E5F71A9740@suse.de> (raw)
In-Reply-To: <20170220070430.4mca7clpaw7kpj4j@kernel.org>




Sent from my iPhone
> On 20 Feb 2017, at 3:04 PM, Shaohua Li <shli@kernel.org> wrote:
> 
>> On Mon, Feb 20, 2017 at 01:51:22PM +1100, Neil Brown wrote:
>>> On Mon, Feb 20 2017, NeilBrown wrote:
>>> 
>>>> On Fri, Feb 17 2017, Coly Li wrote:
>>>> 
>>>>> On 2017/2/16 3:04 PM, NeilBrown wrote:
>>>>> I know you are going to change this as Shaohua wants the splitting to
>>>>> happen in a separate function, which I agree with, but there is 
>>>>> something else wrong here. Calling bio_split/bio_chain repeatedly
>>>>> in a loop is dangerous. It is OK for simple devices, but when one
>>>>> request can wait for another request to the same device it can
>>>>> deadlock. This can happen with raid1.  If a resync request calls
>>>>> raise_barrier() between one request and the next, then the next has
>>>>> to wait for the resync request, which has to wait for the first
>>>>> request. As the first request will be stuck in the queue in 
>>>>> generic_make_request(), you get a deadlock.
>>>> 
>>>> For md raid1, the queue in generic_make_request(), can I understand it as
>>>> bio_list_on_stack in this function? And the queue in the underlying device,
>>>> can I understand it as data structures like plug->pending and
>>>> conf->pending_bio_list?
>>> 
>>> Yes, the queue in generic_make_request() is the bio_list_on_stack.  That
>>> is the only queue I am talking about.  I'm not referring to
>>> plug->pending or conf->pending_bio_list at all.
>>> 
>>>> 
>>>> I still don't get the point of the deadlock, so let me try to explain why
>>>> I don't see it. If a bio is split, the first part is processed by
>>>> make_request_fn(), and then a resync comes and raises a barrier, there are
>>>> 3 possible conditions,
>>>> - the resync I/O tries to raise a barrier on the same bucket as the first
>>>> regular bio. Then the resync task has to wait until the first bio drops
>>>> its conf->nr_pending[idx]
>>> 
>>> Not quite.
>>> First, the resync task (in raise_barrier()) will wait for ->nr_waiting[idx]
>>> to be zero.  We can assume this happens immediately.
>>> Then the resync_task will increment ->barrier[idx].
>>> Only then will it wait for the first bio to drop ->nr_pending[idx].
>>> The processing of that first bio will have submitted bios to the
>>> underlying device, and they will be in the bio_list_on_stack queue, and
>>> will not be processed until raid1_make_request() completes.
>>> 
>>> The loop in raid1_make_request() will then call make_request_fn() which
>>> will call wait_barrier(), which will wait for ->barrier[idx] to be
>>> zero.
>> 
>> Thinking more carefully about this.. the 'idx' that the second bio will
>> wait for will normally be different, so there won't be a deadlock after
>> all.
>> 
>> However it is possible for hash_long() to produce the same idx for two
>> consecutive barrier_units so there is still the possibility of a
>> deadlock, though it isn't as likely as I thought at first.
> 
> Wrapped the function pointer issue Neil pointed out into Coly's original patch.
> Also fixed a 'use-after-free' bug. For the deadlock issue, I'll add the patch
> below, please check.
> 
> Thanks,
> Shaohua
> 

Hmm, please hold, I am still thinking about it. With barrier buckets and hash_long(), I don't see the deadlock yet. For raid10 it might happen, but once we have barrier buckets there too, there will be no deadlock.

My question is, this deadlock only happens when a big bio is split, the resulting small bios are contiguous, and the resync I/O visits the barrier buckets in sequential order too. If adjacent split regular bios or resync bios hit the same barrier bucket, that would be a serious failure of the hash design and should have been found already. But since no one has complained about it, I cannot convince myself that the deadlock is real with the I/O barrier buckets (which is what Neil is concerned about).
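
For reference, the bucket index in the barrier bucket code is computed along these lines (a simplified sketch; the constant values below are illustrative, the real definitions are in the V3 patch):

#include <linux/hash.h>
#include <linux/types.h>

/* One barrier unit covers 64MB; values here are illustrative. */
#define BARRIER_UNIT_SECTOR_BITS	17
#define BARRIER_BUCKETS_NR_BITS		10

static inline int sector_to_idx(sector_t sector_nr)
{
	/* hash_long() scatters barrier units over the buckets, so two
	 * adjacent barrier units are not guaranteed to land in different
	 * buckets -- this is the collision Neil is concerned about. */
	return hash_long(sector_nr >> BARRIER_UNIT_SECTOR_BITS,
			 BARRIER_BUCKETS_NR_BITS);
}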

As for the function pointer assignment, it is because I see a branch taken inside a loop. If I use a function pointer, I can avoid the redundant branch inside the loop. raid1_read_request() and raid1_write_request() are not simple functions, and I don't know whether gcc will inline them or not, so I am checking the disassembled code now.

The loop in raid1_make_request() is quite high level, and I am not sure whether CPU branch prediction works well there, especially when it is a big DISCARD bio; using a function pointer may remove a possible branch.
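
To make the idea concrete, the pattern I have in mind is roughly the following (just a sketch based on the V3 loop, not final code):

static void raid1_make_request(struct mddev *mddev, struct bio *bio)
{
	struct bio *split;
	sector_t sectors;
	/* Pick the handler once, outside the loop, so there is no
	 * read/write branch in every iteration of the split loop. */
	void (*handle_request)(struct mddev *, struct bio *) =
		(bio_data_dir(bio) == READ) ?
		raid1_read_request : raid1_write_request;

	/* if bio exceeds barrier unit boundary, split it */
	do {
		sectors = align_to_barrier_unit_end(
				bio->bi_iter.bi_sector, bio_sectors(bio));
		if (sectors < bio_sectors(bio)) {
			split = bio_split(bio, sectors, GFP_NOIO, fs_bio_set);
			bio_chain(split, bio);
		} else {
			split = bio;
		}
		handle_request(mddev, split);
	} while (split != bio);
}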

So I need to check what we gain and lose with and without the function pointer. If it is not urgent, please hold this patch for a while.

The only thing I worry about in the patch below is: if a very big DISCARD bio comes in, will the kernel stack tend to overflow?
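
For context, the per-task queue Neil keeps referring to is just a small bio_list that generic_make_request() keeps on the stack of its outermost call; roughly it works like this (heavily simplified from block/blk-core.c of this kernel generation, error handling and queue enter/exit omitted):

blk_qc_t generic_make_request(struct bio *bio)
{
	struct bio_list bio_list_on_stack;

	/* Already inside generic_make_request() on this task: just queue
	 * the bio, the outermost call will dispatch it later. */
	if (current->bio_list) {
		bio_list_add(current->bio_list, bio);
		return BLK_QC_T_NONE;
	}

	/* Outermost call: dispatch bios iteratively, so recursion from
	 * stacked devices becomes iteration over this on-stack list. */
	bio_list_init(&bio_list_on_stack);
	current->bio_list = &bio_list_on_stack;
	do {
		struct request_queue *q = bdev_get_queue(bio->bi_bdev);

		q->make_request_fn(q, bio);
		bio = bio_list_pop(current->bio_list);
	} while (bio);
	current->bio_list = NULL;

	return BLK_QC_T_NONE;
}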

Thanks.

Coly





> From ee9c98138bcdf8bceef384a68f49258b6b8b8c6d Mon Sep 17 00:00:00 2001
> Message-Id: <ee9c98138bcdf8bceef384a68f49258b6b8b8c6d.1487573888.git.shli@fb.com>
> From: Shaohua Li <shli@fb.com>
> Date: Sun, 19 Feb 2017 22:18:32 -0800
> Subject: [PATCH] md/raid1/10: fix potential deadlock
> 
> Neil Brown pointed out a potential deadlock in the raid10 code with
> bio_split/bio_chain. The raid1 code could have the same issue, but the
> recent barrier rework makes it less likely to happen. The deadlock happens
> in the sequence below:
> 
> 1. generic_make_request(bio), this will set current->bio_list
> 2. raid10_make_request will split bio to bio1 and bio2
> 3. __make_request(bio1), wait_barrier, add underlying disk bios to
> current->bio_list
> 4. __make_request(bio2), wait_barrier
> 
> If raise_barrier happens between 3 & 4, since wait_barrier runs at 3,
> raise_barrier waits for IO completion from 3. And since raise_barrier
> sets the barrier, 4 waits for raise_barrier. But IO from 3 can't be
> dispatched because raid10_make_request() hasn't finished yet.
> 
> The solution is to adjust the IO ordering. Quotes from Neil:
> "
> It is much safer to:
> 
>    if (need to split) {
>        split = bio_split(bio, ...)
>        bio_chain(...)
>        make_request_fn(split);
>        generic_make_request(bio);
>   } else
>        make_request_fn(mddev, bio);
> 
> This way we first process the initial section of the bio (in 'split')
> which will queue some requests to the underlying devices.  These
> requests will be queued in generic_make_request.
> Then we queue the remainder of the bio, which will be added to the end
> of the generic_make_request queue.
> Then we return.
> generic_make_request() will pop the lower-level device requests off the
> queue and handle them first.  Then it will process the remainder
> of the original bio once the first section has been fully processed.
> "
> 
> Cc: Coly Li <colyli@suse.de>
> Cc: stable@vger.kernel.org (v3.14+, only the raid10 part)
> Suggested-by: NeilBrown <neilb@suse.com>
> Signed-off-by: Shaohua Li <shli@fb.com>
> ---
> drivers/md/raid1.c  | 28 ++++++++++++++--------------
> drivers/md/raid10.c | 41 ++++++++++++++++++++---------------------
> 2 files changed, 34 insertions(+), 35 deletions(-)
> 
> diff --git a/drivers/md/raid1.c b/drivers/md/raid1.c
> index 676f72d..e55d865 100644
> --- a/drivers/md/raid1.c
> +++ b/drivers/md/raid1.c
> @@ -1566,21 +1566,21 @@ static void raid1_make_request(struct mddev *mddev, struct bio *bio)
>    sector_t sectors;
> 
>    /* if bio exceeds barrier unit boundary, split it */
> -    do {
> -        sectors = align_to_barrier_unit_end(
> -                bio->bi_iter.bi_sector, bio_sectors(bio));
> -        if (sectors < bio_sectors(bio)) {
> -            split = bio_split(bio, sectors, GFP_NOIO, fs_bio_set);
> -            bio_chain(split, bio);
> -        } else {
> -            split = bio;
> -        }
> +    sectors = align_to_barrier_unit_end(
> +            bio->bi_iter.bi_sector, bio_sectors(bio));
> +    if (sectors < bio_sectors(bio)) {
> +        split = bio_split(bio, sectors, GFP_NOIO, fs_bio_set);
> +        bio_chain(split, bio);
> +    } else {
> +        split = bio;
> +    }
> 
> -        if (bio_data_dir(split) == READ)
> -            raid1_read_request(mddev, split);
> -        else
> -            raid1_write_request(mddev, split);
> -    } while (split != bio);
> +    if (bio_data_dir(split) == READ)
> +        raid1_read_request(mddev, split);
> +    else
> +        raid1_write_request(mddev, split);
> +    if (split != bio)
> +        generic_make_request(bio);
> }
> 
> static void raid1_status(struct seq_file *seq, struct mddev *mddev)
> diff --git a/drivers/md/raid10.c b/drivers/md/raid10.c
> index a1f8e98..b495049 100644
> --- a/drivers/md/raid10.c
> +++ b/drivers/md/raid10.c
> @@ -1551,28 +1551,27 @@ static void raid10_make_request(struct mddev *mddev, struct bio *bio)
>        return;
>    }
> 
> -    do {
> -
> -        /*
> -         * If this request crosses a chunk boundary, we need to split
> -         * it.
> -         */
> -        if (unlikely((bio->bi_iter.bi_sector & chunk_mask) +
> -                 bio_sectors(bio) > chunk_sects
> -                 && (conf->geo.near_copies < conf->geo.raid_disks
> -                 || conf->prev.near_copies <
> -                 conf->prev.raid_disks))) {
> -            split = bio_split(bio, chunk_sects -
> -                      (bio->bi_iter.bi_sector &
> -                       (chunk_sects - 1)),
> -                      GFP_NOIO, fs_bio_set);
> -            bio_chain(split, bio);
> -        } else {
> -            split = bio;
> -        }
> +    /*
> +     * If this request crosses a chunk boundary, we need to split
> +     * it.
> +     */
> +    if (unlikely((bio->bi_iter.bi_sector & chunk_mask) +
> +             bio_sectors(bio) > chunk_sects
> +             && (conf->geo.near_copies < conf->geo.raid_disks
> +             || conf->prev.near_copies <
> +             conf->prev.raid_disks))) {
> +        split = bio_split(bio, chunk_sects -
> +                  (bio->bi_iter.bi_sector &
> +                   (chunk_sects - 1)),
> +                  GFP_NOIO, fs_bio_set);
> +        bio_chain(split, bio);
> +    } else {
> +        split = bio;
> +    }
> 
> -        __make_request(mddev, split);
> -    } while (split != bio);
> +    __make_request(mddev, split);
> +    if (split != bio)
> +        generic_make_request(bio);
> 
>    /* In case raid10d snuck in to freeze_array */
>    wake_up(&conf->wait_barrier);
> -- 
> 2.9.3
> 

