From: Shaohua Li <shli@kernel.org>
To: Coly Li <colyli@suse.de>
Cc: NeilBrown <neilb@suse.com>,
	linux-raid@vger.kernel.org, Shaohua Li <shli@fb.com>,
	Johannes Thumshirn <jthumshirn@suse.de>,
	Guoqing Jiang <gqjiang@suse.com>
Subject: Re: [PATCH V3 1/2] RAID1: a new I/O barrier implementation to remove resync window
Date: Tue, 21 Feb 2017 09:45:42 -0800	[thread overview]
Message-ID: <20170221174542.rer73ywil3oq26gj@kernel.org> (raw)
In-Reply-To: <f36e85b1-cca1-5be4-a171-d64057cc56a3@suse.de>

On Tue, Feb 21, 2017 at 05:45:53PM +0800, Coly Li wrote:
> On 2017/2/21 8:29 AM, NeilBrown wrote:
> > On Mon, Feb 20 2017, Coly Li wrote:
> > 
> >>> On 2017/2/20 at 3:04 PM, Shaohua Li <shli@kernel.org> wrote:
> >>> 
> >>>> On Mon, Feb 20, 2017 at 01:51:22PM +1100, Neil Brown wrote:
> >>>>> On Mon, Feb 20 2017, NeilBrown wrote:
> >>>>> 
> >>>>>> On Fri, Feb 17 2017, Coly Li wrote:
> >>>>>> 
> >>>>>>> On 2017/2/16 3:04 PM, NeilBrown wrote: I know you are
> >>>>>>> going to change this as Shaohua wants the splitting to
> >>>>>>> happen in a separate function, which I agree with, but
> >>>>>>> there is something else wrong here. Calling
> >>>>>>> bio_split/bio_chain repeatedly in a loop is dangerous.
> >>>>>>> It is OK for simple devices, but when one request can
> >>>>>>> wait for another request to the same device it can 
> >>>>>>> deadlock. This can happen with raid1.  If a resync
> >>>>>>> request calls raise_barrier() between one request and
> >>>>>>> the next, then the next has to wait for the resync
> >>>>>>> request, which has to wait for the first request. As
> >>>>>>> the first request will be stuck in the queue in 
> >>>>>>> generic_make_request(), you get a deadlock.
> >>>>>> 
> >>>>>> For md raid1, queue in generic_make_request(), can I
> >>>>>> understand it as bio_list_on_stack in this function? And
> >>>>>> queue in underlying device, can I understand it as the
> >>>>>> data structures like plug->pending and 
> >>>>>> conf->pending_bio_list ?
> >>>>> 
> >>>>> Yes, the queue in generic_make_request() is the
> >>>>> bio_list_on_stack.  That is the only queue I am talking
> >>>>> about.  I'm not referring to plug->pending or
> >>>>> conf->pending_bio_list at all.
> >>>>> 
> >>>>>> 
> >>>>>> I still don't get the point of deadlock, let me try to
> >>>>>> explain why I don't see the possible deadlock. If a bio
> >>>>>> is split, and the first part is processed by
> >>>>>> make_request_fn(), and then a resync comes and raises a
> >>>>>> barrier, there are 3 possible conditions: - the resync
> >>>>>> I/O tries to raise a barrier on the same bucket as the
> >>>>>> first regular bio. Then the resync task has to wait until
> >>>>>> the first bio drops its conf->nr_pending[idx]
> >>>>> 
> >>>>> Not quite. First, the resync task (in raise_barrier()) will
> >>>>> wait for ->nr_waiting[idx] to be zero.  We can assume this
> >>>>> happens immediately. Then the resync_task will increment
> >>>>> ->barrier[idx]. Only then will it wait for the first bio to
> >>>>> drop ->nr_pending[idx]. The processing of that first bio
> >>>>> will have submitted bios to the underlying device, and they
> >>>>> will be in the bio_list_on_stack queue, and will not be
> >>>>> processed until raid1_make_request() completes.
> >>>>> 
> >>>>> The loop in raid1_make_request() will then call
> >>>>> make_request_fn() which will call wait_barrier(), which
> >>>>> will wait for ->barrier[idx] to be zero.
> >>>> 
> >>>> Thinking more carefully about this.. the 'idx' that the
> >>>> second bio will wait for will normally be different, so there
> >>>> won't be a deadlock after all.
> >>>> 
> >>>> However it is possible for hash_long() to produce the same
> >>>> idx for two consecutive barrier_units so there is still the
> >>>> possibility of a deadlock, though it isn't as likely as I
> >>>> thought at first.
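The two-thread ordering above can be sketched as pseudocode (field names follow the barrier-bucket patch under discussion; this is illustrative, not the actual raid1.c code):

```c
/* Thread A: raid1_make_request() loop. The first split's child bios
 * are already queued on bio_list_on_stack in generic_make_request()
 * and will not run until raid1_make_request() returns.
 *
 * Thread B: resync task in raise_barrier(idx):
 *   1. wait for conf->nr_waiting[idx] == 0   (assume immediate)
 *   2. conf->barrier[idx]++
 *   3. wait for conf->nr_pending[idx] == 0   <- blocks: held by the
 *      first split, whose children are stuck on bio_list_on_stack
 *
 * Thread A: second split enters wait_barrier(idx'):
 *   waits for conf->barrier[idx'] == 0       <- blocks behind B
 *
 * If hash_long() maps both consecutive barrier units to the same
 * bucket (idx' == idx), A waits on B and B waits on A: deadlock.
 */
```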
> >>> 
> >>> Wrapped the function pointer issue Neil pointed out into Coly's
> >>> original patch. Also fix a 'use-after-free' bug. For the
> >>> deadlock issue, I'll add below patch, please check.
> >>> 
> >>> Thanks, Shaohua
> >>> 
> >> 
> 
> Neil,
> 
> Thanks for your patient explanation; I feel I'm starting to follow
> what you mean. Let me try to restate my understanding; correct me if
> I am wrong.
> 
> 
> >> Hmm, please hold, I am still thinking of it. With barrier buckets
> >> and hash_long(), I don't see a deadlock yet. For raid10 it might
> >> happen, but once we have barrier buckets on it, there will be no
> >> deadlock.
> >> 
> >> My question is, this deadlock only happens when a big bio is
> >> split, the split small bios are contiguous, and the resync I/O
> >> visits barrier buckets in sequential order too. If adjacent split
> >> regular bios or resync bios hit the same barrier bucket, it would
> >> be a very big failure of the hash design, and should have been
> >> found already. But no one has complained about it, so I can't
> >> convince myself the deadlock is real with I/O barrier buckets
> >> (this is what Neil is concerned about).
> > 
> > I think you are wrong about the design goal of a hash function. 
> > When fed a sequence of inputs, with any stride (i.e. with any
> > constant difference between consecutive inputs), the output of the
> > hash function should appear to be random. A random sequence can
> > produce the same number twice in a row. If the hash function
> > produces a number from 0 to N-1, you would expect two consecutive
> > outputs to be the same about once every N inputs.
> > 
> 
> Yes, you are right. But when I mentioned hash conflicts, I limited
> the integers to the range [0, 1<<38]. 38 is (64-17-9): when a 64-bit
> LBA address is divided by the 64MB I/O barrier unit size, its value
> range is reduced to [0, 1<<38].
> 
> The maximum size of a normal bio is 1MB, so it can be split into 2
> bios at most.
> 
> For a DISCARD bio, the maximum size is 4GB, so it can be split into
> 65 bios at most.
> 
> Then in this patch, the hash question reduces to: for any 65
> consecutive integers in the range [0, 1<<38], hashed by hash_long()
> into the range [0, 1023], will any hash conflict happen among these
> integers?
> 
> I tried a half range, [0, 1<<37], to check for hash conflicts, by
> writing a simple program that emulates the hash calculation in the
> new I/O barrier patch and iterates over all runs of {2, 65, 128, 512}
> consecutive integers in [0, 1<<37] looking for hash conflicts.
> 
> On a 20-core CPU each run took 7+ hours; in the end, no hash conflict
> was detected up to 512 consecutive integers under the above
> conditions. For 1024, a lot of hash conflicts are detected.
> 
> The [0, 1<<37] barrier-unit range maps back to a 2^63-byte LBA range,
> which is large enough for almost all existing md raid configurations.
> So for the current kernel implementation and real-world devices, for
> a single bio, there is no possible hash conflict in the new I/O
> barrier patch.
> 
> If bi_iter.bi_size changes from unsigned int to unsigned long in the
> future, the above assumption will be wrong: there will be hash
> conflicts, and a potential deadlock, which is quite subtle. Yes, I
> agree with you; bio splitting inside the loop is not perfect.
> 
> > Even if there was no possibility of a deadlock from a resync
> > request happening between two bios, there are other possibilities.
> > 
> 
> The text below teaches me more about the raid1 code, but confuses me
> more as well. Here come my questions:
> 
> > It is not, in general, safe to call mempool_alloc() twice in a
> > row, without first ensuring that the first allocation will get
> > freed by some other thread.  raid1_write_request() allocates from
> > r1bio_pool, and then submits bios to the underlying device, which
> > get queued on bio_list_on_stack.  They will not be processed until
> > after raid1_make_request() completes, so when raid1_make_request
> > loops around and calls raid1_write_request() again, it will try to
> > allocate another r1bio from r1bio_pool, and this might end up
> > waiting for the r1bio which is trapped and cannot complete.
> > 
> 
> Can I say that this is because blk_finish_plug() won't be called
> before raid1_make_request() returns? Then in raid1_write_request(),
> mbio will be added to plug->pending, but until blk_finish_plug() is
> called, it won't be handled.

blk_finish_plug is called if raid1_make_request sleeps. The bio is held in
current->bio_list, not in the plug list.
 
> > As r1bio_pool preallocates 256 entries, this is unlikely but not
> > impossible.  If 256 threads all attempt a write (or read) that
> > crosses a boundary, then they will consume all 256 preallocated
> > entries, and want more. If there is no free memory, they will block
> > indefinitely.
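A rough sketch of the hazard Neil describes (illustrative pseudocode, not the actual raid1.c code; more_splits_left() and submit_children() are made-up helpers, while mempool_alloc() and r1bio_pool come from the discussion):

```c
void raid1_make_request(struct mddev *mddev, struct bio *bio)
{
	while (more_splits_left(bio)) {
		/* r1bio_pool preallocates 256 entries; if all are in
		 * flight and memory is tight, this alloc sleeps... */
		struct r1bio *r1_bio = mempool_alloc(r1bio_pool, GFP_NOIO);

		/* ...but an r1bio allocated in an earlier iteration
		 * cannot be freed: its child bios sit on
		 * current->bio_list and are only dispatched after
		 * raid1_make_request() returns. */
		submit_children(r1_bio);	/* queued, not processed */
	}
}
```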
> > 
> 
> If raid1_make_request() is modified in this way,
> +	if (bio_data_dir(split) == READ)
> +		raid1_read_request(mddev, split);
> +	else
> +		raid1_write_request(mddev, split);
> +	if (split != bio)
> +		generic_make_request(bio);
> 
> Then the original bio will be added to the bio_list_on_stack of the
> top-level generic_make_request(), where current->bio_list is
> initialized; when generic_make_request() is called nested inside
> raid1_make_request(), the split bio will be added to
> current->bio_list and nothing else happens.
> 
> After the nested generic_make_request() returns, control comes back
> to the following code in generic_make_request():
> 2022                         ret = q->make_request_fn(q, bio);
> 2023
> 2024                         blk_queue_exit(q);
> 2025
> 2026                         bio = bio_list_pop(current->bio_list);
> 
> bio_list_pop() will return the second half of the split bio, and it is

So in the above sequence, current->bio_list will have bios in the
following order: bios to the underlying disks, then the second half of
the original bio.

bio_list_pop will pop the bios to the underlying disks first and handle
them, then the second half of the original bio.

That said, this doesn't work for an array stacked 3 layers deep,
because in a 3-layer array, handling the middle-layer bio will add the
3rd-layer bios to the bio_list again.

Thanks,
Shaohua


Thread overview: 43+ messages
2017-02-15 16:35 [PATCH V3 1/2] RAID1: a new I/O barrier implementation to remove resync window colyli
2017-02-15 16:35 ` [PATCH V3 2/2] RAID1: avoid unnecessary spin locks in I/O barrier code colyli
2017-02-15 17:15   ` Coly Li
2017-02-16  2:25   ` Shaohua Li
2017-02-17 18:42     ` Coly Li
2017-02-16  7:04   ` NeilBrown
2017-02-17  7:56     ` Coly Li
2017-02-17 18:35       ` Coly Li
2017-02-16  2:22 ` [PATCH V3 1/2] RAID1: a new I/O barrier implementation to remove resync window Shaohua Li
2017-02-16 17:05   ` Coly Li
2017-02-17 12:40     ` Coly Li
2017-02-16  7:04 ` NeilBrown
2017-02-17  6:56   ` Coly Li
2017-02-19 23:50     ` NeilBrown
2017-02-20  2:51       ` NeilBrown
2017-02-20  7:04         ` Shaohua Li
2017-02-20  8:07           ` Coly Li
2017-02-20  8:30             ` Coly Li
2017-02-20 18:14             ` Wols Lists
2017-02-21 11:30               ` Coly Li
2017-02-21 19:20                 ` Wols Lists
2017-02-21 20:16                   ` Coly Li
2017-02-21  0:29             ` NeilBrown
2017-02-21  9:45               ` Coly Li
2017-02-21 17:45                 ` Shaohua Li [this message]
2017-02-21 20:09                   ` Coly Li
2017-02-23  5:54                     ` Coly Li
2017-02-23 17:34                       ` Shaohua Li
2017-02-23 19:31                         ` Coly Li
2017-02-23 19:58                           ` Shaohua Li
2017-02-24 17:02                             ` Coly Li
2017-02-24 10:19                           ` 王金浦
2017-02-28 19:42                             ` Shaohua Li
2017-03-01 17:01                               ` 王金浦
2017-02-23 23:14                       ` NeilBrown
2017-02-24 17:06                         ` Coly Li
2017-02-24 17:17                           ` Shaohua Li
2017-02-24 18:57                             ` Coly Li
2017-02-24 19:02                               ` Shaohua Li
2017-02-24 19:19                                 ` Coly Li
2017-02-17 19:41   ` Shaohua Li
2017-02-18  2:40     ` Coly Li
2017-02-19 23:42     ` NeilBrown
