From: Ming Lei <ming.lei@canonical.com>
To: Lars Ellenberg <lars.ellenberg@linbit.com>
Cc: linux-block <linux-block@vger.kernel.org>,
	Roland Kammerer <roland.kammerer@linbit.com>,
	Jens Axboe <axboe@kernel.dk>, NeilBrown <neilb@suse.com>,
	Kent Overstreet <kent.overstreet@gmail.com>,
	Shaohua Li <shli@kernel.org>, Alasdair Kergon <agk@redhat.com>,
	Mike Snitzer <snitzer@redhat.com>,
	"open list:DEVICE-MAPPER (LVM)" <dm-devel@redhat.com>,
	Ingo Molnar <mingo@redhat.com>,
	Peter Zijlstra <peterz@infradead.org>,
	Takashi Iwai <tiwai@suse.de>, Jiri Kosina <jkosina@suse.cz>,
	Zheng Liu <gnehzuil.liu@gmail.com>,
	Keith Busch <keith.busch@intel.com>,
	"Martin K. Petersen" <martin.petersen@oracle.com>,
	"Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>,
	Linux Kernel Mailing List <linux-kernel@vger.kernel.org>,
	"open list:BCACHE (BLOCK LAYER CACHE)"
	<linux-bcache@vger.kernel.org>,
	"open list:SOFTWARE RAID (Multiple Disks) SUPPORT"
	<linux-raid@vger.kernel.org>
Subject: Re: [RFC] block: fix blk_queue_split() resource exhaustion
Date: Sat, 2 Jul 2016 18:03:22 +0800	[thread overview]
Message-ID: <CACVXFVPK_g3uSVynnA=4-F3nPPoP9U_CgQRf7-tFP4sziNBD8A@mail.gmail.com> (raw)
In-Reply-To: <20160628084534.GI3239@soda.linbit>

On Tue, Jun 28, 2016 at 4:45 PM, Lars Ellenberg
<lars.ellenberg@linbit.com> wrote:
> On Sat, Jun 25, 2016 at 05:30:29PM +0800, Ming Lei wrote:
>> On Fri, Jun 24, 2016 at 10:27 PM, Lars Ellenberg
>> <lars.ellenberg@linbit.com> wrote:
>> > On Fri, Jun 24, 2016 at 07:36:57PM +0800, Ming Lei wrote:
>> >> >
>> >> > This is not a theoretical problem.
>> >> > At least in DRBD, with an unfortunately high IO concurrency wrt. the
>> >> > "max-buffers" setting, we have a reproducible deadlock without this patch.
>> >>
>> >> Is there any log of the deadlock? And is there any lockdep warning
>> >> if lockdep is enabled?
>> >
>> > In DRBD, to avoid potentially very long internal queues as we wait for
>> > our replication peer device and local backend, we limit the number of
>> > in-flight bios we accept, and block in our ->make_request_fn() if that
>> > number exceeds a configured watermark ("max-buffers").
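>> >
>> > Schematically, the logic is something like this (a minimal sketch
>> > with made-up names, not the actual DRBD code):
>> >
>> > 	static blk_qc_t our_make_request(struct request_queue *q,
>> > 	                                 struct bio *bio)
>> > 	{
>> > 		/* Block while "max-buffers" worth of bios are in flight.
>> > 		 * in_flight is decremented from our completion path, which
>> > 		 * then wakes this wait queue. */
>> > 		wait_event(device->misc_wait,
>> > 			   atomic_read(&device->in_flight) < device->max_buffers);
>> > 		atomic_inc(&device->in_flight);
>> >
>> > 		/* ... clone, remap, and submit to the local backend ... */
>> >
>> > 		return BLK_QC_T_NONE;
>> > 	}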
>> >
>> > That works fine as long as we can assume that, once our make_request_fn()
>> > returns, any bios we "recursively" submitted against the local backend
>> > will be dispatched. Which used to be the case.
>> >
>> > commit 54efd50 ("block: make generic_make_request handle arbitrarily sized bios")
>> > introduced blk_queue_split(), which will split any bio that is
>> > violating the queue limits into a smaller piece to be processed
>> > right away, and queue the "rest" on current->bio_list to be
>> > processed by the iteration in generic_make_request() once the
>> > current ->make_request_fn() returns.
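>> >
>> > Condensed, and hand-waving over the actual split computation
>> > (the helper name below is made up), the relevant part of
>> > blk_queue_split() is roughly:
>> >
>> > 	void blk_queue_split(struct request_queue *q, struct bio **bio,
>> > 	                     struct bio_set *bs)
>> > 	{
>> > 		/* carve off a front piece that fits the queue limits */
>> > 		struct bio *split = split_to_queue_limits(q, *bio, bs);
>> >
>> > 		if (split) {
>> > 			bio_chain(split, *bio);
>> > 			/* the remainder is only queued on current->bio_list,
>> > 			 * processed after ->make_request_fn() returns */
>> > 			generic_make_request(*bio);
>> > 			*bio = split;
>> > 		}
>> > 	}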
>> >
>> > Before that, any user was supposed to go through bio_add_page(),
>> > which would call our merge_bvec_fn(), and that should already
>> > be sufficient to not violate queue limits.
>> >
>> > Previously, typically the next in line bio to be processed would
>> > be the cloned one we passed down to our backend device, which in
>> > case of further stacking devices (e.g. logical volume, raid1 or
>> > similar) may again push further bios on that list.
>> > All of which would simply be processed in turn.
>>
>> Could you clarify whether the issue is triggered in DRBD without dm/md
>> involved, or whether it is only triggered with dm/md on top of DRBD?
>>
>> As Mike mentioned, there is one known issue with dm-snapshot.
>
>
> The issue can always be triggered, even if only DRBD is involved.
>
>> If your ->make_request_fn() is called several times, each time it
>> should dispatch at least the one bio returned from blk_queue_split(),
>> so I don't understand why not even one bio is dispatched.
>
> I'll try to "visualize" the mechanics of "my" deadlock here.
>
> Just to clarify the effect, assume we had a driver that
> for $reasons would limit the number of in-flight bios to
> one single bio.
>
> === before blk_queue_split()
>
> Previously, if something wanted to read some range of blocks,
> it would allocate a bio, call bio_add_page() a number of times,
> and once the bio was "full", call generic_make_request(),
> then fill the next bio and submit that one.
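>
> In (simplified) code, with nr_vecs, nr_pages and pages[] made up and
> all error handling and bio setup omitted, the classic pattern was:
>
> 	struct bio *bio = bio_alloc(GFP_NOIO, nr_vecs);
>
> 	for (i = 0; i < nr_pages; i++) {
> 		if (!bio_add_page(bio, pages[i], PAGE_SIZE, 0)) {
> 			/* bio "full" wrt. queue limits:
> 			 * submit it, start filling the next one */
> 			generic_make_request(bio);
> 			bio = bio_alloc(GFP_NOIO, nr_vecs);
> 			bio_add_page(bio, pages[i], PAGE_SIZE, 0);
> 		}
> 	}
> 	generic_make_request(bio);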
>
> Stacking: "single_in_flight" (q1) -> "sda" (q2)
>
>   generic_make_request(bio)     current->bio_list       in-flight
>    B_orig_1                             NULL            0
>     q1->make_request_fn(B_orig_1)       empty
>                                                         1
>         recursive call, queues:         B_1_remapped
>     iterative call:
>     q2->make_request_fn(B_1_remapped)   empty
>                 -> actually dispatched to hardware
>   return of generic_make_request(B_orig_1).
>    B_orig_2
>     q1->make_request_fn(B_orig_2)
>                                                         1
>         blocks, waits for in-flight to drop
>         ...
>         completion of B_orig_1                          0
>
>         recursive call, queues:         B_2_remapped
>     iterative call:
>     q2->make_request_fn(B_2_remapped)   empty
>                 -> actually dispatched to hardware
>   return of generic_make_request(B_orig_2).
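>
> (For reference, the iteration in generic_make_request() that drives
> the pattern above is essentially the following, simplified from the
> actual code:
>
> 	blk_qc_t generic_make_request(struct bio *bio)
> 	{
> 		struct bio_list bio_list_on_stack;
>
> 		/* recursive call: only queue the bio, in FIFO order */
> 		if (current->bio_list) {
> 			bio_list_add(current->bio_list, bio);
> 			return BLK_QC_T_NONE;
> 		}
>
> 		/* outermost call: iterate instead of recursing */
> 		bio_list_init(&bio_list_on_stack);
> 		current->bio_list = &bio_list_on_stack;
> 		do {
> 			struct request_queue *q = bdev_get_queue(bio->bi_bdev);
>
> 			q->make_request_fn(q, bio);
> 			/* pop the *oldest* queued bio first */
> 			bio = bio_list_pop(current->bio_list);
> 		} while (bio);
> 		current->bio_list = NULL;
>
> 		return BLK_QC_T_NONE;
> 	}
>
> Keep the FIFO pop in mind for the next trace: the remainder of a
> split, queued first, is processed before anything queued after it.)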
>
>
> === with blk_queue_split()
>
> Now, upper layers build one large bio.
>
>   generic_make_request(bio)     current->bio_list       in-flight
>    B_orig                               NULL            0
>     q1->make_request_fn(B_orig)         empty
>     blk_queue_split(B_orig) splits into
>                                         B_orig_r1
>     B_orig_s1
>                                                         1
>         recursive call, queues:         B_orig_r1, B_s1_remapped
>     iterative call:
>     q1->make_request_fn(B_orig_r1)      B_s1_remapped
>         blocks, waits for in-flight to drop
>         ...
>         which never happens,
>         because B_s1_remapped has not even been submitted to q2 yet,
>         let alone dispatched to hardware.
>
>
> Obviously we don't limit ourselves to just one request, but with larger
> incoming bios, with the number of times we split them, with the number
> of stacking layers below us, or even with layers below us that *also*
> call blk_queue_split() (or an equivalent open-coded clone and split)
> to split even further, things get worse.

Thanks for the detailed explanation!

Now I believe the issue really comes from blk_queue_split(), and I will
comment on your patch later.

thanks,
