From: Ming Lei <ming.lei@redhat.com>
To: Jens Axboe <axboe@kernel.dk>
Cc: Bart Van Assche <Bart.VanAssche@wdc.com>,
	"snitzer@redhat.com" <snitzer@redhat.com>,
	"dm-devel@redhat.com" <dm-devel@redhat.com>,
	"hch@infradead.org" <hch@infradead.org>,
	"linux-kernel@vger.kernel.org" <linux-kernel@vger.kernel.org>,
	"linux-block@vger.kernel.org" <linux-block@vger.kernel.org>,
	"osandov@fb.com" <osandov@fb.com>
Subject: Re: [RFC PATCH] blk-mq: fixup RESTART when queue becomes idle
Date: Fri, 19 Jan 2018 23:40:47 +0800	[thread overview]
Message-ID: <20180119154047.GB14827@ming.t460p> (raw)
In-Reply-To: <047f68ec-f51b-190f-2f89-f413325c2540@kernel.dk>

On Fri, Jan 19, 2018 at 08:24:06AM -0700, Jens Axboe wrote:
> On 1/19/18 12:26 AM, Ming Lei wrote:
> > On Thu, Jan 18, 2018 at 09:02:45PM -0700, Jens Axboe wrote:
> >> On 1/18/18 7:32 PM, Ming Lei wrote:
> >>> On Thu, Jan 18, 2018 at 01:11:01PM -0700, Jens Axboe wrote:
> >>>> On 1/18/18 11:47 AM, Bart Van Assche wrote:
> >>>>>> This is all very tiresome.
> >>>>>
> >>>>> Yes, this is tiresome. It is very annoying to me that others keep
> >>>>> introducing so many regressions in such important parts of the kernel.
> >>>>> It is also annoying to me that I get blamed if I report a regression
> >>>>> instead of seeing that the regression gets fixed.
> >>>>
> >>>> I agree, it sucks that any change there introduces the regression. I'm
> >>>> fine with doing the delay insert again until a new patch is proven to be
> >>>> better.
> >>>
> >>> That way is still buggy, as I explained, since rerunning the queue before
> >>> adding the request to hctx->dispatch_list isn't correct. Who can guarantee
> >>> the request is visible when __blk_mq_run_hw_queue() is called?
> >>
> >> That race basically doesn't exist for a 10ms gap.
> >>
> >>> Not to mention that this way will cause a performance regression again.
> >>
> >> How so? It's _exactly_ the same as what you are proposing, except mine
> >> will potentially run the queue when it need not do so. But given that
> >> these are random 10ms queue kicks because we are screwed, it should not
> >> matter. The key point is that it should only be done if we have NO
> >> better options. If it's a frequently occurring event that we have to return
> >> BLK_STS_RESOURCE, then we need to get a way to register an event for
> >> when that condition clears. That event will then kick the necessary
> >> queue(s).
> > 
> > Please see queue_delayed_work_on(): hctx->run_work is shared by all
> > scheduling, so once blk_mq_delay_run_hw_queue(100ms) returns, no new
> > scheduling can make progress during those 100ms.
> 
> That's a bug, plain and simple. If someone does "run this queue in
> 100ms" and someone else comes in and says "run this queue now", the
> correct outcome is running this queue now.
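[The semantic difference at issue here is visible in the workqueue API itself: queue_delayed_work_on() is a no-op when the work item is already pending, so an earlier 100ms kick swallows a later "run now", while mod_delayed_work_on() re-arms the pending timer with the new delay. A rough userspace model of the two behaviors — the struct and function names below are illustrative stand-ins, not the kernel API:]

```c
#include <stdbool.h>

/* Minimal model of a delayed work item: one shared timer per hctx. */
struct delayed_work_model {
	bool pending;
	int delay_ms;	/* when it will fire, relative to now */
};

/*
 * queue_delayed_work_on() semantics: a no-op if the work is already
 * pending, so an earlier "run in 100ms" blocks a later "run now".
 */
static bool queue_style(struct delayed_work_model *w, int delay_ms)
{
	if (w->pending)
		return false;	/* existing timer (and its delay) wins */
	w->pending = true;
	w->delay_ms = delay_ms;
	return true;
}

/*
 * mod_delayed_work_on() semantics: always re-arm with the new delay,
 * so "run now" (delay 0) overrides a pending "run in 100ms".
 */
static bool mod_style(struct delayed_work_model *w, int delay_ms)
{
	bool was_idle = !w->pending;

	w->pending = true;
	w->delay_ms = delay_ms;
	return was_idle;
}
```

[With queue_style(), a second caller asking for delay 0 leaves the 100ms timer in place; with mod_style(), the timer is re-armed to fire immediately — which is the "correct outcome" above.]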
> 
> >>>> From the original topic of this email, we have conditions that can cause
> >>>> the driver to not be able to submit an IO. A set of those conditions can
> >>>> only happen if IO is in flight, and those cases we have covered just
> >>>> fine. Another set can potentially trigger without IO being in flight.
> >>>> These are cases where a non-device resource is unavailable at the time
> >>>> of submission. This might be iommu running out of space, for instance,
> >>>> or it might be a memory allocation of some sort. For these cases, we
> >>>> don't get any notification when the shortage clears. All we can do is
> >>>> ensure that we restart operations at some point in the future. We're SOL
> >>>> at that point, but we have to ensure that we make forward progress.
> >>>
> >>> Right, it is a generic issue, not a DM-specific one; almost all drivers
> >>> call kmalloc(GFP_ATOMIC) in the IO path.
> >>
> >> GFP_ATOMIC basically never fails, unless we are out of memory. The
> > 
> > I guess GFP_KERNEL may never fail, but GFP_ATOMIC failure is
> > possible; it is mentioned[1] that there is code like the following in
> > the mm allocation path, and OOM can happen too.
> > 
> >   if (some randomly generated condition) && (request is atomic)
> >       return NULL;
> > 
> > [1] https://lwn.net/Articles/276731/
> 
> That article is 10 years old. Once you run large scale production, you
> see what the real failures are. Fact is, for zero order allocation, if
> the atomic alloc fails the shit has really hit the fan. In that case, a
> delay of 10ms is not your main issue. It's a total red herring when you
> compare to the frequency of what Bart is seeing. It's noise, and
> irrelevant here. For an atomic zero order allocation failure, doing a
> short random sleep is perfectly fine.
> 
> >> exception is higher order allocations. If a driver has a higher order
> >> atomic allocation in its IO path, the device driver writer needs to be
> >> taken out behind the barn and shot. Simple as that. It will NEVER work
> >> well in a production environment. Witness the disaster that so many NIC
> >> driver writers have learned.
> >>
> >> This is NOT the case we care about here. It's resources that are more
> >> readily depleted because other devices are using them. If it's a high
> >> frequency or generally occurring event, then we simply must have a
> >> callback to restart the queue from that. The condition then becomes
> >> identical to device private starvation, the only difference being from
> >> where we restart the queue.
> >>
> >>> IMO, there is enough time to figure out a generic solution before the
> >>> 4.16 release.
> >>
> >> I would hope so, but the proposed solutions have not filled me with
> >> a lot of confidence in the end result so far.
> >>
> >>>> That last set of conditions better not be a common occurrence, since
> >>>> performance is down the toilet at that point. I don't want to introduce
> >>>> hot path code to rectify it. Have the driver return if that happens in a
> >>>> way that is DIFFERENT from needing a normal restart. The driver knows if
> >>>> this is a resource that will become available when IO completes on this
> >>>> device or not. If we get that return, we have a generic run-again delay.
> >>>
> >>> Nowadays, most of the time, neither NVMe nor SCSI returns BLK_STS_RESOURCE,
> >>> and DM should be the only driver returning STS_RESOURCE so often.
> >>
> >> Where does the dm STS_RESOURCE error usually come from - what exact
> >> resource are we running out of?
> > 
> > It is from blk_get_request(underlying queue), see
> > multipath_clone_and_map().
> 
> That's what I thought. So for a low queue depth underlying queue, it's
> quite possible that this situation can happen. Two potential solutions
> I see:
> 
> 1) As described earlier in this thread, having a mechanism for being
>    notified when the scarce resource becomes available. It would not
>    be hard to tap into the existing sbitmap wait queue for that.
> 
> 2) Have dm set BLK_MQ_F_BLOCKING and just sleep on the resource
>    allocation. I haven't read the dm code to know if this is a
>    possibility or not.
> 
> I'd probably prefer #1. It's a classic case of trying to get the
> request, and if it fails, add ourselves to the sbitmap tag wait
> queue head, retry, and bail if that also fails. Connecting the
> scarce resource and the consumer is the only way to really fix
> this, without bogus arbitrary delays.
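[Option #1's try/register/retry/bail dance can be sketched in miniature. Everything below — the struct names, the wait-queue helpers, clone_and_map() itself — is illustrative scaffolding under the assumptions of this thread, not the actual block layer or sbitmap API:]

```c
#include <stdbool.h>
#include <stddef.h>

/* Illustrative stand-ins for the real objects (hypothetical names). */
struct request { int tag; };
struct waitqueue { int nr_waiters; };

struct scarce_queue {
	int free_tags;		/* models a low queue depth */
	struct waitqueue wq;
};

static struct request dummy_rq;

static struct request *try_get_request(struct scarce_queue *q)
{
	if (q->free_tags <= 0)
		return NULL;
	q->free_tags--;
	return &dummy_rq;
}

static void add_wait(struct waitqueue *wq) { wq->nr_waiters++; }
static void del_wait(struct waitqueue *wq) { wq->nr_waiters--; }

enum map_status { STS_OK, STS_RESOURCE };

/*
 * The pattern described above: try to get the request; on failure,
 * register on the wait queue and retry once (a tag may have been freed
 * between the first attempt and registering); only then report resource
 * exhaustion. A later free wakes the waiter and re-runs the queue, so
 * no arbitrary delay is needed.
 */
static enum map_status clone_and_map(struct scarce_queue *q,
				     struct request **rq)
{
	*rq = try_get_request(q);
	if (*rq)
		return STS_OK;

	add_wait(&q->wq);		/* ensure a future free wakes us */
	*rq = try_get_request(q);	/* close the race window */
	if (*rq) {
		del_wait(&q->wq);
		return STS_OK;
	}
	return STS_RESOURCE;		/* stay queued; wakeup restarts us */
}
```

[The second try_get_request() is the important part: without it, a tag freed after the first failed attempt but before add_wait() would never generate a wakeup for this waiter.]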

Right, as I have replied to Bart, using mod_delayed_work_on() together with
returning BLK_STS_NO_DEV_RESOURCE (or some such name) for the scarce
resource should fix this issue.

-- 
Ming
