From: Jens Axboe <>
To: Matthew Wilcox <>
Cc: Lin Feng <>, Michal Hocko <>,
	Omar Sandoval <>, Ming Lei <>
Subject: Re: Is congestion broken?
Date: Mon, 23 Sep 2019 13:51:21 -0600	[thread overview]
Message-ID: <> (raw)
In-Reply-To: <>

On 9/23/19 1:45 PM, Matthew Wilcox wrote:
> On Mon, Sep 23, 2019 at 01:38:23PM -0600, Jens Axboe wrote:
>> On 9/23/19 5:19 AM, Matthew Wilcox wrote:
>>> Ping Jens?
>>> On Wed, Sep 18, 2019 at 08:49:49PM -0700, Matthew Wilcox wrote:
>>>> On Thu, Sep 19, 2019 at 10:33:10AM +0800, Lin Feng wrote:
>>>>> On 9/18/19 20:33, Michal Hocko wrote:
>>>>>> I absolutely agree here. From you changelog it is also not clear what is
>>>>>> the underlying problem. Both congestion_wait and wait_iff_congested
>>>>>> should wake up early if the congestion is handled. Is this not the case?
>>>>> For now I don't know why; the code looks like it should work as you said,
>>>>> so maybe I need to trace more of the internals.
>>>>> But the weird thing is that once I set the (people-disliked) tunable,
>>>>> iowait drops down instantly, which contradicts the code design.
>>>> Yes, this is quite strange.  If setting a smaller timeout makes a
>>>> difference, that indicates we're not waking up soon enough.  I see
>>>> two possibilities; one is that a wakeup is missing somewhere -- ie the
>>>> conditions under which we call clear_wb_congested() are wrong.  Or we
>>>> need to wake up sooner.
>>>> Umm.  We have clear_wb_congested() called from exactly one spot --
>>>> clear_bdi_congested().  That is only called from:
>>>> drivers/block/pktcdvd.c
>>>> fs/ceph/addr.c
>>>> fs/fuse/control.c
>>>> fs/fuse/dev.c
>>>> fs/nfs/write.c
>>>> Jens, is something supposed to be calling clear_bdi_congested() in the
>>>> block layer?  blk_clear_congested() used to exist until October 29th
>>>> last year.  Or is something else supposed to be waking up tasks that
>>>> are sleeping on congestion?
>> Congestion isn't there anymore. It was always broken as a concept imho,
>> since it was inherently racy. We used the old batching mechanism in the
>> legacy stack to signal it, and it only worked for some devices.
> Umm.  OK.  Well, something that used to work is now broken.  So how

It didn't really...

> should we fix it?  Take a look at shrink_node() in mm/vmscan.c.  If we've
> submitted a lot of writes to a device, and overloaded it, we want to
> sleep until it's able to take more writes:
>                  /*
>                   * Stall direct reclaim for IO completions if underlying BDIs
>                   * and node is congested. Allow kswapd to continue until it
>                   * starts encountering unqueued dirty pages or cycling through
>                   * the LRU too quickly.
>                   */
>                  if (!sc->hibernation_mode && !current_is_kswapd() &&
>                     current_may_throttle() && pgdat_memcg_congested(pgdat, root))
>                          wait_iff_congested(BLK_RW_ASYNC, HZ/10);
> With a standard block device, that now sleeps until the timeout (100ms)
> expires, which is far too long for a modern SSD but is probably tuned
> just right for some legacy piece of spinning rust (or indeed a modern
> USB stick).  How would the block layer like to indicate to the mm layer
> "I am too busy, please let the device work for a bit"?

Maybe base the sleep on the bdi write speed? We can't feasibly tell you
if something is congested. It used to sort of work on things like sata
drives, since we'd get congested when we hit the queue limit and that
wasn't THAT far off from reality. Didn't work on SCSI with higher queue
depths, and certainly doesn't work on NVMe where most devices have very
deep queues.

Or we can have something that does "sleep until X requests/MB have been
flushed", something that the vm would actively call. Combined with a
timeout as well, probably.

For the vm case above, it's further complicated by it being global
state. I think you'd be better off just making the delay smaller.  100ms
is an eternity, and 10ms wakeups isn't going to cause any major issues
in terms of CPU usage. If we're calling the above wait_iff_congested(),
it better be because we're otherwise SOL.

Jens Axboe


Thread overview: 15+ messages
2019-09-17 11:58 [PATCH] [RFC] vmscan.c: add a sysctl entry for controlling memory reclaim IO congestion_wait length Lin Feng
2019-09-17 12:06 ` Matthew Wilcox
2019-09-18  3:21   ` Lin Feng
2019-09-18 11:38     ` Matthew Wilcox
2019-09-19  2:20       ` Lin Feng
2019-09-18 12:33   ` Michal Hocko
2019-09-19  2:33     ` Lin Feng
2019-09-19  3:49       ` Matthew Wilcox
2019-09-19  7:46         ` Lin Feng
2019-09-19  8:22           ` Michal Hocko
2019-09-23 11:19         ` Is congestion broken? Matthew Wilcox
2019-09-23 19:38           ` Jens Axboe
2019-09-23 19:45             ` Matthew Wilcox
2019-09-23 19:51               ` Jens Axboe [this message]
2019-09-24 12:16                 ` Michal Hocko
