Re: Is congestion broken?

From: Jens Axboe <axboe@kernel.dk>
To: Matthew Wilcox <willy@infradead.org>
Cc: Lin Feng <linf@wangsu.com>, Michal Hocko <mhocko@kernel.org>,
	corbet@lwn.net, mcgrof@kernel.org, akpm@linux-foundation.org,
	linux-kernel@vger.kernel.org, linux-mm@kvack.org,
	keescook@chromium.org, mchehab+samsung@kernel.org,
	mgorman@techsingularity.net, vbabka@suse.cz,
	ktkhai@virtuozzo.com, hannes@cmpxchg.org,
	Omar Sandoval <osandov@fb.com>, Ming Lei <ming.lei@redhat.com>
Subject: Re: Is congestion broken?
Date: Mon, 23 Sep 2019 13:51:21 -0600	[thread overview]
Message-ID: <ce7975cd-6353-3f29-b52c-7a81b1d07caa@kernel.dk> (raw)
In-Reply-To: <20190923194509.GC1855@bombadil.infradead.org>

On 9/23/19 1:45 PM, Matthew Wilcox wrote:
> On Mon, Sep 23, 2019 at 01:38:23PM -0600, Jens Axboe wrote:
>> On 9/23/19 5:19 AM, Matthew Wilcox wrote:
>>>
>>> Ping Jens?
>>>
>>> On Wed, Sep 18, 2019 at 08:49:49PM -0700, Matthew Wilcox wrote:
>>>> On Thu, Sep 19, 2019 at 10:33:10AM +0800, Lin Feng wrote:
>>>>> On 9/18/19 20:33, Michal Hocko wrote:
>>>>>> I absolutely agree here. From you changelog it is also not clear what is
>>>>>> the underlying problem. Both congestion_wait and wait_iff_congested
>>>>>> should wake up early if the congestion is handled. Is this not the case?
>>>>>
>>>>> For now I don't know why, codes seem should work as you said, maybe I need to
>>>>> trace more of the internals.
>>>>> But weird thing is that once I set the people-disliked-tunable iowait
>>>>> drop down instantly, this is contradictory to the code design.
>>>>
>>>> Yes, this is quite strange.  If setting a smaller timeout makes a
>>>> difference, that indicates we're not waking up soon enough.  I see
>>>> two possibilities; one is that a wakeup is missing somewhere -- ie the
>>>> conditions under which we call clear_wb_congested() are wrong.  Or we
>>>> need to wake up sooner.
>>>>
>>>> Umm.  We have clear_wb_congested() called from exactly one spot --
>>>> clear_bdi_congested().  That is only called from:
>>>>
>>>> drivers/block/pktcdvd.c
>>>> fs/ceph/addr.c
>>>> fs/fuse/control.c
>>>> fs/fuse/dev.c
>>>> fs/nfs/write.c
>>>>
>>>> Jens, is something supposed to be calling clear_bdi_congested() in the
>>>> block layer?  blk_clear_congested() used to exist until October 29th
>>>> last year.  Or is something else supposed to be waking up tasks that
>>>> are sleeping on congestion?
>>
>> Congestion isn't there anymore. It was always broken as a concept imho,
>> since it was inherently racy. We used the old batching mechanism in the
>> legacy stack to signal it, and it only worked for some devices.
> 
> Umm.  OK.  Well, something that used to work is now broken.  So how

It didn't really...

> should we fix it?  Take a look at shrink_node() in mm/vmscan.c.  If we've
> submitted a lot of writes to a device, and overloaded it, we want to
> sleep until it's able to take more writes:
> 
>                  /*
>                   * Stall direct reclaim for IO completions if underlying BDIs
>                   * and node is congested. Allow kswapd to continue until it
>                   * starts encountering unqueued dirty pages or cycling through
>                   * the LRU too quickly.
>                   */
>                  if (!sc->hibernation_mode && !current_is_kswapd() &&
>                     current_may_throttle() && pgdat_memcg_congested(pgdat, root))
>                          wait_iff_congested(BLK_RW_ASYNC, HZ/10);
> 
> With a standard block device, that now sleeps until the timeout (100ms)
> expires, which is far too long for a modern SSD but is probably tuned
> just right for some legacy piece of spinning rust (or indeed a modern
> USB stick).  How would the block layer like to indicate to the mm layer
> "I am too busy, please let the device work for a bit"?

Maybe base the sleep on the bdi write speed? We can't feasibly tell you
if something is congested. It used to sort of work on things like sata
drives, since we'd get congested when we hit the queue limit and that
wasn't THAT far off with reality. Didn't work on SCSI with higher queue
depths, and certainly doesn't work on NVMe where most devices have very
deep queues.

Or we can have something that does "sleep until X requests/MB have been
flushed", something that the vm would actively call. Combined with a
timeout as well, probably.

For the vm case above, it's further complicated by it being global
state. I think you'd be better off just making the delay smaller.  100ms
is an eternity, and 10ms wakeups isn't going to cause any major issues
in terms of CPU usage. If we're calling the above wait_iff_congested(),
it better because we're otherwise SOL.

-- 
Jens Axboe