linux-mm.kvack.org archive mirror
From: Lin Feng <linf@wangsu.com>
To: Matthew Wilcox <willy@infradead.org>
Cc: corbet@lwn.net, mcgrof@kernel.org, akpm@linux-foundation.org,
	linux-kernel@vger.kernel.org, linux-mm@kvack.org,
	keescook@chromium.org, mchehab+samsung@kernel.org,
	mgorman@techsingularity.net, vbabka@suse.cz, mhocko@suse.com,
	ktkhai@virtuozzo.com, hannes@cmpxchg.org
Subject: Re: [PATCH] [RFC] vmscan.c: add a sysctl entry for controlling memory reclaim IO congestion_wait length
Date: Wed, 18 Sep 2019 11:21:04 +0800	[thread overview]
Message-ID: <3fbb428e-9466-b56b-0be8-c0f510e3aa99@wangsu.com>
In-Reply-To: <20190917120646.GT29434@bombadil.infradead.org>



On 9/17/19 20:06, Matthew Wilcox wrote:
> On Tue, Sep 17, 2019 at 07:58:24PM +0800, Lin Feng wrote:
>> Both the direct and background (kswapd) page reclaim paths may fall into
>> calling msleep(100), congestion_wait(HZ/10) or wait_iff_congested(HZ/10)
>> while under IO pressure.  The sleep length is hard-coded, and the latter
>> two introduce 100ms of iowait each time.
>>
>> So if page reclaim is relatively active in some circumstances, such as
>> high-order page reaping, it is possible to see a lot of iowait introduced
>> by congestion_wait(HZ/10) and wait_iff_congested(HZ/10).
>>
>> The 100ms sleep length is appropriate if the backing devices are slow,
>> like traditional rotational disks.  But if the backing devices are
>> high-end storage such as high-IOPS SSDs or even faster devices, the high
>> iowait introduced by page reclaim is really misleading, because the
>> storage IO utilization seen by iostat is quite low; in that case lowering
>> the congestion_wait time to 1ms is likely enough for high-end SSDs.
>>
>> Another benefit is that it potentially shortens the time direct reclaim
>> is blocked when the kernel falls into the synchronous reclaim path, which
>> may improve user applications' response time.
> 
> This is a great description of the problem.
The fixed 100ms blocking time is sometimes simply unnecessary :)
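
For reference, the pattern in question looks roughly like the sketch below.
This is a simplified illustration of the reclaim back-off, not the exact
upstream code in mm/vmscan.c; the helper name and its parameters are made
up for this example.

#include <linux/delay.h>
#include <linux/backing-dev.h>

/*
 * Sketch: when reclaim makes little progress or sees writeback
 * congestion, it backs off for a fixed HZ/10, i.e. ~100ms, no matter
 * how fast the backing device really is.
 */
static void reclaim_backoff_sketch(bool too_many_isolated, bool congested)
{
	if (too_many_isolated)
		msleep(100);				/* direct reclaim back-off */
	else if (congested)
		congestion_wait(BLK_RW_ASYNC, HZ / 10);	/* always ~100ms */
	else
		wait_iff_congested(BLK_RW_ASYNC, HZ / 10);
}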

> 
>> +mm_reclaim_congestion_wait_jiffies
>> +==================================
>> +
>> +This control defines how long the kernel will wait/sleep while system
>> +memory is under pressure and memory reclaim is relatively active.
>> +Lower values decrease the kernel wait/sleep time.
>> +
>> +It is suggested to lower this value on high-end boxes where the system
>> +is under memory pressure but storage IO utilization is low and CPU
>> +iowait is high; this can also potentially decrease user application
>> +response time.
>> +
>> +Keep this control at its default if your box does not match the case
>> +above.
>> +
>> +The default value is HZ/10, which equals 100ms independent of how HZ is
>> +defined.
> 
> Adding a new tunable is not the right solution.  The right way is
> to make Linux auto-tune itself to avoid the problem.  For example,
> bdi_writeback contains an estimated write bandwidth (calculated by the
> memory management layer).  Given that, we should be able to make an
> estimate for how long to wait for the queues to drain.
> 

Yes, I had considered that; auto-tuning is definitely the smarter way.
But all kinds of production environments have to be considered, and hybrid
storage setups are common today: the bdi devices backing a server's dirty
pages can span from high-end SSDs to low-end SATA disks.  So we would have
to come up with a *formula* built on factors such as the amount of dirty
pages and the bdis' write bandwidth (a rough sketch of what that might look
like is below).  Such a formula depends on the estimated write bandwidth
being sane, and moreover on whether the dirty pages to be written back are
sequential or random when the bdi is a rotational disk.  It is likely to
produce an insane number and hurt people who don't want that, whereas
considering only SSDs would be relatively simple.
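
To make that concrete, a naive version of such a formula might look like
the sketch below.  This is purely hypothetical code: the function name and
the nr_dirty argument are invented for illustration, and it assumes
wb->avg_write_bandwidth is a sane estimate in pages per second, which is
exactly the part I doubt for hybrid or rotational setups.

#include <linux/backing-dev-defs.h>
#include <linux/kernel.h>

/*
 * Hypothetical auto-tuned wait: the time (in jiffies) needed to write
 * nr_dirty pages at the bdi's estimated bandwidth, capped at the
 * current 100ms.
 */
static unsigned long reclaim_wait_sketch(struct bdi_writeback *wb,
					 unsigned long nr_dirty)
{
	unsigned long bw = READ_ONCE(wb->avg_write_bandwidth);

	if (!bw)
		return HZ / 10;	/* no estimate yet: keep today's behaviour */

	/* nr_dirty pages at bw pages/sec, clamped to [1 jiffy, 100ms] */
	return clamp(nr_dirty * HZ / bw, 1UL, (unsigned long)(HZ / 10));
}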

So IMHO it is not sane to brute-force a guessing logic into the memory
writeback code and pray that we invent a formula that caters to everyone's
needs.  Adding a sysctl entry may be the right choice: it gives a knob to
people who need it and doesn't hurt people who don't want it.
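
For clarity, the knob proposed in this RFC would be wired up roughly as in
the sketch below (not the actual patch; the table placement, variable name
and handler in the real patch may differ):

#include <linux/sysctl.h>
#include <linux/backing-dev.h>

/* default keeps today's behaviour: HZ/10 == 100ms */
unsigned long mm_reclaim_congestion_wait_jiffies = HZ / 10;

static struct ctl_table reclaim_wait_table[] = {
	{
		.procname	= "mm_reclaim_congestion_wait_jiffies",
		.data		= &mm_reclaim_congestion_wait_jiffies,
		.maxlen		= sizeof(unsigned long),
		.mode		= 0644,
		.proc_handler	= proc_doulongvec_minmax,
	},
	{ }
};

/*
 * The reclaim paths then use the tunable instead of the HZ/10 constant,
 * e.g.:  congestion_wait(BLK_RW_ASYNC, mm_reclaim_congestion_wait_jiffies);
 */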

thanks,
linfeng



Thread overview: 15+ messages
2019-09-17 11:58 [PATCH] [RFC] vmscan.c: add a sysctl entry for controlling memory reclaim IO congestion_wait length Lin Feng
2019-09-17 12:06 ` Matthew Wilcox
2019-09-18  3:21   ` Lin Feng [this message]
2019-09-18 11:38     ` Matthew Wilcox
2019-09-19  2:20       ` Lin Feng
2019-09-18 12:33   ` Michal Hocko
2019-09-19  2:33     ` Lin Feng
2019-09-19  3:49       ` Matthew Wilcox
2019-09-19  7:46         ` Lin Feng
2019-09-19  8:22           ` Michal Hocko
2019-09-23 11:19         ` Is congestion broken? Matthew Wilcox
2019-09-23 19:38           ` Jens Axboe
2019-09-23 19:45             ` Matthew Wilcox
2019-09-23 19:51               ` Jens Axboe
2019-09-24 12:16                 ` Michal Hocko
