linux-kernel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* [PATCH] [RESEND] vmscan.c: add a sysctl entry for controlling memory reclaim IO congestion_wait length
@ 2019-09-18  9:51 Lin Feng
  2019-09-18 12:27 ` Michal Hocko
  0 siblings, 1 reply; 4+ messages in thread
From: Lin Feng @ 2019-09-18  9:51 UTC (permalink / raw)
  To: corbet, mcgrof, akpm, linux-kernel, linux-mm
  Cc: keescook, mchehab+samsung, mgorman, vbabka, mhocko, ktkhai,
	hannes, linf, willy, kbuild test robot

This sysctl is named as mm_reclaim_congestion_wait_jiffies, default to
HZ/10 as unchanged to old codes.
It is in jiffies unit and can be set in range between [1, 100], so
refers to CONFIG_HZ before tuning.

In a high-end production environment(all high iops ssds) we found that
CPU iowait spikes a lot as server under memory pressure(a lot of order
2 or 3 pages allocations), but in the meantime IO pressure is nearly
none, await and util% seen by iostat are quite healthy.

In direct and background(kswapd) pages reclaim paths both may fall into
calling msleep(100) or congestion_wait(HZ/10) or wait_iff_congested(HZ/10)
while under IO pressure, and the sleep length is hard-coded and the later
two will introduce 100ms iowait length per time.

So if pages reclaim is relatively active in some circumstances such as high
order pages reappings, it's possible to see a lot of iowait introduced by
congestion_wait(HZ/10) and wait_iff_congested(HZ/10).

The 100ms sleep length is proper if the backing drivers are slow like
traditionnal rotation disks. While if the backing drivers are high-end
storages such as high iops ssds or even faster drivers, the high iowait
inroduced by pages reclaim is really misleading, because the storage IO
utils seen by iostat is quite low, in this case the congestion_wait time
modified to 1ms is likely enough for high-end ssds.

Another benifit is that it's potentially shorter the direct reclaim blocked
time when kernel falls into sync reclaim path, which may improve user
applications response time.

All ssds box is a trend, so introduce this sysctl entry for making a way
to relieving the concerns of system administrators.

Tested:
1. Before this patch:

top - 10:10:40 up 8 days, 16:22,  4 users,  load average: 2.21, 2.15, 2.10
Tasks: 718 total,   5 running, 712 sleeping,   0 stopped,   1 zombie
Cpu0  :  0.3%us,  3.4%sy,  0.0%ni, 95.3%id,  1.0%wa,  0.0%hi,  0.0%si,  0.0%st
Cpu1  :  1.4%us,  1.7%sy,  0.0%ni, 95.2%id,  0.0%wa,  0.0%hi,  1.7%si,  0.0%st
Cpu2  :  4.7%us,  3.3%sy,  0.0%ni, 91.0%id,  0.0%wa,  0.0%hi,  1.0%si,  0.0%st
Cpu3  :  7.0%us,  3.7%sy,  0.0%ni, 87.7%id,  1.0%wa,  0.0%hi,  0.7%si,  0.0%st
Cpu4  :  1.0%us,  2.0%sy,  0.0%ni, 96.3%id,  0.0%wa,  0.0%hi,  0.7%si,  0.0%st
Cpu5  :  1.0%us,  2.0%sy,  0.0%ni,  1.7%id, 95.0%wa,  0.0%hi,  0.3%si,  0.0%st
Cpu6  :  1.0%us,  1.3%sy,  0.0%ni, 97.3%id,  0.0%wa,  0.0%hi,  0.3%si,  0.0%st
Cpu7  :  1.3%us,  1.0%sy,  0.0%ni, 97.7%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Cpu8  :  4.3%us,  1.3%sy,  0.0%ni, 94.3%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Cpu9  :  0.7%us,  0.7%sy,  0.0%ni, 98.3%id,  0.0%wa,  0.0%hi,  0.3%si,  0.0%st
Cpu10 :  0.7%us,  1.0%sy,  0.0%ni, 98.3%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Cpu11 :  1.0%us,  1.0%sy,  0.0%ni, 97.7%id,  0.0%wa,  0.0%hi,  0.3%si,  0.0%st
Cpu12 :  3.0%us,  1.0%sy,  0.0%ni, 95.3%id,  0.3%wa,  0.0%hi,  0.3%si,  0.0%st
Cpu13 :  0.3%us,  1.3%sy,  0.0%ni, 88.6%id,  9.4%wa,  0.0%hi,  0.3%si,  0.0%st
Cpu14 :  3.3%us,  2.3%sy,  0.0%ni, 93.7%id,  0.3%wa,  0.0%hi,  0.3%si,  0.0%st
Cpu15 :  6.4%us,  3.0%sy,  0.0%ni, 90.2%id,  0.0%wa,  0.0%hi,  0.3%si,  0.0%st
Cpu16 :  2.7%us,  1.7%sy,  0.0%ni, 95.3%id,  0.0%wa,  0.0%hi,  0.3%si,  0.0%st
Cpu17 :  1.0%us,  1.7%sy,  0.0%ni, 97.3%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Cpu18 :  1.3%us,  1.0%sy,  0.0%ni, 97.0%id,  0.3%wa,  0.0%hi,  0.3%si,  0.0%st
Cpu19 :  4.3%us,  1.7%sy,  0.0%ni, 86.0%id,  7.7%wa,  0.0%hi,  0.3%si,  0.0%st
Cpu20 :  0.7%us,  1.3%sy,  0.0%ni, 97.7%id,  0.0%wa,  0.0%hi,  0.3%si,  0.0%st
Cpu21 :  0.3%us,  1.7%sy,  0.0%ni, 50.2%id, 47.5%wa,  0.0%hi,  0.3%si,  0.0%st
Cpu22 :  0.7%us,  0.7%sy,  0.0%ni, 98.7%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Cpu23 :  0.7%us,  0.7%sy,  0.0%ni, 98.3%id,  0.0%wa,  0.0%hi,  0.3%si,  0.0%st

2. After this patch and set mm_reclaim_congestion_wait_jiffies to 1:

top - 10:12:19 up 8 days, 16:24,  4 users,  load average: 1.32, 1.93, 2.03
Tasks: 724 total,   2 running, 721 sleeping,   0 stopped,   1 zombie
Cpu0  :  4.4%us,  3.0%sy,  0.0%ni, 90.3%id,  1.3%wa,  0.0%hi,  1.0%si,  0.0%st
Cpu1  :  2.1%us,  1.4%sy,  0.0%ni, 93.5%id,  0.7%wa,  0.0%hi,  2.4%si,  0.0%st
Cpu2  :  2.7%us,  1.0%sy,  0.0%ni, 96.3%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Cpu3  :  1.0%us,  1.0%sy,  0.0%ni, 97.7%id,  0.0%wa,  0.0%hi,  0.3%si,  0.0%st
Cpu4  :  0.7%us,  1.0%sy,  0.0%ni, 97.7%id,  0.3%wa,  0.0%hi,  0.3%si,  0.0%st
Cpu5  :  1.0%us,  0.7%sy,  0.0%ni, 97.7%id,  0.3%wa,  0.0%hi,  0.3%si,  0.0%st
Cpu6  :  1.7%us,  1.0%sy,  0.0%ni, 97.3%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Cpu7  :  2.0%us,  0.7%sy,  0.0%ni, 94.3%id,  2.7%wa,  0.0%hi,  0.3%si,  0.0%st
Cpu8  :  2.0%us,  0.7%sy,  0.0%ni, 97.0%id,  0.0%wa,  0.0%hi,  0.3%si,  0.0%st
Cpu9  :  0.7%us,  1.0%sy,  0.0%ni, 97.7%id,  0.7%wa,  0.0%hi,  0.0%si,  0.0%st
Cpu10 :  0.3%us,  0.3%sy,  0.0%ni, 99.3%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Cpu11 :  0.7%us,  0.3%sy,  0.0%ni, 99.0%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Cpu12 :  0.7%us,  1.0%sy,  0.0%ni, 98.0%id,  0.0%wa,  0.0%hi,  0.3%si,  0.0%st
Cpu13 :  0.0%us,  0.3%sy,  0.0%ni, 99.3%id,  0.0%wa,  0.0%hi,  0.3%si,  0.0%st
Cpu14 :  1.7%us,  0.7%sy,  0.0%ni, 97.3%id,  0.3%wa,  0.0%hi,  0.0%si,  0.0%st
Cpu15 :  4.3%us,  1.0%sy,  0.0%ni, 94.3%id,  0.0%wa,  0.0%hi,  0.3%si,  0.0%st
Cpu16 :  1.7%us,  1.3%sy,  0.0%ni, 96.3%id,  0.0%wa,  0.0%hi,  0.7%si,  0.0%st
Cpu17 :  2.0%us,  1.3%sy,  0.0%ni, 96.3%id,  0.0%wa,  0.0%hi,  0.3%si,  0.0%st
Cpu18 :  0.3%us,  0.3%sy,  0.0%ni, 99.3%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Cpu19 :  1.0%us,  1.0%sy,  0.0%ni, 97.6%id,  0.0%wa,  0.0%hi,  0.3%si,  0.0%st
Cpu20 :  1.3%us,  0.7%sy,  0.0%ni, 97.0%id,  0.7%wa,  0.0%hi,  0.3%si,  0.0%st
Cpu21 :  0.7%us,  0.7%sy,  0.0%ni, 98.3%id,  0.0%wa,  0.0%hi,  0.3%si,  0.0%st
Cpu22 :  1.0%us,  1.0%sy,  0.0%ni, 98.0%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Cpu23 :  0.7%us,  0.3%sy,  0.0%ni, 98.3%id,  0.0%wa,  0.0%hi,  0.7%si,  0.0%st

Chagelog:
V1: Fix a compile error reported by kbuild test robot and a checkpatch
error. Also more detailed the background for the commit log of this patch.

Signed-off-by: Lin Feng <linf@wangsu.com>
Reported-by: kbuild test robot <lkp@intel.com>
---
 Documentation/admin-guide/sysctl/vm.rst | 17 +++++++++++++++++
 kernel/sysctl.c                         | 10 ++++++++++
 mm/vmscan.c                             | 14 +++++++++++---
 3 files changed, 38 insertions(+), 3 deletions(-)

diff --git a/Documentation/admin-guide/sysctl/vm.rst b/Documentation/admin-guide/sysctl/vm.rst
index 64aeee1009ca..fbe9a04583ac 100644
--- a/Documentation/admin-guide/sysctl/vm.rst
+++ b/Documentation/admin-guide/sysctl/vm.rst
@@ -837,6 +837,23 @@ than the high water mark in a zone.
 The default value is 60.
 
 
+mm_reclaim_congestion_wait_jiffies
+==========
+
+This control is used to define how long kernel will wait/sleep while
+system memory is under pressure and memroy reclaim is relatively active.
+Lower values will decrease the kernel wait/sleep time.
+
+It's suggested to lower this value on high-end box that system is under memory
+pressure but with low storage IO utils and high CPU iowait, which could also
+potentially decrease user application response time in this case.
+
+Keep this control as it were if your box is not above case.
+
+The default value is HZ/10, which is of equal value to 100ms independ of how
+many HZ is defined.
+
+
 unprivileged_userfaultfd
 ========================
 
diff --git a/kernel/sysctl.c b/kernel/sysctl.c
index 078950d9605b..5d02f137bdf9 100644
--- a/kernel/sysctl.c
+++ b/kernel/sysctl.c
@@ -114,6 +114,7 @@ extern int pid_max;
 extern int pid_max_min, pid_max_max;
 extern int percpu_pagelist_fraction;
 extern int latencytop_enabled;
+extern int mm_reclaim_congestion_wait_jiffies;
 extern unsigned int sysctl_nr_open_min, sysctl_nr_open_max;
 #ifndef CONFIG_MMU
 extern int sysctl_nr_trim_pages;
@@ -1413,6 +1414,15 @@ static struct ctl_table vm_table[] = {
 		.extra1		= SYSCTL_ZERO,
 		.extra2		= &one_hundred,
 	},
+	{
+		.procname	= "mm_reclaim_congestion_wait_jiffies",
+		.data		= &mm_reclaim_congestion_wait_jiffies,
+		.maxlen		= sizeof(mm_reclaim_congestion_wait_jiffies),
+		.mode		= 0644,
+		.proc_handler	= proc_dointvec_minmax,
+		.extra1		= SYSCTL_ONE,
+		.extra2		= &one_hundred,
+	},
 #ifdef CONFIG_HUGETLB_PAGE
 	{
 		.procname	= "nr_hugepages",
diff --git a/mm/vmscan.c b/mm/vmscan.c
index a6c5d0b28321..a1eab811772a 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -165,6 +165,12 @@ struct scan_control {
  * From 0 .. 100.  Higher means more swappy.
  */
 int vm_swappiness = 60;
+
+/*
+ * From 0 .. 100.  Lower means shorter memory reclaim IO congestion wait time.
+ */
+int mm_reclaim_congestion_wait_jiffies = HZ / 10;
+
 /*
  * The total number of pages which are beyond the high watermark within all
  * zones.
@@ -1966,7 +1972,7 @@ shrink_inactive_list(unsigned long nr_to_scan, struct lruvec *lruvec,
 			return 0;
 
 		/* wait a bit for the reclaimer. */
-		msleep(100);
+		msleep(jiffies_to_msecs(mm_reclaim_congestion_wait_jiffies));
 		stalled = true;
 
 		/* We are about to die and free our memory. Return now. */
@@ -2788,7 +2794,8 @@ static bool shrink_node(pg_data_t *pgdat, struct scan_control *sc)
 			 * faster than they are written so also forcibly stall.
 			 */
 			if (sc->nr.immediate)
-				congestion_wait(BLK_RW_ASYNC, HZ/10);
+				congestion_wait(BLK_RW_ASYNC,
+					mm_reclaim_congestion_wait_jiffies);
 		}
 
 		/*
@@ -2807,7 +2814,8 @@ static bool shrink_node(pg_data_t *pgdat, struct scan_control *sc)
 		 */
 		if (!sc->hibernation_mode && !current_is_kswapd() &&
 		   current_may_throttle() && pgdat_memcg_congested(pgdat, root))
-			wait_iff_congested(BLK_RW_ASYNC, HZ/10);
+			wait_iff_congested(BLK_RW_ASYNC,
+				mm_reclaim_congestion_wait_jiffies);
 
 	} while (should_continue_reclaim(pgdat, sc->nr_reclaimed - nr_reclaimed,
 					 sc->nr_scanned - nr_scanned, sc));
-- 
2.20.1


^ permalink raw reply related	[flat|nested] 4+ messages in thread

* Re: [PATCH] [RESEND] vmscan.c: add a sysctl entry for controlling memory reclaim IO congestion_wait length
  2019-09-18  9:51 [PATCH] [RESEND] vmscan.c: add a sysctl entry for controlling memory reclaim IO congestion_wait length Lin Feng
@ 2019-09-18 12:27 ` Michal Hocko
  2019-09-19  1:32   ` Lin Feng
  0 siblings, 1 reply; 4+ messages in thread
From: Michal Hocko @ 2019-09-18 12:27 UTC (permalink / raw)
  To: Lin Feng
  Cc: corbet, mcgrof, akpm, linux-kernel, linux-mm, keescook,
	mchehab+samsung, mgorman, vbabka, ktkhai, hannes, willy,
	kbuild test robot

Please do not post a new version with a minor compile fixes until there
is a general agreement on the approach. Willy had comments which really
need to be resolved first. And I do agree with him. Having an explicit
tunable seems just wrong.

Also does this
[...]
> Reported-by: kbuild test robot <lkp@intel.com>

really hold? Because it suggests that the problem has been spotted by
the kbuild bot which is kinda unexpected... I suspect you have just
added that for the minor compilation issue that you have fixed since the
last version.
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 4+ messages in thread

* Re: [PATCH] [RESEND] vmscan.c: add a sysctl entry for controlling memory reclaim IO congestion_wait length
  2019-09-18 12:27 ` Michal Hocko
@ 2019-09-19  1:32   ` Lin Feng
  2019-09-19  8:23     ` Michal Hocko
  0 siblings, 1 reply; 4+ messages in thread
From: Lin Feng @ 2019-09-19  1:32 UTC (permalink / raw)
  To: Michal Hocko
  Cc: corbet, mcgrof, akpm, linux-kernel, linux-mm, keescook,
	mchehab+samsung, mgorman, vbabka, ktkhai, hannes, willy,
	kbuild test robot



On 9/18/19 20:27, Michal Hocko wrote:
> Please do not post a new version with a minor compile fixes until there
> is a general agreement on the approach. Willy had comments which really
> need to be resolved first. 

Sorry, but thanks for pointing out.

> 
> Also does this
> [...]
>> Reported-by: kbuild test robot<lkp@intel.com>
> really hold? Because it suggests that the problem has been spotted by
> the kbuild bot which is kinda unexpected... I suspect you have just
> added that for the minor compilation issue that you have fixed since the
> last version.

Yes, I do know the issue is not reported by the robot, but
just followed the kbuild robot tip, this Reported-by suggested by kbuild robot
seems a little misleading, I'm not sure if it has other meanings.
'If you fix the issue, kindly add following tag
Reported-by: kbuild test robot <lkp@intel.com>'


^ permalink raw reply	[flat|nested] 4+ messages in thread

* Re: [PATCH] [RESEND] vmscan.c: add a sysctl entry for controlling memory reclaim IO congestion_wait length
  2019-09-19  1:32   ` Lin Feng
@ 2019-09-19  8:23     ` Michal Hocko
  0 siblings, 0 replies; 4+ messages in thread
From: Michal Hocko @ 2019-09-19  8:23 UTC (permalink / raw)
  To: Lin Feng
  Cc: corbet, mcgrof, akpm, linux-kernel, linux-mm, keescook,
	mchehab+samsung, mgorman, vbabka, ktkhai, hannes, willy,
	kbuild test robot

On Thu 19-09-19 09:32:48, Lin Feng wrote:
> 
> 
> On 9/18/19 20:27, Michal Hocko wrote:
> > Please do not post a new version with a minor compile fixes until there
> > is a general agreement on the approach. Willy had comments which really
> > need to be resolved first.
> 
> Sorry, but thanks for pointing out.
> 
> > 
> > Also does this
> > [...]
> > > Reported-by: kbuild test robot<lkp@intel.com>
> > really hold? Because it suggests that the problem has been spotted by
> > the kbuild bot which is kinda unexpected... I suspect you have just
> > added that for the minor compilation issue that you have fixed since the
> > last version.
> 
> Yes, I do know the issue is not reported by the robot, but
> just followed the kbuild robot tip, this Reported-by suggested by kbuild robot
> seems a little misleading, I'm not sure if it has other meanings.
> 'If you fix the issue, kindly add following tag
> Reported-by: kbuild test robot <lkp@intel.com>'

This would be normally the case for a patch which only fixes the
particular issue. You can credit the bot in the changelog while
documenting changes between version.

-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 4+ messages in thread

end of thread, other threads:[~2019-09-19  8:23 UTC | newest]

Thread overview: 4+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2019-09-18  9:51 [PATCH] [RESEND] vmscan.c: add a sysctl entry for controlling memory reclaim IO congestion_wait length Lin Feng
2019-09-18 12:27 ` Michal Hocko
2019-09-19  1:32   ` Lin Feng
2019-09-19  8:23     ` Michal Hocko

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).