From: Aaron Lu <aaron.lu@intel.com>
To: Tariq Toukan <tariqt@mellanox.com>
Cc: Jesper Dangaard Brouer <brouer@redhat.com>,
	David Miller <davem@davemloft.net>,
	Mel Gorman <mgorman@techsingularity.net>,
	Eric Dumazet <eric.dumazet@gmail.com>,
	Alexei Starovoitov <ast@fb.com>,
	Saeed Mahameed <saeedm@mellanox.com>,
	Eran Ben Elisha <eranbe@mellanox.com>,
	Linux Kernel Network Developers <netdev@vger.kernel.org>,
	Andrew Morton <akpm@linux-foundation.org>,
	Michal Hocko <mhocko@suse.com>, linux-mm <linux-mm@kvack.org>,
	Dave Hansen <dave.hansen@intel.com>
Subject: Re: Page allocator bottleneck
Date: Tue, 19 Sep 2017 15:23:43 +0800	[thread overview]
Message-ID: <20170919072342.GB7263@intel.com> (raw)
In-Reply-To: <082e7901-7842-e9d9-221d-45322da0fcff@mellanox.com>

[-- Attachment #1: Type: text/plain, Size: 3473 bytes --]

On Mon, Sep 18, 2017 at 06:33:20PM +0300, Tariq Toukan wrote:
> 
> 
> On 18/09/2017 10:44 AM, Aaron Lu wrote:
> > On Mon, Sep 18, 2017 at 03:34:47PM +0800, Aaron Lu wrote:
> > > On Sun, Sep 17, 2017 at 07:16:15PM +0300, Tariq Toukan wrote:
> > > > 
> > > > It's nice to have the option to dynamically play with the parameter.
> > > > But maybe we should also think of changing the default fraction guaranteed
> > > > to the PCP, so that unaware admins of networking servers would also benefit.
> > > 
> > > I collected some performance data with will-it-scale/page_fault1 process
> > > mode on different machines with different pcp->batch sizes, starting
> > > from the default 31 (calculated by zone_batchsize(); 31 is the standard
> > > value for any zone that has more than 1/2 MiB of memory), then incremented
> > > by 31 up to 527. The PCP's upper limit is 6*batch.
> > > 
> > > An image is plotted and attached: batch_full.png ("full" here means the
> > > number of processes started equals the CPU count).
> > 
> > To be clear: the X-axis is the batch size value (31, 62, 93, ..., 527),
> > and the Y-axis is the per_process_ops value generated by will-it-scale,

One correction here: the Y-axis isn't per_process_ops but per_process_ops *
nr_processes. Still, higher is better.

> > higher is better.
> > 
> > > 
> > >  From the image:
> > > - For EX machines, they all see throughput increase with increasing batch
> > >    size, peaking at around batch_size=310, then falling;
> > > - For EP machines, Haswell-EP and Broadwell-EP also see throughput
> > >    increase with increasing batch size, peaking at batch_size=279, then
> > >    falling; batch_size=310 also delivers a pretty good result. Skylake-EP
> > >    is quite different in that it doesn't see any obvious throughput
> > >    increase after batch_size=93; the trend is still upward, but only
> > >    slightly, finally peaking at batch_size=403, then falling.
> > >    Ivy Bridge-EP behaves much like the desktop ones.
> > > - For desktop machines, they do not see any obvious changes with
> > >    increased batch_size.
> > > 
> > > So the default batch size (31) doesn't deliver a good enough result; we
> > > probably should change the default value.
> 
> Thanks Aaron for sharing your experiment results.
> That's a good analysis of the effect of the batch value.
> I agree with your conclusion.
> 
> From a networking perspective, we should reconsider the defaults so we can
> keep up with increasing NIC line rates.
> Not only for pcp->batch, but also for pcp->high.

I guess I didn't make it clear in my last email: when pcp->batch is
changed, pcp->high is also changed. Their relationship is:
pcp->high = pcp->batch * 6.
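
The relationship above can be sketched with a bit of shell arithmetic, using
some of the batch values from the sweep (plain arithmetic only; this touches
no kernel interface):

```shell
# pcp->high = pcp->batch * 6, for a few of the batch values tested.
# Note that batch=310 yields high=1860, the value used in the two
# measured cases below.
for batch in 31 96 279 310 403 527; do
    high=$((batch * 6))
    printf 'pcp->batch=%-4d -> pcp->high=%d\n' "$batch" "$high"
done
```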

Manipulating percpu_pagelist_fraction can increase pcp->high, but not
pcp->batch (which currently has an upper limit of 96).

My test shows that even when pcp->high stays the same, changing pcp->batch
can further improve will-it-scale's performance. For example, in the two
cases below, pcp->high is set to 1860 in both, but with different pcp->batch:

                 will-it-scale    native_queued_spin_lock_slowpath(perf)
pcp->batch=96    15762348         79.95%
pcp->batch=310   19291492 +22.3%  74.87% -5.1%

Granted, this is the case for will-it-scale and may not apply to your
case. I have a small patch that adds a batch interface for debugging
purposes: echoing a value sets pcp->batch, and pcp->high becomes batch * 6.
You are welcome to give it a try if you think it's worthwhile (attached).
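
For reference, using the attached debug interface would look roughly like
this (a sketch: it assumes the patch is applied and requires root; the sysctl
path is inferred from the "percpu_pagelist_batch" entry added to vm_table):

```shell
# Set pcp->batch to 310; pcp->high then becomes 310 * 6 = 1860 for
# every populated zone (only works with the attached patch applied).
echo 310 > /proc/sys/vm/percpu_pagelist_batch

# The resulting per-cpu pageset high/batch values can be inspected via:
grep -E '(high|batch):' /proc/zoneinfo | head
```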

Regards,
Aaron

[-- Attachment #2: 0001-percpu_pagelist_batch-add-a-batch-interface.patch --]
[-- Type: text/plain, Size: 3764 bytes --]

From e3c9516beb8302cb8fb2f5ab866bbe2686fda5fb Mon Sep 17 00:00:00 2001
From: Aaron Lu <aaron.lu@intel.com>
Date: Thu, 6 Jul 2017 15:00:07 +0800
Subject: [PATCH] percpu_pagelist_batch: add a batch interface

Signed-off-by: Aaron Lu <aaron.lu@intel.com>
---
 include/linux/mmzone.h |  2 ++
 kernel/sysctl.c        |  9 +++++++++
 mm/page_alloc.c        | 40 +++++++++++++++++++++++++++++++++++++++-
 3 files changed, 50 insertions(+), 1 deletion(-)

diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index ef6a13b7bd3e..0548d038b7cd 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -875,6 +875,8 @@ int lowmem_reserve_ratio_sysctl_handler(struct ctl_table *, int,
 					void __user *, size_t *, loff_t *);
 int percpu_pagelist_fraction_sysctl_handler(struct ctl_table *, int,
 					void __user *, size_t *, loff_t *);
+int percpu_pagelist_batch_sysctl_handler(struct ctl_table *, int,
+					void __user *, size_t *, loff_t *);
 int sysctl_min_unmapped_ratio_sysctl_handler(struct ctl_table *, int,
 			void __user *, size_t *, loff_t *);
 int sysctl_min_slab_ratio_sysctl_handler(struct ctl_table *, int,
diff --git a/kernel/sysctl.c b/kernel/sysctl.c
index 4dfba1a76cc3..85cc4544db1b 100644
--- a/kernel/sysctl.c
+++ b/kernel/sysctl.c
@@ -108,6 +108,7 @@ extern unsigned int core_pipe_limit;
 extern int pid_max;
 extern int pid_max_min, pid_max_max;
 extern int percpu_pagelist_fraction;
+extern int percpu_pagelist_batch;
 extern int latencytop_enabled;
 extern unsigned int sysctl_nr_open_min, sysctl_nr_open_max;
 #ifndef CONFIG_MMU
@@ -1440,6 +1441,14 @@ static struct ctl_table vm_table[] = {
 		.proc_handler	= percpu_pagelist_fraction_sysctl_handler,
 		.extra1		= &zero,
 	},
+	{
+		.procname	= "percpu_pagelist_batch",
+		.data		= &percpu_pagelist_batch,
+		.maxlen		= sizeof(percpu_pagelist_batch),
+		.mode		= 0644,
+		.proc_handler	= percpu_pagelist_batch_sysctl_handler,
+		.extra1		= &zero,
+	},
 #ifdef CONFIG_MMU
 	{
 		.procname	= "max_map_count",
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 2302f250d6b1..aa96a4bd6467 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -129,6 +129,7 @@ unsigned long totalreserve_pages __read_mostly;
 unsigned long totalcma_pages __read_mostly;
 
 int percpu_pagelist_fraction;
+int percpu_pagelist_batch;
 gfp_t gfp_allowed_mask __read_mostly = GFP_BOOT_MASK;
 
 /*
@@ -5477,7 +5478,8 @@ static void pageset_set_high_and_batch(struct zone *zone,
 			(zone->managed_pages /
 				percpu_pagelist_fraction));
 	else
-		pageset_set_batch(pcp, zone_batchsize(zone));
+		pageset_set_batch(pcp, percpu_pagelist_batch ?
+				percpu_pagelist_batch : zone_batchsize(zone));
 }
 
 static void __meminit zone_pageset_init(struct zone *zone, int cpu)
@@ -7157,6 +7159,42 @@ int percpu_pagelist_fraction_sysctl_handler(struct ctl_table *table, int write,
 	return ret;
 }
 
+int percpu_pagelist_batch_sysctl_handler(struct ctl_table *table, int write,
+	void __user *buffer, size_t *length, loff_t *ppos)
+{
+	struct zone *zone;
+	int old_percpu_pagelist_batch;
+	int ret;
+
+	mutex_lock(&pcp_batch_high_lock);
+	old_percpu_pagelist_batch = percpu_pagelist_batch;
+
+	ret = proc_dointvec_minmax(table, write, buffer, length, ppos);
+	if (!write || ret < 0)
+		goto out;
+
+	/* Sanity checking to avoid pcp imbalance */
+	if (percpu_pagelist_batch <= 0) {
+		ret = -EINVAL;
+		goto out;
+	}
+
+	/* No change? */
+	if (percpu_pagelist_batch == old_percpu_pagelist_batch)
+		goto out;
+
+	for_each_populated_zone(zone) {
+		unsigned int cpu;
+
+		for_each_possible_cpu(cpu)
+			pageset_set_high_and_batch(zone,
+					per_cpu_ptr(zone->pageset, cpu));
+	}
+out:
+	mutex_unlock(&pcp_batch_high_lock);
+	return ret;
+}
+
 #ifdef CONFIG_NUMA
 int hashdist = HASHDIST_DEFAULT;
 
-- 
2.9.5


