linux-kernel.vger.kernel.org archive mirror
* [PATCH] mm, memcg: Optionally disable memcg by default using Kconfig
@ 2015-05-19 10:40 Mel Gorman
  2015-05-19 14:18 ` Johannes Weiner
  0 siblings, 1 reply; 14+ messages in thread
From: Mel Gorman @ 2015-05-19 10:40 UTC (permalink / raw)
  To: Johannes Weiner, Michal Hocko; +Cc: linux-mm, linux-kernel, Andrew Morton

memcg was reported years ago to have significant overhead when unused. It
has improved but it's still the case that users that have no knowledge of
memcg pay a performance penalty.

This patch adds a Kconfig that controls whether memcg is enabled by default
and a kernel parameter cgroup_enable= to enable it if desired. Anyone using
oldconfig will get the historical behaviour. It is not an option for most
distributions to simply disable MEMCG as there are users that require it
but they should also be knowledgeable enough to use cgroup_enable=.

This was evaluated using aim9, a page fault microbenchmark, and ebizzy,
but I'll focus on the page fault microbenchmark. It can be reproduced
using pft from mmtests (https://github.com/gormanm/mmtests).  Edit
configs/config-global-dhp__pagealloc-performance and update MMTESTS to
only contain pft. This is the relevant part of the profile summary

/usr/src/linux-4.0-vanilla/mm/memcontrol.c                           6.6441   395842
  mem_cgroup_try_charge                                                        2.950%   175781
  __mem_cgroup_count_vm_event                                                  1.431%    85239
  mem_cgroup_page_lruvec                                                       0.456%    27156
  mem_cgroup_commit_charge                                                     0.392%    23342
  uncharge_list                                                                0.323%    19256
  mem_cgroup_update_lru_size                                                   0.278%    16538
  memcg_check_events                                                           0.216%    12858
  mem_cgroup_charge_statistics.isra.22                                         0.188%    11172
  try_charge                                                                   0.150%     8928
  commit_charge                                                                0.141%     8388
  get_mem_cgroup_from_mm                                                       0.121%     7184

It's showing 6.64% overhead in memcontrol.c when no memcgs are in
use. Applying the patch and disabling memcg reduces this to 0.48%.

/usr/src/linux-4.0-nomemcg-v1r1/mm/memcontrol.c                      0.4834    27511
  mem_cgroup_page_lruvec                                                       0.161%     9172
  mem_cgroup_update_lru_size                                                   0.154%     8794
  mem_cgroup_try_charge                                                        0.126%     7194
  mem_cgroup_commit_charge                                                     0.041%     2351

Note that it's not very visible from headline performance figures

pft faults
                                       4.0.0                  4.0.0
                                     vanilla             nomemcg-v1
Hmean    faults/cpu-1 1443258.1051 (  0.00%) 1530574.6033 (  6.05%)
Hmean    faults/cpu-3 1340385.9270 (  0.00%) 1375156.5834 (  2.59%)
Hmean    faults/cpu-5  875599.0222 (  0.00%)  876217.9211 (  0.07%)
Hmean    faults/cpu-7  601146.6726 (  0.00%)  599068.4360 ( -0.35%)
Hmean    faults/cpu-8  510728.2754 (  0.00%)  509887.9960 ( -0.16%)
Hmean    faults/sec-1 1432084.7845 (  0.00%) 1518566.3541 (  6.04%)
Hmean    faults/sec-3 3943818.1437 (  0.00%) 4036918.0217 (  2.36%)
Hmean    faults/sec-5 3877573.5867 (  0.00%) 3922745.9207 (  1.16%)
Hmean    faults/sec-7 3991832.0418 (  0.00%) 3990670.8481 ( -0.03%)
Hmean    faults/sec-8 3987189.8167 (  0.00%) 3978842.8107 ( -0.21%)

Low thread counts get a boost but it's within noise as memcg overhead does
not dominate.  It's not obvious at all at higher thread counts as other
factors cause more problems. The overall breakdown of CPU usage looks like

               4.0.0       4.0.0
             vanilla  nomemcg-v1
User           41.45       41.11
System        410.19      404.76
Elapsed       130.33      126.30

Despite the relative unimportance, there is at least some justification
for disabling memcg by default.

Signed-off-by: Mel Gorman <mgorman@suse.de>
---
 Documentation/kernel-parameters.txt |  4 ++++
 init/Kconfig                        | 15 +++++++++++++++
 kernel/cgroup.c                     | 20 ++++++++++++++++----
 mm/memcontrol.c                     |  3 +++
 4 files changed, 38 insertions(+), 4 deletions(-)

diff --git a/Documentation/kernel-parameters.txt b/Documentation/kernel-parameters.txt
index bfcb1a62a7b4..4f264f906816 100644
--- a/Documentation/kernel-parameters.txt
+++ b/Documentation/kernel-parameters.txt
@@ -591,6 +591,10 @@ bytes respectively. Such letter suffixes can also be entirely omitted.
 			cut the overhead, others just disable the usage. So
 			only cgroup_disable=memory is actually worthy}
 
+	cgroup_enable= [KNL] Enable a particular controller
+			Similar to cgroup_disable except that it enables
+			controllers that are disabled by default.
+
 	checkreqprot	[SELINUX] Set initial checkreqprot flag value.
 			Format: { "0" | "1" }
 			See security/selinux/Kconfig help text.
diff --git a/init/Kconfig b/init/Kconfig
index f5dbc6d4261b..819b6cc05cba 100644
--- a/init/Kconfig
+++ b/init/Kconfig
@@ -990,6 +990,21 @@ config MEMCG
 	  Provides a memory resource controller that manages both anonymous
 	  memory and page cache. (See Documentation/cgroups/memory.txt)
 
+config MEMCG_DEFAULT_ENABLED
+	bool "Automatically enable memory resource controller"
+	default y
+	depends on MEMCG
+	help
+	  The memory controller has some overhead even if idle as resource
+	  usage must be tracked in case a group is created and a process
+	  migrated. As users may not be aware of this and the cgroup_disable=
+	  option, this config option controls whether it is enabled by
+	  default. It is assumed that someone that requires the controller
+	  can find the cgroup_enable= switch.
+
+	  Say N if unsure. This is default Y to preserve oldconfig and
+	  historical behaviour.
+
 config MEMCG_SWAP
 	bool "Memory Resource Controller Swap Extension"
 	depends on MEMCG && SWAP
diff --git a/kernel/cgroup.c b/kernel/cgroup.c
index 29a7b2cc593e..0e79db55bf1a 100644
--- a/kernel/cgroup.c
+++ b/kernel/cgroup.c
@@ -5370,7 +5370,7 @@ out_free:
 	kfree(pathbuf);
 }
 
-static int __init cgroup_disable(char *str)
+static int __init __cgroup_set_state(char *str, bool disabled)
 {
 	struct cgroup_subsys *ss;
 	char *token;
@@ -5382,16 +5382,28 @@ static int __init cgroup_disable(char *str)
 
 		for_each_subsys(ss, i) {
 			if (!strcmp(token, ss->name)) {
-				ss->disabled = 1;
-				printk(KERN_INFO "Disabling %s control group"
-					" subsystem\n", ss->name);
+				ss->disabled = disabled;
+				printk(KERN_INFO "Setting %s control group"
+					" subsystem %s\n", ss->name,
+					disabled ? "disabled" : "enabled");
 				break;
 			}
 		}
 	}
 	return 1;
 }
+
+static int __init cgroup_disable(char *str)
+{
+	return __cgroup_set_state(str, true);
+}
+
+static int __init cgroup_enable(char *str)
+{
+	return __cgroup_set_state(str, false);
+}
 __setup("cgroup_disable=", cgroup_disable);
+__setup("cgroup_enable=", cgroup_enable);
 
 static int __init cgroup_set_legacy_files_on_dfl(char *str)
 {
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index b34ef4a32a3b..ce171ba16949 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -5391,6 +5391,9 @@ struct cgroup_subsys memory_cgrp_subsys = {
 	.dfl_cftypes = memory_files,
 	.legacy_cftypes = mem_cgroup_legacy_files,
 	.early_init = 0,
+#ifndef CONFIG_MEMCG_DEFAULT_ENABLED
+	.disabled = 1,
+#endif
 };
 
 /**


* Re: [PATCH] mm, memcg: Optionally disable memcg by default using Kconfig
  2015-05-19 10:40 [PATCH] mm, memcg: Optionally disable memcg by default using Kconfig Mel Gorman
@ 2015-05-19 14:18 ` Johannes Weiner
  2015-05-19 14:43   ` Mel Gorman
  2015-05-19 14:53   ` Michal Hocko
  0 siblings, 2 replies; 14+ messages in thread
From: Johannes Weiner @ 2015-05-19 14:18 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Michal Hocko, linux-mm, linux-kernel, Andrew Morton, Tejun Heo, cgroups

CC'ing Tejun and cgroups for the generic cgroup interface part

On Tue, May 19, 2015 at 11:40:57AM +0100, Mel Gorman wrote:
> memcg was reported years ago to have significant overhead when unused. It
> has improved but it's still the case that users that have no knowledge of
> memcg pay a performance penalty.
> 
> This patch adds a Kconfig that controls whether memcg is enabled by default
> and a kernel parameter cgroup_enable= to enable it if desired. Anyone using
> oldconfig will get the historical behaviour. It is not an option for most
> distributions to simply disable MEMCG as there are users that require it
> but they should also be knowledgeable enough to use cgroup_enable=.
> 
> This was evaluated using aim9, a page fault microbenchmark and ebizzy
> but I'll focus on the page fault microbenchmark. It can be reproduced
> using pft from mmtests (https://github.com/gormanm/mmtests).  Edit
> configs/config-global-dhp__pagealloc-performance and update MMTESTS to
> only contain pft. This is the relevant part of the profile summary
> 
> /usr/src/linux-4.0-vanilla/mm/memcontrol.c                           6.6441   395842
>   mem_cgroup_try_charge                                                        2.950%   175781

Ouch.  Do you have a way to get the per-instruction breakdown of this?
This function really isn't doing much.  I'll try to reproduce it here
too, I haven't seen such high costs with pft in the past.

>   __mem_cgroup_count_vm_event                                                  1.431%    85239
>   mem_cgroup_page_lruvec                                                       0.456%    27156
>   mem_cgroup_commit_charge                                                     0.392%    23342
>   uncharge_list                                                                0.323%    19256
>   mem_cgroup_update_lru_size                                                   0.278%    16538
>   memcg_check_events                                                           0.216%    12858
>   mem_cgroup_charge_statistics.isra.22                                         0.188%    11172
>   try_charge                                                                   0.150%     8928
>   commit_charge                                                                0.141%     8388
>   get_mem_cgroup_from_mm                                                       0.121%     7184
> 
> It's showing 6.64% overhead in memcontrol.c when no memcgs are in
> use. Applying the patch and disabling memcg reduces this to 0.48%

The frustrating part is that 4.5% of that is not even coming from the
main accounting and tracking work.  I'm looking into getting this
fixed regardless of what happens with this patch.

> /usr/src/linux-4.0-nomemcg-v1r1/mm/memcontrol.c                      0.4834    27511
>   mem_cgroup_page_lruvec                                                       0.161%     9172
>   mem_cgroup_update_lru_size                                                   0.154%     8794
>   mem_cgroup_try_charge                                                        0.126%     7194
>   mem_cgroup_commit_charge                                                     0.041%     2351
> 
> Note that it's not very visible from headline performance figures
> 
> pft faults
>                                        4.0.0                  4.0.0
>                                      vanilla             nomemcg-v1
> Hmean    faults/cpu-1 1443258.1051 (  0.00%) 1530574.6033 (  6.05%)
> Hmean    faults/cpu-3 1340385.9270 (  0.00%) 1375156.5834 (  2.59%)
> Hmean    faults/cpu-5  875599.0222 (  0.00%)  876217.9211 (  0.07%)
> Hmean    faults/cpu-7  601146.6726 (  0.00%)  599068.4360 ( -0.35%)
> Hmean    faults/cpu-8  510728.2754 (  0.00%)  509887.9960 ( -0.16%)
> Hmean    faults/sec-1 1432084.7845 (  0.00%) 1518566.3541 (  6.04%)
> Hmean    faults/sec-3 3943818.1437 (  0.00%) 4036918.0217 (  2.36%)
> Hmean    faults/sec-5 3877573.5867 (  0.00%) 3922745.9207 (  1.16%)
> Hmean    faults/sec-7 3991832.0418 (  0.00%) 3990670.8481 ( -0.03%)
> Hmean    faults/sec-8 3987189.8167 (  0.00%) 3978842.8107 ( -0.21%)
> 
> Low thread counts get a boost but it's within noise as memcg overhead does
> not dominate.  It's not obvious at all at higher thread counts as other
> factors cause more problems. The overall breakdown of CPU usage looks like
> 
>                4.0.0       4.0.0
>              vanilla  nomemcg-v1
> User           41.45       41.11
> System        410.19      404.76
> Elapsed       130.33      126.30
> 
> Despite the relative unimportance, there is at least some justification
> for disabling memcg by default.

I guess so.  The only thing I don't like about this is that it changes
the default of a single controller.  While there is some justification
from an overhead standpoint, it's a little weird in terms of interface
when you boot, say, a distribution kernel and it has cgroups with all
but one resource controller available.

Would it make more sense to provide a Kconfig option that disables all
resource controllers per default?  There is still value in having only
the generic cgroup part for grouped process monitoring and control.

Thanks,
Johannes

> Signed-off-by: Mel Gorman <mgorman@suse.de>
> ---
>  Documentation/kernel-parameters.txt |  4 ++++
>  init/Kconfig                        | 15 +++++++++++++++
>  kernel/cgroup.c                     | 20 ++++++++++++++++----
>  mm/memcontrol.c                     |  3 +++
>  4 files changed, 38 insertions(+), 4 deletions(-)
> 
> diff --git a/Documentation/kernel-parameters.txt b/Documentation/kernel-parameters.txt
> index bfcb1a62a7b4..4f264f906816 100644
> --- a/Documentation/kernel-parameters.txt
> +++ b/Documentation/kernel-parameters.txt
> @@ -591,6 +591,10 @@ bytes respectively. Such letter suffixes can also be entirely omitted.
>  			cut the overhead, others just disable the usage. So
>  			only cgroup_disable=memory is actually worthy}
>  
> +	cgroup_enable= [KNL] Enable a particular controller
> +			Similar to cgroup_disable except that it enables
> +			controllers that are disabled by default.
> +
>  	checkreqprot	[SELINUX] Set initial checkreqprot flag value.
>  			Format: { "0" | "1" }
>  			See security/selinux/Kconfig help text.
> diff --git a/init/Kconfig b/init/Kconfig
> index f5dbc6d4261b..819b6cc05cba 100644
> --- a/init/Kconfig
> +++ b/init/Kconfig
> @@ -990,6 +990,21 @@ config MEMCG
>  	  Provides a memory resource controller that manages both anonymous
>  	  memory and page cache. (See Documentation/cgroups/memory.txt)
>  
> +config MEMCG_DEFAULT_ENABLED
> +	bool "Automatically enable memory resource controller"
> +	default y
> +	depends on MEMCG
> +	help
> +	  The memory controller has some overhead even if idle as resource
> +	  usage must be tracked in case a group is created and a process
> +	  migrated. As users may not be aware of this and the cgroup_disable=
> +	  option, this config option controls whether it is enabled by
> +	  default. It is assumed that someone that requires the controller
> +	  can find the cgroup_enable= switch.
> +
> +	  Say N if unsure. This is default Y to preserve oldconfig and
> +	  historical behaviour.
> +
>  config MEMCG_SWAP
>  	bool "Memory Resource Controller Swap Extension"
>  	depends on MEMCG && SWAP
> diff --git a/kernel/cgroup.c b/kernel/cgroup.c
> index 29a7b2cc593e..0e79db55bf1a 100644
> --- a/kernel/cgroup.c
> +++ b/kernel/cgroup.c
> @@ -5370,7 +5370,7 @@ out_free:
>  	kfree(pathbuf);
>  }
>  
> -static int __init cgroup_disable(char *str)
> +static int __init __cgroup_set_state(char *str, bool disabled)
>  {
>  	struct cgroup_subsys *ss;
>  	char *token;
> @@ -5382,16 +5382,28 @@ static int __init cgroup_disable(char *str)
>  
>  		for_each_subsys(ss, i) {
>  			if (!strcmp(token, ss->name)) {
> -				ss->disabled = 1;
> -				printk(KERN_INFO "Disabling %s control group"
> -					" subsystem\n", ss->name);
> +				ss->disabled = disabled;
> +				printk(KERN_INFO "Setting %s control group"
> +					" subsystem %s\n", ss->name,
> +					disabled ? "disabled" : "enabled");
>  				break;
>  			}
>  		}
>  	}
>  	return 1;
>  }
> +
> +static int __init cgroup_disable(char *str)
> +{
> +	return __cgroup_set_state(str, true);
> +}
> +
> +static int __init cgroup_enable(char *str)
> +{
> +	return __cgroup_set_state(str, false);
> +}
>  __setup("cgroup_disable=", cgroup_disable);
> +__setup("cgroup_enable=", cgroup_enable);
>  
>  static int __init cgroup_set_legacy_files_on_dfl(char *str)
>  {
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index b34ef4a32a3b..ce171ba16949 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -5391,6 +5391,9 @@ struct cgroup_subsys memory_cgrp_subsys = {
>  	.dfl_cftypes = memory_files,
>  	.legacy_cftypes = mem_cgroup_legacy_files,
>  	.early_init = 0,
> +#ifndef CONFIG_MEMCG_DEFAULT_ENABLED
> +	.disabled = 1,
> +#endif
>  };
>  
>  /**


* Re: [PATCH] mm, memcg: Optionally disable memcg by default using Kconfig
  2015-05-19 14:18 ` Johannes Weiner
@ 2015-05-19 14:43   ` Mel Gorman
  2015-05-19 15:15     ` Michal Hocko
  2015-05-19 14:53   ` Michal Hocko
  1 sibling, 1 reply; 14+ messages in thread
From: Mel Gorman @ 2015-05-19 14:43 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: Michal Hocko, linux-mm, linux-kernel, Andrew Morton, Tejun Heo, cgroups

On Tue, May 19, 2015 at 10:18:07AM -0400, Johannes Weiner wrote:
> CC'ing Tejun and cgroups for the generic cgroup interface part
> 
> On Tue, May 19, 2015 at 11:40:57AM +0100, Mel Gorman wrote:
> > memcg was reported years ago to have significant overhead when unused. It
> > has improved but it's still the case that users that have no knowledge of
> > memcg pay a performance penalty.
> > 
> > This patch adds a Kconfig that controls whether memcg is enabled by default
> > and a kernel parameter cgroup_enable= to enable it if desired. Anyone using
> > oldconfig will get the historical behaviour. It is not an option for most
> > distributions to simply disable MEMCG as there are users that require it
> > but they should also be knowledgeable enough to use cgroup_enable=.
> > 
> > This was evaluated using aim9, a page fault microbenchmark and ebizzy
> > but I'll focus on the page fault microbenchmark. It can be reproduced
> > using pft from mmtests (https://github.com/gormanm/mmtests).  Edit
> > configs/config-global-dhp__pagealloc-performance and update MMTESTS to
> > only contain pft. This is the relevant part of the profile summary
> > 
> > /usr/src/linux-4.0-vanilla/mm/memcontrol.c                           6.6441   395842
> >   mem_cgroup_try_charge                                                        2.950%   175781
> 
> Ouch.  Do you have a way to get the per-instruction breakdown of this?

Not that I can upload in a reasonable amount of time. An annotated profile
and a vmlinux image for decoding addresses are not small. My expectation is
that it'd be trivially reproducible.

> This function really isn't doing much.  I'll try to reproduce it here
> too, I haven't seen such high costs with pft in the past.
> 

I don't believe it's the machine that is being particularly stupid. It's
a fairly bog-standard desktop class box.

> >   __mem_cgroup_count_vm_event                                                  1.431%    85239
> >   mem_cgroup_page_lruvec                                                       0.456%    27156
> >   mem_cgroup_commit_charge                                                     0.392%    23342
> >   uncharge_list                                                                0.323%    19256
> >   mem_cgroup_update_lru_size                                                   0.278%    16538
> >   memcg_check_events                                                           0.216%    12858
> >   mem_cgroup_charge_statistics.isra.22                                         0.188%    11172
> >   try_charge                                                                   0.150%     8928
> >   commit_charge                                                                0.141%     8388
> >   get_mem_cgroup_from_mm                                                       0.121%     7184
> > 
> > It's showing 6.64% overhead in memcontrol.c when no memcgs are in
> > use. Applying the patch and disabling memcg reduces this to 0.48%
> 
> The frustrating part is that 4.5% of that is not even coming from the
> main accounting and tracking work.  I'm looking into getting this
> fixed regardless of what happens with this patch.
> 
> > /usr/src/linux-4.0-nomemcg-v1r1/mm/memcontrol.c                      0.4834    27511
> >   mem_cgroup_page_lruvec                                                       0.161%     9172
> >   mem_cgroup_update_lru_size                                                   0.154%     8794
> >   mem_cgroup_try_charge                                                        0.126%     7194
> >   mem_cgroup_commit_charge                                                     0.041%     2351
> > 
> > Note that it's not very visible from headline performance figures
> > 
> > pft faults
> >                                        4.0.0                  4.0.0
> >                                      vanilla             nomemcg-v1
> > Hmean    faults/cpu-1 1443258.1051 (  0.00%) 1530574.6033 (  6.05%)
> > Hmean    faults/cpu-3 1340385.9270 (  0.00%) 1375156.5834 (  2.59%)
> > Hmean    faults/cpu-5  875599.0222 (  0.00%)  876217.9211 (  0.07%)
> > Hmean    faults/cpu-7  601146.6726 (  0.00%)  599068.4360 ( -0.35%)
> > Hmean    faults/cpu-8  510728.2754 (  0.00%)  509887.9960 ( -0.16%)
> > Hmean    faults/sec-1 1432084.7845 (  0.00%) 1518566.3541 (  6.04%)
> > Hmean    faults/sec-3 3943818.1437 (  0.00%) 4036918.0217 (  2.36%)
> > Hmean    faults/sec-5 3877573.5867 (  0.00%) 3922745.9207 (  1.16%)
> > Hmean    faults/sec-7 3991832.0418 (  0.00%) 3990670.8481 ( -0.03%)
> > Hmean    faults/sec-8 3987189.8167 (  0.00%) 3978842.8107 ( -0.21%)
> > 
> > Low thread counts get a boost but it's within noise as memcg overhead does
> > not dominate.  It's not obvious at all at higher thread counts as other
> > factors cause more problems. The overall breakdown of CPU usage looks like
> > 
> >                4.0.0       4.0.0
> >              vanilla  nomemcg-v1
> > User           41.45       41.11
> > System        410.19      404.76
> > Elapsed       130.33      126.30
> > 
> > Despite the relative unimportance, there is at least some justification
> > for disabling memcg by default.
> 
> I guess so.  The only thing I don't like about this is that it changes
> the default of a single controller.  While there is some justification
> from an overhead standpoint, it's a little weird in terms of interface
> when you boot, say, a distribution kernel and it has cgroups with all
> but one resource controller available.
> 
> Would it make more sense to provide a Kconfig option that disables all
> resource controllers per default?  There is still value in having only
> the generic cgroup part for grouped process monitoring and control.
> 

A config option per controller seems overkill because AFAIK the other
controllers are harmless in terms of overhead. All enabled or all
disabled has other consequences because AFAIK systemd requires some
controllers to function correctly -- e.g.
https://bugs.freedesktop.org/show_bug.cgi?id=74589

After I wrote the patch, I spotted that Debian apparently already
does something like this and by coincidence they matched the
parameter name and values. See the memory controller instructions on
https://wiki.debian.org/LXC#Prepare_the_host . So in this case at least
upstream would match something that at least one distro in the field
already uses.

-- 
Mel Gorman
SUSE Labs


* Re: [PATCH] mm, memcg: Optionally disable memcg by default using Kconfig
  2015-05-19 14:18 ` Johannes Weiner
  2015-05-19 14:43   ` Mel Gorman
@ 2015-05-19 14:53   ` Michal Hocko
  2015-05-19 15:12     ` Vlastimil Babka
  2015-05-19 15:13     ` Mel Gorman
  1 sibling, 2 replies; 14+ messages in thread
From: Michal Hocko @ 2015-05-19 14:53 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: Mel Gorman, linux-mm, linux-kernel, Andrew Morton, Tejun Heo, cgroups

On Tue 19-05-15 10:18:07, Johannes Weiner wrote:
> CC'ing Tejun and cgroups for the generic cgroup interface part
> 
> On Tue, May 19, 2015 at 11:40:57AM +0100, Mel Gorman wrote:
[...]
> > /usr/src/linux-4.0-vanilla/mm/memcontrol.c                           6.6441   395842
> >   mem_cgroup_try_charge                                                        2.950%   175781
> 
> Ouch.  Do you have a way to get the per-instruction breakdown of this?
> This function really isn't doing much.  I'll try to reproduce it here
> too, I haven't seen such high costs with pft in the past.
> 
> >   try_charge                                                                   0.150%     8928
> >   get_mem_cgroup_from_mm                                                       0.121%     7184

Indeed! try_charge and get_mem_cgroup_from_mm, which I would expect to be
the biggest consumers here, are below 10% of mem_cgroup_try_charge.
Other than that, the function doesn't do much besides some flag
queries and css_put...

Do you have the full trace? Sorry for a stupid question but do inlines
from other header files get accounted to memcontrol.c?

[...]
-- 
Michal Hocko
SUSE Labs


* Re: [PATCH] mm, memcg: Optionally disable memcg by default using Kconfig
  2015-05-19 14:53   ` Michal Hocko
@ 2015-05-19 15:12     ` Vlastimil Babka
  2015-05-19 15:13     ` Mel Gorman
  1 sibling, 0 replies; 14+ messages in thread
From: Vlastimil Babka @ 2015-05-19 15:12 UTC (permalink / raw)
  To: Michal Hocko, Johannes Weiner
  Cc: Mel Gorman, linux-mm, linux-kernel, Andrew Morton, Tejun Heo, cgroups

On 05/19/2015 04:53 PM, Michal Hocko wrote:
> On Tue 19-05-15 10:18:07, Johannes Weiner wrote:
>> CC'ing Tejun and cgroups for the generic cgroup interface part
>>
>> On Tue, May 19, 2015 at 11:40:57AM +0100, Mel Gorman wrote:
> [...]
>>> /usr/src/linux-4.0-vanilla/mm/memcontrol.c                           6.6441   395842
>>>    mem_cgroup_try_charge                                                        2.950%   175781
>>
>> Ouch.  Do you have a way to get the per-instruction breakdown of this?
>> This function really isn't doing much.  I'll try to reproduce it here
>> too, I haven't seen such high costs with pft in the past.
>>
>>>    try_charge                                                                   0.150%     8928
>>>    get_mem_cgroup_from_mm                                                       0.121%     7184
>
> Indeed! try_charge + get_mem_cgroup_from_mm which I would expect to be
> the biggest consumers here are below 10% of the mem_cgroup_try_charge.

Note that they don't explain 10% of the mem_cgroup_try_charge. They 
*add* their own overhead to the overhead of mem_cgroup_try_charge 
itself. Which might be what you meant but I wasn't sure.

> Other than that the function doesn't do much else than some flags
> queries and css_put...
>
> Do you have the full trace?
> Sorry for a stupid question but do inlines
> from other header files get accounted to memcontrol.c?

Yes, perf doesn't know about them, so they are accounted to the function
where the code physically is.

>
> [...]
>



* Re: [PATCH] mm, memcg: Optionally disable memcg by default using Kconfig
  2015-05-19 14:53   ` Michal Hocko
  2015-05-19 15:12     ` Vlastimil Babka
@ 2015-05-19 15:13     ` Mel Gorman
  2015-05-19 15:25       ` Vlastimil Babka
  2015-05-19 15:27       ` Michal Hocko
  1 sibling, 2 replies; 14+ messages in thread
From: Mel Gorman @ 2015-05-19 15:13 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Johannes Weiner, linux-mm, linux-kernel, Andrew Morton,
	Tejun Heo, cgroups

On Tue, May 19, 2015 at 04:53:40PM +0200, Michal Hocko wrote:
> On Tue 19-05-15 10:18:07, Johannes Weiner wrote:
> > CC'ing Tejun and cgroups for the generic cgroup interface part
> > 
> > On Tue, May 19, 2015 at 11:40:57AM +0100, Mel Gorman wrote:
> [...]
> > > /usr/src/linux-4.0-vanilla/mm/memcontrol.c                           6.6441   395842
> > >   mem_cgroup_try_charge                                                        2.950%   175781
> > 
> > Ouch.  Do you have a way to get the per-instruction breakdown of this?
> > This function really isn't doing much.  I'll try to reproduce it here
> > too, I haven't seen such high costs with pft in the past.
> > 
> > >   try_charge                                                                   0.150%     8928
> > >   get_mem_cgroup_from_mm                                                       0.121%     7184
> 
> Indeed! try_charge + get_mem_cgroup_from_mm which I would expect to be
> the biggest consumers here are below 10% of the mem_cgroup_try_charge.
> Other than that the function doesn't do much else than some flags
> queries and css_put...
> 
> Do you have the full trace? Sorry for a stupid question but do inlines
> from other header files get accounted to memcontrol.c?
> 

The annotations for those functions, with some very basic notes, are
as follows. Note that I've done almost no research on this; I just noticed
that the memcg overhead was still there when looking for something else.

ffffffff811c15f0 <mem_cgroup_try_charge>: /* mem_cgroup_try_charge total: 176903  2.9692 */
   765  0.0128 :ffffffff811c15f0:       callq  ffffffff816435e0 <__fentry__>
    78  0.0013 :ffffffff811c15f5:       push   %rbp
  1185  0.0199 :ffffffff811c15f6:       mov    %rsp,%rbp
   356  0.0060 :ffffffff811c15f9:       push   %r14
   209  0.0035 :ffffffff811c15fb:       push   %r13
  1599  0.0268 :ffffffff811c15fd:       push   %r12
   320  0.0054 :ffffffff811c15ff:       mov    %rcx,%r12
   305  0.0051 :ffffffff811c1602:       push   %rbx
   325  0.0055 :ffffffff811c1603:       sub    $0x10,%rsp
   878  0.0147 :ffffffff811c1607:       mov    0xb7501b(%rip),%ecx        # ffffffff81d36628 <memory_cgrp_subsys+0x68>
   571  0.0096 :ffffffff811c160d:       test   %ecx,%ecx

### MEL: Function entry, check for mem_cgroup_disabled()


               :ffffffff811c160f:       je     ffffffff811c1630 <mem_cgroup_try_charge+0x40>
               :ffffffff811c1611:       xor    %eax,%eax
               :ffffffff811c1613:       xor    %ebx,%ebx
     1 1.7e-05 :ffffffff811c1615:       mov    %rbx,(%r12)
     7 1.2e-04 :ffffffff811c1619:       add    $0x10,%rsp
  1211  0.0203 :ffffffff811c161d:       pop    %rbx
     5 8.4e-05 :ffffffff811c161e:       pop    %r12
     5 8.4e-05 :ffffffff811c1620:       pop    %r13
  1249  0.0210 :ffffffff811c1622:       pop    %r14
     7 1.2e-04 :ffffffff811c1624:       pop    %rbp
     5 8.4e-05 :ffffffff811c1625:       retq   
               :ffffffff811c1626:       nopw   %cs:0x0(%rax,%rax,1)
   295  0.0050 :ffffffff811c1630:       mov    (%rdi),%rax
160703  2.6973 :ffffffff811c1633:       mov    %edx,%r13d

#### MEL: I was surprised to see this atrocity. It's a PageSwapCache check
#### /usr/src/linux-4.0-vanilla/./arch/x86/include/asm/bitops.h:311
#### /usr/src/linux-4.0-vanilla/include/linux/page-flags.h:261
#### /usr/src/linux-4.0-vanilla/mm/memcontrol.c:5473
####
#### Everything after here is consistent small amounts of overhead just from
#### being called a lot

   179  0.0030 :ffffffff811c1636:       test   $0x10000,%eax
               :ffffffff811c163b:       je     ffffffff811c1648 <mem_cgroup_try_charge+0x58>
               :ffffffff811c163d:       xor    %eax,%eax
               :ffffffff811c163f:       xor    %ebx,%ebx
               :ffffffff811c1641:       cmpq   $0x0,0x38(%rdi)
               :ffffffff811c1646:       jne    ffffffff811c1615 <mem_cgroup_try_charge+0x25>
  1343  0.0225 :ffffffff811c1648:       mov    (%rdi),%rax
    26 4.4e-04 :ffffffff811c164b:       mov    $0x1,%r14d
    24 4.0e-04 :ffffffff811c1651:       test   $0x40,%ah
               :ffffffff811c1654:       je     ffffffff811c1665 <mem_cgroup_try_charge+0x75>
               :ffffffff811c1656:       mov    (%rdi),%rax
               :ffffffff811c1659:       test   $0x40,%ah
               :ffffffff811c165c:       je     ffffffff811c1665 <mem_cgroup_try_charge+0x75>
               :ffffffff811c165e:       mov    0x68(%rdi),%rcx
               :ffffffff811c1662:       shl    %cl,%r14d
  1225  0.0206 :ffffffff811c1665:       mov    0xb74f35(%rip),%eax        # ffffffff81d365a0 <do_swap_account>
    66  0.0011 :ffffffff811c166b:       test   %eax,%eax
               :ffffffff811c166d:       jne    ffffffff811c16a8 <mem_cgroup_try_charge+0xb8>
     3 5.0e-05 :ffffffff811c166f:       mov    %rsi,%rdi
    22 3.7e-04 :ffffffff811c1672:       callq  ffffffff811bc920 <get_mem_cgroup_from_mm>
  1291  0.0217 :ffffffff811c1677:       mov    %rax,%rbx
     3 5.0e-05 :ffffffff811c167a:       mov    %r14d,%edx
               :ffffffff811c167d:       mov    %r13d,%esi
    10 1.7e-04 :ffffffff811c1680:       mov    %rbx,%rdi
  1380  0.0232 :ffffffff811c1683:       callq  ffffffff811c0950 <try_charge>
    10 1.7e-04 :ffffffff811c1688:       testb  $0x1,0x74(%rbx)
  1235  0.0207 :ffffffff811c168c:       je     ffffffff811c16d0 <mem_cgroup_try_charge+0xe0>
     7 1.2e-04 :ffffffff811c168e:       cmp    $0xfffffffc,%eax
               :ffffffff811c1691:       jne    ffffffff811c1615 <mem_cgroup_try_charge+0x25>
               :ffffffff811c1693:       mov    0xb74f0e(%rip),%rbx        # ffffffff81d365a8 <root_mem_cgroup>
               :ffffffff811c169a:       xor    %eax,%eax
               :ffffffff811c169c:       jmpq   ffffffff811c1615 <mem_cgroup_try_charge+0x25>
               :ffffffff811c16a1:       nopl   0x0(%rax)
               :ffffffff811c16a8:       mov    (%rdi),%rax
               :ffffffff811c16ab:       test   $0x10000,%eax
               :ffffffff811c16b0:       je     ffffffff811c166f <mem_cgroup_try_charge+0x7f>
               :ffffffff811c16b2:       mov    %rsi,-0x28(%rbp)
               :ffffffff811c16b6:       callq  ffffffff811c0450 <try_get_mem_cgroup_from_page>
               :ffffffff811c16bb:       test   %rax,%rax
               :ffffffff811c16be:       mov    %rax,%rbx
               :ffffffff811c16c1:       mov    -0x28(%rbp),%rsi
               :ffffffff811c16c5:       jne    ffffffff811c167a <mem_cgroup_try_charge+0x8a>
               :ffffffff811c16c7:       jmp    ffffffff811c166f <mem_cgroup_try_charge+0x7f>
               :ffffffff811c16c9:       nopl   0x0(%rax)
               :ffffffff811c16d0:       mov    0x18(%rbx),%rdx
               :ffffffff811c16d4:       test   $0x3,%dl
               :ffffffff811c16d7:       jne    ffffffff811c16df <mem_cgroup_try_charge+0xef>
               :ffffffff811c16d9:       decq   %gs:(%rdx)
               :ffffffff811c16dd:       jmp    ffffffff811c168e <mem_cgroup_try_charge+0x9e>
               :ffffffff811c16df:       lea    0x10(%rbx),%rdi
               :ffffffff811c16e3:       lock subq $0x1,0x10(%rbx)
               :ffffffff811c16e9:       je     ffffffff811c16ed <mem_cgroup_try_charge+0xfd>
               :ffffffff811c16eb:       jmp    ffffffff811c168e <mem_cgroup_try_charge+0x9e>
               :ffffffff811c16ed:       mov    %eax,-0x28(%rbp)
               :ffffffff811c16f0:       callq  *0x20(%rbx)
               :ffffffff811c16f3:       mov    -0x28(%rbp),%eax
               :ffffffff811c16f6:       jmp    ffffffff811c168e <mem_cgroup_try_charge+0x9e>
               :ffffffff811c16f8:       nopl   0x0(%rax,%rax,1)
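For reference, the control flow my annotations above point at can be modelled in plain C. This is a userspace sketch only: all names and the flag layout are simplified stand-ins, not the kernel's actual definitions, and the charge path itself is stubbed out.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Simplified stand-ins; not the kernel's real types or flag layout. */
struct page {
	unsigned long flags;
	void *mem_cgroup;	/* models page->mem_cgroup (3.19+) */
};

#define PG_SWAPCACHE (1UL << 16)	/* the bit behind "test $0x10000,%eax" */

static bool memcg_enabled = true;	/* models !mem_cgroup_disabled() */

/*
 * Rough shape of the listing above:
 * 1. bail out early when memcg is disabled (the test on function entry);
 * 2. a swapcache page that already carries a memcg skips the charge;
 * 3. everything else falls through to the (stubbed) charge path.
 */
static int try_charge_model(struct page *page, void **memcgp)
{
	if (!memcg_enabled) {
		*memcgp = NULL;
		return 0;
	}
	if ((page->flags & PG_SWAPCACHE) && page->mem_cgroup != NULL) {
		*memcgp = NULL;		/* already charged swapcache page */
		return 0;
	}
	*memcgp = &memcg_enabled;	/* placeholder for the charged memcg */
	return 0;			/* the real try_charge() can fail */
}
```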

ffffffff811bc920 <get_mem_cgroup_from_mm>: /* get_mem_cgroup_from_mm total:   7251  0.1217 */
#### MEL: Nothing really big jumped out there at me.
  1318  0.0221 :ffffffff811bc920:       callq  ffffffff816435e0 <__fentry__>
    19 3.2e-04 :ffffffff811bc925:       push   %rbp
    42 7.0e-04 :ffffffff811bc926:       mov    %rsp,%rbp
  1278  0.0215 :ffffffff811bc929:       jmp    ffffffff811bc94b <get_mem_cgroup_from_mm+0x2b>
               :ffffffff811bc92b:       nopl   0x0(%rax,%rax,1)
  1259  0.0211 :ffffffff811bc930:       testb  $0x1,0x74(%rdx)
   161  0.0027 :ffffffff811bc934:       jne    ffffffff811bc980 <get_mem_cgroup_from_mm+0x60>
               :ffffffff811bc936:       mov    0x18(%rdx),%rax
               :ffffffff811bc93a:       test   $0x3,%al
               :ffffffff811bc93c:       jne    ffffffff811bc985 <get_mem_cgroup_from_mm+0x65>
               :ffffffff811bc93e:       incq   %gs:(%rax)
               :ffffffff811bc942:       mov    $0x1,%eax
               :ffffffff811bc947:       test   %al,%al
               :ffffffff811bc949:       jne    ffffffff811bc980 <get_mem_cgroup_from_mm+0x60>
    13 2.2e-04 :ffffffff811bc94b:       test   %rdi,%rdi
               :ffffffff811bc94e:       je     ffffffff811bc96c <get_mem_cgroup_from_mm+0x4c>
    47 7.9e-04 :ffffffff811bc950:       mov    0x340(%rdi),%rax
  1410  0.0237 :ffffffff811bc957:       test   %rax,%rax
               :ffffffff811bc95a:       je     ffffffff811bc96c <get_mem_cgroup_from_mm+0x4c>
    26 4.4e-04 :ffffffff811bc95c:       mov    0xca0(%rax),%rax
   179  0.0030 :ffffffff811bc963:       mov    0x70(%rax),%rdx
   174  0.0029 :ffffffff811bc967:       test   %rdx,%rdx
               :ffffffff811bc96a:       jne    ffffffff811bc930 <get_mem_cgroup_from_mm+0x10>
               :ffffffff811bc96c:       mov    0xb79c35(%rip),%rdx        # ffffffff81d365a8 <root_mem_cgroup>
     1 1.7e-05 :ffffffff811bc973:       testb  $0x1,0x74(%rdx)
               :ffffffff811bc977:       je     ffffffff811bc936 <get_mem_cgroup_from_mm+0x16>
               :ffffffff811bc979:       nopl   0x0(%rax)
  1299  0.0218 :ffffffff811bc980:       mov    %rdx,%rax
     4 6.7e-05 :ffffffff811bc983:       pop    %rbp
    21 3.5e-04 :ffffffff811bc984:       retq   
               :ffffffff811bc985:       testb  $0x2,0x18(%rdx)
               :ffffffff811bc989:       jne    ffffffff811bc9d2 <get_mem_cgroup_from_mm+0xb2>
               :ffffffff811bc98b:       mov    0x10(%rdx),%rcx
               :ffffffff811bc98f:       test   %rcx,%rcx
               :ffffffff811bc992:       je     ffffffff811bc9d2 <get_mem_cgroup_from_mm+0xb2>
               :ffffffff811bc994:       lea    0x1(%rcx),%rsi
               :ffffffff811bc998:       lea    0x10(%rdx),%r8
               :ffffffff811bc99c:       mov    %rcx,%rax
               :ffffffff811bc99f:       lock cmpxchg %rsi,0x10(%rdx)
               :ffffffff811bc9a5:       cmp    %rcx,%rax
               :ffffffff811bc9a8:       mov    %rax,%rsi
               :ffffffff811bc9ab:       jne    ffffffff811bc9b4 <get_mem_cgroup_from_mm+0x94>
               :ffffffff811bc9ad:       mov    $0x1,%eax
               :ffffffff811bc9b2:       jmp    ffffffff811bc947 <get_mem_cgroup_from_mm+0x27>
               :ffffffff811bc9b4:       test   %rsi,%rsi
               :ffffffff811bc9b7:       je     ffffffff811bc9d2 <get_mem_cgroup_from_mm+0xb2>
               :ffffffff811bc9b9:       lea    0x1(%rsi),%rcx
               :ffffffff811bc9bd:       mov    %rsi,%rax
               :ffffffff811bc9c0:       lock cmpxchg %rcx,(%r8)
               :ffffffff811bc9c5:       cmp    %rax,%rsi
               :ffffffff811bc9c8:       je     ffffffff811bc9ad <get_mem_cgroup_from_mm+0x8d>
               :ffffffff811bc9ca:       mov    %rax,%rsi
               :ffffffff811bc9cd:       test   %rsi,%rsi
               :ffffffff811bc9d0:       jne    ffffffff811bc9b9 <get_mem_cgroup_from_mm+0x99>
               :ffffffff811bc9d2:       xor    %eax,%eax
               :ffffffff811bc9d4:       jmpq   ffffffff811bc947 <get_mem_cgroup_from_mm+0x27>
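The split visible in the listing is the reference grab: a plain %gs-relative "incq" while the counter is still in percpu mode, falling back to the "lock cmpxchg" retry loop once it has switched to a shared atomic. A userspace model of that split, with illustrative names rather than the kernel API:

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Illustrative model, not kernel code: the two reference paths
 * visible in the listing above. */
struct ref_model {
	bool percpu_live;	/* models the mode check on the pointer bits */
	long percpu_count;	/* stands in for the %gs-relative counter */
	atomic_long count;	/* the slow-path shared counter */
};

static bool ref_tryget_model(struct ref_model *r)
{
	if (r->percpu_live) {
		r->percpu_count++;	/* fast path: "incq %gs:(%rax)" */
		return true;
	}
	/* slow path: only take a reference while the count is still
	 * nonzero, mirroring the "lock cmpxchg" retry loop */
	long old = atomic_load(&r->count);
	while (old != 0) {
		if (atomic_compare_exchange_weak(&r->count, &old, old + 1))
			return true;
	}
	return false;			/* reference already dead */
}
```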

-- 
Mel Gorman
SUSE Labs

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH] mm, memcg: Optionally disable memcg by default using Kconfig
  2015-05-19 14:43   ` Mel Gorman
@ 2015-05-19 15:15     ` Michal Hocko
  2015-05-19 17:09       ` Ben Hutchings
  0 siblings, 1 reply; 14+ messages in thread
From: Michal Hocko @ 2015-05-19 15:15 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Johannes Weiner, linux-mm, linux-kernel, Andrew Morton,
	Tejun Heo, cgroups, Ben Hutchings

[Let's CC Ben here - the email thread has started here:
http://marc.info/?l=linux-mm&m=143203206402073&w=2 and it seems Debian
is disabling memcg controller already so this might be of your interest]

On Tue 19-05-15 15:43:45, Mel Gorman wrote:
[...]
> After I wrote the patch, I spotted that Debian apparently already
> does something like this and by coincidence they matched the
> parameter name and values. See the memory controller instructions on
> https://wiki.debian.org/LXC#Prepare_the_host . So in this case at least
> upstream would match something that at least one distro in the field
> already uses.

I've read through
https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=534964 and it seems
that the primary motivation for the runtime disabling was the _memory_
overhead of the struct page_cgroup
(https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=534964#152). This is
no longer the case since 1306a85aed3e ("mm: embed the memcg pointer
directly into struct page") merged in 3.19.

I can see some point in disabling memcg due to runtime overhead.
There will always be some, albeit hard to notice. If a user really needs
this to happen, there is a command line option for that. The question is
who would do CONFIG_MEMCG && !MEMCG_DEFAULT_ENABLED. Do you expect any
distributions to go that way?
Ben, would you welcome such a change upstream or is there a reason to
change the Debian kernel runtime default now that the memory overhead is
mostly gone (for 3.19+ kernels of course)?
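The runtime switch itself is cheap, for what it's worth: handling a cgroup_enable=/cgroup_disable= value is just a walk over a comma-separated list of subsystem names. A userspace sketch of how such a value could be matched (hypothetical helper for illustration, not the kernel's actual parser):

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

/* Hypothetical helper, not the kernel's parser: checks whether a
 * comma-separated subsystem list (the value after "cgroup_enable=")
 * names a given controller. */
static bool subsys_listed(const char *list, const char *name)
{
	size_t n = strlen(name);
	const char *p = list;

	while (p && *p) {
		const char *end = strchr(p, ',');
		size_t len = end ? (size_t)(end - p) : strlen(p);

		if (len == n && strncmp(p, name, n) == 0)
			return true;
		p = end ? end + 1 : NULL;
	}
	return false;
}
```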
-- 
Michal Hocko
SUSE Labs

* Re: [PATCH] mm, memcg: Optionally disable memcg by default using Kconfig
  2015-05-19 15:13     ` Mel Gorman
@ 2015-05-19 15:25       ` Vlastimil Babka
  2015-05-19 16:14         ` Johannes Weiner
  2015-05-19 15:27       ` Michal Hocko
  1 sibling, 1 reply; 14+ messages in thread
From: Vlastimil Babka @ 2015-05-19 15:25 UTC (permalink / raw)
  To: Mel Gorman, Michal Hocko
  Cc: Johannes Weiner, linux-mm, linux-kernel, Andrew Morton,
	Tejun Heo, cgroups

On 05/19/2015 05:13 PM, Mel Gorman wrote:
> On Tue, May 19, 2015 at 04:53:40PM +0200, Michal Hocko wrote:
>> On Tue 19-05-15 10:18:07, Johannes Weiner wrote:
>>> CC'ing Tejun and cgroups for the generic cgroup interface part
>>>
>>> On Tue, May 19, 2015 at 11:40:57AM +0100, Mel Gorman wrote:
>> [...]
>>>> /usr/src/linux-4.0-vanilla/mm/memcontrol.c                           6.6441   395842
>>>>    mem_cgroup_try_charge                                                        2.950%   175781
>>>
>>> Ouch.  Do you have a way to get the per-instruction breakdown of this?
>>> This function really isn't doing much.  I'll try to reproduce it here
>>> too, I haven't seen such high costs with pft in the past.
>>>
>>>>    try_charge                                                                   0.150%     8928
>>>>    get_mem_cgroup_from_mm                                                       0.121%     7184
>>
>> Indeed! try_charge + get_mem_cgroup_from_mm which I would expect to be
>> the biggest consumers here are below 10% of the mem_cgroup_try_charge.
>> Other than that the function doesn't do much else than some flags
>> queries and css_put...
>>
>> Do you have the full trace? Sorry for a stupid question but do inlines
>> from other header files get accounted to memcontrol.c?
>>
>
> The annotations for those functions look like with some very basic notes are
> as follows. Note that I've done almost no research on this. I just noticed
> that the memcg overhead was still there when looking for something else.
>
> ffffffff811c15f0 <mem_cgroup_try_charge>: /* mem_cgroup_try_charge total: 176903  2.9692 */
>     765  0.0128 :ffffffff811c15f0:       callq  ffffffff816435e0 <__fentry__>
>      78  0.0013 :ffffffff811c15f5:       push   %rbp
>    1185  0.0199 :ffffffff811c15f6:       mov    %rsp,%rbp
>     356  0.0060 :ffffffff811c15f9:       push   %r14
>     209  0.0035 :ffffffff811c15fb:       push   %r13
>    1599  0.0268 :ffffffff811c15fd:       push   %r12
>     320  0.0054 :ffffffff811c15ff:       mov    %rcx,%r12
>     305  0.0051 :ffffffff811c1602:       push   %rbx
>     325  0.0055 :ffffffff811c1603:       sub    $0x10,%rsp
>     878  0.0147 :ffffffff811c1607:       mov    0xb7501b(%rip),%ecx        # ffffffff81d36628 <memory_cgrp_subsys+0x68>
>     571  0.0096 :ffffffff811c160d:       test   %ecx,%ecx
>
> ### MEL: Function entry, check for mem_cgroup_disabled()
>
>
>                 :ffffffff811c160f:       je     ffffffff811c1630 <mem_cgroup_try_charge+0x40>
>                 :ffffffff811c1611:       xor    %eax,%eax
>                 :ffffffff811c1613:       xor    %ebx,%ebx
>       1 1.7e-05 :ffffffff811c1615:       mov    %rbx,(%r12)
>       7 1.2e-04 :ffffffff811c1619:       add    $0x10,%rsp
>    1211  0.0203 :ffffffff811c161d:       pop    %rbx
>       5 8.4e-05 :ffffffff811c161e:       pop    %r12
>       5 8.4e-05 :ffffffff811c1620:       pop    %r13
>    1249  0.0210 :ffffffff811c1622:       pop    %r14
>       7 1.2e-04 :ffffffff811c1624:       pop    %rbp
>       5 8.4e-05 :ffffffff811c1625:       retq
>                 :ffffffff811c1626:       nopw   %cs:0x0(%rax,%rax,1)
>     295  0.0050 :ffffffff811c1630:       mov    (%rdi),%rax
> 160703  2.6973 :ffffffff811c1633:       mov    %edx,%r13d
>
> #### MEL: I was surprised to see this atrocity. It's a PageSwapCache check

Looks like sampling is off by an instruction, because why would a reg->reg
mov take so long. So it's probably a cache miss on struct page, a pointer
to which is in rdi. Which is weird; I would expect memcg to be called on
struct pages that are already hot. It would also mean that if you don't
fetch the struct page from the memcg code, then the following code in
the caller will most likely work on the struct page and get the cache
miss anyway?

> #### /usr/src/linux-4.0-vanilla/./arch/x86/include/asm/bitops.h:311
> #### /usr/src/linux-4.0-vanilla/include/linux/page-flags.h:261
> #### /usr/src/linux-4.0-vanilla/mm/memcontrol.c:5473
> ####
> #### Everything after here is consistent small amounts of overhead just from
> #### being called a lot
>
>     179  0.0030 :ffffffff811c1636:       test   $0x10000,%eax
>                 :ffffffff811c163b:       je     ffffffff811c1648 <mem_cgroup_try_charge+0x58>
>                 :ffffffff811c163d:       xor    %eax,%eax
>                 :ffffffff811c163f:       xor    %ebx,%ebx
>                 :ffffffff811c1641:       cmpq   $0x0,0x38(%rdi)
>                 :ffffffff811c1646:       jne    ffffffff811c1615 <mem_cgroup_try_charge+0x25>
>    1343  0.0225 :ffffffff811c1648:       mov    (%rdi),%rax
>      26 4.4e-04 :ffffffff811c164b:       mov    $0x1,%r14d
>      24 4.0e-04 :ffffffff811c1651:       test   $0x40,%ah
>                 :ffffffff811c1654:       je     ffffffff811c1665 <mem_cgroup_try_charge+0x75>
>                 :ffffffff811c1656:       mov    (%rdi),%rax
>                 :ffffffff811c1659:       test   $0x40,%ah
>                 :ffffffff811c165c:       je     ffffffff811c1665 <mem_cgroup_try_charge+0x75>
>                 :ffffffff811c165e:       mov    0x68(%rdi),%rcx
>                 :ffffffff811c1662:       shl    %cl,%r14d
>    1225  0.0206 :ffffffff811c1665:       mov    0xb74f35(%rip),%eax        # ffffffff81d365a0 <do_swap_account>
>      66  0.0011 :ffffffff811c166b:       test   %eax,%eax
>                 :ffffffff811c166d:       jne    ffffffff811c16a8 <mem_cgroup_try_charge+0xb8>
>       3 5.0e-05 :ffffffff811c166f:       mov    %rsi,%rdi
>      22 3.7e-04 :ffffffff811c1672:       callq  ffffffff811bc920 <get_mem_cgroup_from_mm>
>    1291  0.0217 :ffffffff811c1677:       mov    %rax,%rbx
>       3 5.0e-05 :ffffffff811c167a:       mov    %r14d,%edx
>                 :ffffffff811c167d:       mov    %r13d,%esi
>      10 1.7e-04 :ffffffff811c1680:       mov    %rbx,%rdi
>    1380  0.0232 :ffffffff811c1683:       callq  ffffffff811c0950 <try_charge>
>      10 1.7e-04 :ffffffff811c1688:       testb  $0x1,0x74(%rbx)
>    1235  0.0207 :ffffffff811c168c:       je     ffffffff811c16d0 <mem_cgroup_try_charge+0xe0>
>       7 1.2e-04 :ffffffff811c168e:       cmp    $0xfffffffc,%eax
>                 :ffffffff811c1691:       jne    ffffffff811c1615 <mem_cgroup_try_charge+0x25>
>                 :ffffffff811c1693:       mov    0xb74f0e(%rip),%rbx        # ffffffff81d365a8 <root_mem_cgroup>
>                 :ffffffff811c169a:       xor    %eax,%eax
>                 :ffffffff811c169c:       jmpq   ffffffff811c1615 <mem_cgroup_try_charge+0x25>
>                 :ffffffff811c16a1:       nopl   0x0(%rax)
>                 :ffffffff811c16a8:       mov    (%rdi),%rax
>                 :ffffffff811c16ab:       test   $0x10000,%eax
>                 :ffffffff811c16b0:       je     ffffffff811c166f <mem_cgroup_try_charge+0x7f>
>                 :ffffffff811c16b2:       mov    %rsi,-0x28(%rbp)
>                 :ffffffff811c16b6:       callq  ffffffff811c0450 <try_get_mem_cgroup_from_page>
>                 :ffffffff811c16bb:       test   %rax,%rax
>                 :ffffffff811c16be:       mov    %rax,%rbx
>                 :ffffffff811c16c1:       mov    -0x28(%rbp),%rsi
>                 :ffffffff811c16c5:       jne    ffffffff811c167a <mem_cgroup_try_charge+0x8a>
>                 :ffffffff811c16c7:       jmp    ffffffff811c166f <mem_cgroup_try_charge+0x7f>
>                 :ffffffff811c16c9:       nopl   0x0(%rax)
>                 :ffffffff811c16d0:       mov    0x18(%rbx),%rdx
>                 :ffffffff811c16d4:       test   $0x3,%dl
>                 :ffffffff811c16d7:       jne    ffffffff811c16df <mem_cgroup_try_charge+0xef>
>                 :ffffffff811c16d9:       decq   %gs:(%rdx)
>                 :ffffffff811c16dd:       jmp    ffffffff811c168e <mem_cgroup_try_charge+0x9e>
>                 :ffffffff811c16df:       lea    0x10(%rbx),%rdi
>                 :ffffffff811c16e3:       lock subq $0x1,0x10(%rbx)
>                 :ffffffff811c16e9:       je     ffffffff811c16ed <mem_cgroup_try_charge+0xfd>
>                 :ffffffff811c16eb:       jmp    ffffffff811c168e <mem_cgroup_try_charge+0x9e>
>                 :ffffffff811c16ed:       mov    %eax,-0x28(%rbp)
>                 :ffffffff811c16f0:       callq  *0x20(%rbx)
>                 :ffffffff811c16f3:       mov    -0x28(%rbp),%eax
>                 :ffffffff811c16f6:       jmp    ffffffff811c168e <mem_cgroup_try_charge+0x9e>
>                 :ffffffff811c16f8:       nopl   0x0(%rax,%rax,1)
>
> ffffffff811bc920 <get_mem_cgroup_from_mm>: /* get_mem_cgroup_from_mm total:   7251  0.1217 */
> #### MEL: Nothing really big jumped out there at me.
>    1318  0.0221 :ffffffff811bc920:       callq  ffffffff816435e0 <__fentry__>
>      19 3.2e-04 :ffffffff811bc925:       push   %rbp
>      42 7.0e-04 :ffffffff811bc926:       mov    %rsp,%rbp
>    1278  0.0215 :ffffffff811bc929:       jmp    ffffffff811bc94b <get_mem_cgroup_from_mm+0x2b>
>                 :ffffffff811bc92b:       nopl   0x0(%rax,%rax,1)
>    1259  0.0211 :ffffffff811bc930:       testb  $0x1,0x74(%rdx)
>     161  0.0027 :ffffffff811bc934:       jne    ffffffff811bc980 <get_mem_cgroup_from_mm+0x60>
>                 :ffffffff811bc936:       mov    0x18(%rdx),%rax
>                 :ffffffff811bc93a:       test   $0x3,%al
>                 :ffffffff811bc93c:       jne    ffffffff811bc985 <get_mem_cgroup_from_mm+0x65>
>                 :ffffffff811bc93e:       incq   %gs:(%rax)
>                 :ffffffff811bc942:       mov    $0x1,%eax
>                 :ffffffff811bc947:       test   %al,%al
>                 :ffffffff811bc949:       jne    ffffffff811bc980 <get_mem_cgroup_from_mm+0x60>
>      13 2.2e-04 :ffffffff811bc94b:       test   %rdi,%rdi
>                 :ffffffff811bc94e:       je     ffffffff811bc96c <get_mem_cgroup_from_mm+0x4c>
>      47 7.9e-04 :ffffffff811bc950:       mov    0x340(%rdi),%rax
>    1410  0.0237 :ffffffff811bc957:       test   %rax,%rax
>                 :ffffffff811bc95a:       je     ffffffff811bc96c <get_mem_cgroup_from_mm+0x4c>
>      26 4.4e-04 :ffffffff811bc95c:       mov    0xca0(%rax),%rax
>     179  0.0030 :ffffffff811bc963:       mov    0x70(%rax),%rdx
>     174  0.0029 :ffffffff811bc967:       test   %rdx,%rdx
>                 :ffffffff811bc96a:       jne    ffffffff811bc930 <get_mem_cgroup_from_mm+0x10>
>                 :ffffffff811bc96c:       mov    0xb79c35(%rip),%rdx        # ffffffff81d365a8 <root_mem_cgroup>
>       1 1.7e-05 :ffffffff811bc973:       testb  $0x1,0x74(%rdx)
>                 :ffffffff811bc977:       je     ffffffff811bc936 <get_mem_cgroup_from_mm+0x16>
>                 :ffffffff811bc979:       nopl   0x0(%rax)
>    1299  0.0218 :ffffffff811bc980:       mov    %rdx,%rax
>       4 6.7e-05 :ffffffff811bc983:       pop    %rbp
>      21 3.5e-04 :ffffffff811bc984:       retq
>                 :ffffffff811bc985:       testb  $0x2,0x18(%rdx)
>                 :ffffffff811bc989:       jne    ffffffff811bc9d2 <get_mem_cgroup_from_mm+0xb2>
>                 :ffffffff811bc98b:       mov    0x10(%rdx),%rcx
>                 :ffffffff811bc98f:       test   %rcx,%rcx
>                 :ffffffff811bc992:       je     ffffffff811bc9d2 <get_mem_cgroup_from_mm+0xb2>
>                 :ffffffff811bc994:       lea    0x1(%rcx),%rsi
>                 :ffffffff811bc998:       lea    0x10(%rdx),%r8
>                 :ffffffff811bc99c:       mov    %rcx,%rax
>                 :ffffffff811bc99f:       lock cmpxchg %rsi,0x10(%rdx)
>                 :ffffffff811bc9a5:       cmp    %rcx,%rax
>                 :ffffffff811bc9a8:       mov    %rax,%rsi
>                 :ffffffff811bc9ab:       jne    ffffffff811bc9b4 <get_mem_cgroup_from_mm+0x94>
>                 :ffffffff811bc9ad:       mov    $0x1,%eax
>                 :ffffffff811bc9b2:       jmp    ffffffff811bc947 <get_mem_cgroup_from_mm+0x27>
>                 :ffffffff811bc9b4:       test   %rsi,%rsi
>                 :ffffffff811bc9b7:       je     ffffffff811bc9d2 <get_mem_cgroup_from_mm+0xb2>
>                 :ffffffff811bc9b9:       lea    0x1(%rsi),%rcx
>                 :ffffffff811bc9bd:       mov    %rsi,%rax
>                 :ffffffff811bc9c0:       lock cmpxchg %rcx,(%r8)
>                 :ffffffff811bc9c5:       cmp    %rax,%rsi
>                 :ffffffff811bc9c8:       je     ffffffff811bc9ad <get_mem_cgroup_from_mm+0x8d>
>                 :ffffffff811bc9ca:       mov    %rax,%rsi
>                 :ffffffff811bc9cd:       test   %rsi,%rsi
>                 :ffffffff811bc9d0:       jne    ffffffff811bc9b9 <get_mem_cgroup_from_mm+0x99>
>                 :ffffffff811bc9d2:       xor    %eax,%eax
>                 :ffffffff811bc9d4:       jmpq   ffffffff811bc947 <get_mem_cgroup_from_mm+0x27>
>


* Re: [PATCH] mm, memcg: Optionally disable memcg by default using Kconfig
  2015-05-19 15:13     ` Mel Gorman
  2015-05-19 15:25       ` Vlastimil Babka
@ 2015-05-19 15:27       ` Michal Hocko
  2015-05-19 15:41         ` Mel Gorman
  1 sibling, 1 reply; 14+ messages in thread
From: Michal Hocko @ 2015-05-19 15:27 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Johannes Weiner, linux-mm, linux-kernel, Andrew Morton,
	Tejun Heo, cgroups

On Tue 19-05-15 16:13:02, Mel Gorman wrote:
[...]
>                :ffffffff811c160f:       je     ffffffff811c1630 <mem_cgroup_try_charge+0x40>
>                :ffffffff811c1611:       xor    %eax,%eax
>                :ffffffff811c1613:       xor    %ebx,%ebx
>      1 1.7e-05 :ffffffff811c1615:       mov    %rbx,(%r12)
>      7 1.2e-04 :ffffffff811c1619:       add    $0x10,%rsp
>   1211  0.0203 :ffffffff811c161d:       pop    %rbx
>      5 8.4e-05 :ffffffff811c161e:       pop    %r12
>      5 8.4e-05 :ffffffff811c1620:       pop    %r13
>   1249  0.0210 :ffffffff811c1622:       pop    %r14
>      7 1.2e-04 :ffffffff811c1624:       pop    %rbp
>      5 8.4e-05 :ffffffff811c1625:       retq   
>                :ffffffff811c1626:       nopw   %cs:0x0(%rax,%rax,1)
>    295  0.0050 :ffffffff811c1630:       mov    (%rdi),%rax
> 160703  2.6973 :ffffffff811c1633:       mov    %edx,%r13d

Huh, what? Even if this was off by one and the preceding instruction
consumed the time, this would be reading from page->flags, but the page
should be hot by the time we got here, no?

> #### MEL: I was surprised to see this atrocity. It's a PageSwapCache check
> #### /usr/src/linux-4.0-vanilla/./arch/x86/include/asm/bitops.h:311
> #### /usr/src/linux-4.0-vanilla/include/linux/page-flags.h:261
> #### /usr/src/linux-4.0-vanilla/mm/memcontrol.c:5473
-- 
Michal Hocko
SUSE Labs

* Re: [PATCH] mm, memcg: Optionally disable memcg by default using Kconfig
  2015-05-19 15:27       ` Michal Hocko
@ 2015-05-19 15:41         ` Mel Gorman
  2015-05-19 16:04           ` Mel Gorman
  0 siblings, 1 reply; 14+ messages in thread
From: Mel Gorman @ 2015-05-19 15:41 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Johannes Weiner, linux-mm, linux-kernel, Andrew Morton,
	Tejun Heo, cgroups

On Tue, May 19, 2015 at 05:27:10PM +0200, Michal Hocko wrote:
> On Tue 19-05-15 16:13:02, Mel Gorman wrote:
> [...]
> >                :ffffffff811c160f:       je     ffffffff811c1630 <mem_cgroup_try_charge+0x40>
> >                :ffffffff811c1611:       xor    %eax,%eax
> >                :ffffffff811c1613:       xor    %ebx,%ebx
> >      1 1.7e-05 :ffffffff811c1615:       mov    %rbx,(%r12)
> >      7 1.2e-04 :ffffffff811c1619:       add    $0x10,%rsp
> >   1211  0.0203 :ffffffff811c161d:       pop    %rbx
> >      5 8.4e-05 :ffffffff811c161e:       pop    %r12
> >      5 8.4e-05 :ffffffff811c1620:       pop    %r13
> >   1249  0.0210 :ffffffff811c1622:       pop    %r14
> >      7 1.2e-04 :ffffffff811c1624:       pop    %rbp
> >      5 8.4e-05 :ffffffff811c1625:       retq   
> >                :ffffffff811c1626:       nopw   %cs:0x0(%rax,%rax,1)
> >    295  0.0050 :ffffffff811c1630:       mov    (%rdi),%rax
> > 160703  2.6973 :ffffffff811c1633:       mov    %edx,%r13d
> 
> Huh, what? Even if this was off by one and the preceding instruction has
> consumed the time. This would be reading from page->flags but the page
> should be hot by the time we got here, no?
> 

I would have expected so, but it's not the first time I've seen cases where
examining the flags was a costly instruction. I suspect it's due to an
ordering issue or, more likely, a frequent branch mispredict that is being
accounted against this instruction.

-- 
Mel Gorman
SUSE Labs

* Re: [PATCH] mm, memcg: Optionally disable memcg by default using Kconfig
  2015-05-19 15:41         ` Mel Gorman
@ 2015-05-19 16:04           ` Mel Gorman
  2015-05-19 19:32             ` Mel Gorman
  0 siblings, 1 reply; 14+ messages in thread
From: Mel Gorman @ 2015-05-19 16:04 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Johannes Weiner, linux-mm, linux-kernel, Andrew Morton,
	Tejun Heo, cgroups

On Tue, May 19, 2015 at 04:41:19PM +0100, Mel Gorman wrote:
> On Tue, May 19, 2015 at 05:27:10PM +0200, Michal Hocko wrote:
> > On Tue 19-05-15 16:13:02, Mel Gorman wrote:
> > [...]
> > >                :ffffffff811c160f:       je     ffffffff811c1630 <mem_cgroup_try_charge+0x40>
> > >                :ffffffff811c1611:       xor    %eax,%eax
> > >                :ffffffff811c1613:       xor    %ebx,%ebx
> > >      1 1.7e-05 :ffffffff811c1615:       mov    %rbx,(%r12)
> > >      7 1.2e-04 :ffffffff811c1619:       add    $0x10,%rsp
> > >   1211  0.0203 :ffffffff811c161d:       pop    %rbx
> > >      5 8.4e-05 :ffffffff811c161e:       pop    %r12
> > >      5 8.4e-05 :ffffffff811c1620:       pop    %r13
> > >   1249  0.0210 :ffffffff811c1622:       pop    %r14
> > >      7 1.2e-04 :ffffffff811c1624:       pop    %rbp
> > >      5 8.4e-05 :ffffffff811c1625:       retq   
> > >                :ffffffff811c1626:       nopw   %cs:0x0(%rax,%rax,1)
> > >    295  0.0050 :ffffffff811c1630:       mov    (%rdi),%rax
> > > 160703  2.6973 :ffffffff811c1633:       mov    %edx,%r13d
> > 
> > Huh, what? Even if this was off by one and the preceding instruction has
> > consumed the time. This would be reading from page->flags but the page
> > should be hot by the time we got here, no?
> > 
> 
> I would have expected so, but it's not the first time I've seen cases where
> examining the flags was a costly instruction. I suspect it's due to an
> ordering issue or, more likely, a frequent branch mispredict that is being
> accounted against this instruction.
> 
> 

Which is plausible, as forward branches are statically predicted not-taken,
but in this particular load that could be close to a 100% mispredict.
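If that is what's happening, the usual tool is the kernel-style likely()/unlikely() hints built on __builtin_expect, which let the compiler lay the rare arm out of line so the statically predicted fall-through is the common case. A minimal stand-alone sketch (these are local copies for illustration; the kernel's own definitions live in its compiler headers):

```c
#include <assert.h>

/* Stand-alone copies of the kernel-style hint macros. */
#define likely(x)	__builtin_expect(!!(x), 1)
#define unlikely(x)	__builtin_expect(!!(x), 0)

static int memcg_disabled_flag;	/* illustrative stand-in for the check */

/* With the hint, the disabled arm is moved out of line and the hot
 * (enabled) path stays on the fall-through, avoiding the near-always
 * mispredicted forward branch. */
static int charge_model(void)
{
	if (unlikely(memcg_disabled_flag))
		return 0;	/* cold: memcg disabled, nothing to do */
	return 1;		/* hot: proceed to the charge path */
}
```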

-- 
Mel Gorman
SUSE Labs

* Re: [PATCH] mm, memcg: Optionally disable memcg by default using Kconfig
  2015-05-19 15:25       ` Vlastimil Babka
@ 2015-05-19 16:14         ` Johannes Weiner
  0 siblings, 0 replies; 14+ messages in thread
From: Johannes Weiner @ 2015-05-19 16:14 UTC (permalink / raw)
  To: Vlastimil Babka
  Cc: Mel Gorman, Michal Hocko, linux-mm, linux-kernel, Andrew Morton,
	Tejun Heo, cgroups

On Tue, May 19, 2015 at 05:25:36PM +0200, Vlastimil Babka wrote:
> On 05/19/2015 05:13 PM, Mel Gorman wrote:
> >### MEL: Function entry, check for mem_cgroup_disabled()
> >
> >
> >                :ffffffff811c160f:       je     ffffffff811c1630 <mem_cgroup_try_charge+0x40>
> >                :ffffffff811c1611:       xor    %eax,%eax
> >                :ffffffff811c1613:       xor    %ebx,%ebx
> >      1 1.7e-05 :ffffffff811c1615:       mov    %rbx,(%r12)
> >      7 1.2e-04 :ffffffff811c1619:       add    $0x10,%rsp
> >   1211  0.0203 :ffffffff811c161d:       pop    %rbx
> >      5 8.4e-05 :ffffffff811c161e:       pop    %r12
> >      5 8.4e-05 :ffffffff811c1620:       pop    %r13
> >   1249  0.0210 :ffffffff811c1622:       pop    %r14
> >      7 1.2e-04 :ffffffff811c1624:       pop    %rbp
> >      5 8.4e-05 :ffffffff811c1625:       retq
> >                :ffffffff811c1626:       nopw   %cs:0x0(%rax,%rax,1)
> >    295  0.0050 :ffffffff811c1630:       mov    (%rdi),%rax
> >160703  2.6973 :ffffffff811c1633:       mov    %edx,%r13d
> >
> >#### MEL: I was surprised to see this atrocity. It's a PageSwapCache check
> 
> Looks like sampling is off by an instruction, because why would a reg->reg mov
> take so long. So it's probably a cache miss on struct page, pointer to which
> is in rdi. Which is weird, I would expect memcg to be called on struct pages
> that are already hot.

Yeah, anonymous faults do __SetPageUptodate() right before passing the
page into mem_cgroup_try_charge().  page->flags should be hot.

> It would also mean that if you don't fetch the struct
> page from the memcg code, then the following code in the caller will most
> likely work on the struct page and get the cache miss anyway?

Which is why the runtime reduction doesn't match the profile
reduction.  The cost seems to get shifted somewhere else.

* Re: [PATCH] mm, memcg: Optionally disable memcg by default using Kconfig
  2015-05-19 15:15     ` Michal Hocko
@ 2015-05-19 17:09       ` Ben Hutchings
  0 siblings, 0 replies; 14+ messages in thread
From: Ben Hutchings @ 2015-05-19 17:09 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Mel Gorman, Johannes Weiner, linux-mm, linux-kernel,
	Andrew Morton, Tejun Heo, cgroups, Debian kernel maintainers

[-- Attachment #1: Type: text/plain, Size: 1994 bytes --]

On Tue, 2015-05-19 at 17:15 +0200, Michal Hocko wrote:
> [Let's CC Ben here - the email thread has started here:
> http://marc.info/?l=linux-mm&m=143203206402073&w=2 and it seems Debian
> is disabling memcg controller already so this might be of your interest]
> 
> On Tue 19-05-15 15:43:45, Mel Gorman wrote:
> [...]
> > After I wrote the patch, I spotted that Debian apparently already
> > does something like this and by coincidence they matched the
> > parameter name and values. See the memory controller instructions on
> > https://wiki.debian.org/LXC#Prepare_the_host . So in this case at least
> > upstream would match something that at least one distro in the field
> > already uses.
> 
> I've read through
> https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=534964 and it seems
> that the primary motivation for the runtime disabling was the _memory_
> overhead of the struct page_cgroup
> (https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=534964#152). This is
> no longer the case since 1306a85aed3e ("mm: embed the memcg pointer
> directly into struct page") merged in 3.19.
> 
> I can see some point in disabling the memcg due to runtime overhead.

I was also concerned about runtime overhead.

> There will always be some, albeit hard to notice. If a user really needs
> this to happen there is a command line option for that. The question is
> who would do CONFIG_MEMCG && !MEMCG_DEFAULT_ENABLED.  Do you expect any
> distributions to go that way?
> Ben, would you welcome such a change upstream or is there a reason to
> change the Debian kernel runtime default now that the memory overhead is
> mostly gone (for 3.19+ kernels of course)?

I have been meaning to reevaluate this as I know the overhead has been
reduced.  Given Mel's benchmark results, I favour keeping it disabled by
default in Debian.  So I would welcome this change.

Ben.

-- 
Ben Hutchings
I'm not a reverse psychological virus.  Please don't copy me into your sig.


^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH] mm, memcg: Optionally disable memcg by default using Kconfig
  2015-05-19 16:04           ` Mel Gorman
@ 2015-05-19 19:32             ` Mel Gorman
  0 siblings, 0 replies; 14+ messages in thread
From: Mel Gorman @ 2015-05-19 19:32 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Johannes Weiner, linux-mm, linux-kernel, Andrew Morton,
	Tejun Heo, cgroups

On Tue, May 19, 2015 at 05:04:04PM +0100, Mel Gorman wrote:
> On Tue, May 19, 2015 at 04:41:19PM +0100, Mel Gorman wrote:
> > On Tue, May 19, 2015 at 05:27:10PM +0200, Michal Hocko wrote:
> > > On Tue 19-05-15 16:13:02, Mel Gorman wrote:
> > > [...]
> > > >                :ffffffff811c160f:       je     ffffffff811c1630 <mem_cgroup_try_charge+0x40>
> > > >                :ffffffff811c1611:       xor    %eax,%eax
> > > >                :ffffffff811c1613:       xor    %ebx,%ebx
> > > >      1 1.7e-05 :ffffffff811c1615:       mov    %rbx,(%r12)
> > > >      7 1.2e-04 :ffffffff811c1619:       add    $0x10,%rsp
> > > >   1211  0.0203 :ffffffff811c161d:       pop    %rbx
> > > >      5 8.4e-05 :ffffffff811c161e:       pop    %r12
> > > >      5 8.4e-05 :ffffffff811c1620:       pop    %r13
> > > >   1249  0.0210 :ffffffff811c1622:       pop    %r14
> > > >      7 1.2e-04 :ffffffff811c1624:       pop    %rbp
> > > >      5 8.4e-05 :ffffffff811c1625:       retq   
> > > >                :ffffffff811c1626:       nopw   %cs:0x0(%rax,%rax,1)
> > > >    295  0.0050 :ffffffff811c1630:       mov    (%rdi),%rax
> > > > 160703  2.6973 :ffffffff811c1633:       mov    %edx,%r13d
> > > 
> > > Huh, what? Even if this was off by one and the preceding instruction has
> > > consumed the time. This would be reading from page->flags but the page
> > > should be hot by the time we got here, no?
> > > 
> > 
> > I would have expected so but it's not the first time I've seen cases where
> > examining the flags was a costly instruction. I suspect it's due to an
> > ordering issue or more likely, a frequent branch mispredict that is being
> > accounted for against this instruction.
> > 
> 
> Which is plausible, as forward branches are statically predicted not-taken,
> but in this particular load that could be close to a 100% mispredict.
> 

Plausible but wrong. The responsible instruction was too far away, so it
looks more like an ordering issue where the PageSwapCache check is being
ordered against setting the page up to date. __SetPageUptodate is a
barrier that is necessary before the PTE is established and visible, but it
does not have to be ordered against the memcg charging. In fact it makes
sense to charge first, in case the charge fails and the page is never
made visible. Just adjusting that reduces the cost to

/usr/src/linux-4.0-chargefirst-v1r1/mm/memcontrol.c                  3.8547   228233
  __mem_cgroup_count_vm_event                                                  1.172%    69393
  mem_cgroup_page_lruvec                                                       0.464%    27456
  mem_cgroup_commit_charge                                                     0.390%    23072
  uncharge_list                                                                0.327%    19370
  mem_cgroup_update_lru_size                                                   0.284%    16831
  get_mem_cgroup_from_mm                                                       0.262%    15523
  mem_cgroup_try_charge                                                        0.256%    15147
  memcg_check_events                                                           0.222%    13120
  mem_cgroup_charge_statistics.isra.22                                         0.194%    11470
  commit_charge                                                                0.145%     8615
  try_charge                                                                   0.139%     8236

The big sinner there is updating the per-cpu stats -- root cgroup stats, I
assume? To refresh, a complete disable looks like

/usr/src/linux-4.0-nomemcg-v1r1/mm/memcontrol.c                      0.4834    27511
  mem_cgroup_page_lruvec                                                       0.161%     9172
  mem_cgroup_update_lru_size                                                   0.154%     8794
  mem_cgroup_try_charge                                                        0.126%     7194
  mem_cgroup_commit_charge                                                     0.041%     2351

Still, 6.64% down to 3.85% is better than a kick in the head. Unprofiled
performance looks like

pft faults
                                       4.0.0                  4.0.0                 4.0.0
                                     vanilla             nomemcg-v1        chargefirst-v1
Hmean    faults/cpu-1 1443258.1051 (  0.00%) 1530574.6033 (  6.05%) 1487623.0037 (  3.07%)
Hmean    faults/cpu-3 1340385.9270 (  0.00%) 1375156.5834 (  2.59%) 1351401.2578 (  0.82%)
Hmean    faults/cpu-5  875599.0222 (  0.00%)  876217.9211 (  0.07%)  876122.6489 (  0.06%)
Hmean    faults/cpu-7  601146.6726 (  0.00%)  599068.4360 ( -0.35%)  600944.9229 ( -0.03%)
Hmean    faults/cpu-8  510728.2754 (  0.00%)  509887.9960 ( -0.16%)  510906.3818 (  0.03%)
Hmean    faults/sec-1 1432084.7845 (  0.00%) 1518566.3541 (  6.04%) 1475994.2194 (  3.07%)
Hmean    faults/sec-3 3943818.1437 (  0.00%) 4036918.0217 (  2.36%) 3973070.2159 (  0.74%)
Hmean    faults/sec-5 3877573.5867 (  0.00%) 3922745.9207 (  1.16%) 3891705.1749 (  0.36%)
Hmean    faults/sec-7 3991832.0418 (  0.00%) 3990670.8481 ( -0.03%) 3989110.4674 ( -0.07%)
Hmean    faults/sec-8 3987189.8167 (  0.00%) 3978842.8107 ( -0.21%) 3981011.2936 ( -0.15%)

Very minor boost. The same reordering looks like it would also suit
do_wp_page. I'll do that, retest, put some lipstick on the patches and
post them tomorrow or the day after. The reordering one probably makes sense
anyway; the default disabling of memcg still has merit, but if that
charging of the root group can be eliminated then it might be pointless.

-- 
Mel Gorman
SUSE Labs

^ permalink raw reply	[flat|nested] 14+ messages in thread

end of thread, other threads:[~2015-05-19 19:32 UTC | newest]

Thread overview: 14+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2015-05-19 10:40 [PATCH] mm, memcg: Optionally disable memcg by default using Kconfig Mel Gorman
2015-05-19 14:18 ` Johannes Weiner
2015-05-19 14:43   ` Mel Gorman
2015-05-19 15:15     ` Michal Hocko
2015-05-19 17:09       ` Ben Hutchings
2015-05-19 14:53   ` Michal Hocko
2015-05-19 15:12     ` Vlastimil Babka
2015-05-19 15:13     ` Mel Gorman
2015-05-19 15:25       ` Vlastimil Babka
2015-05-19 16:14         ` Johannes Weiner
2015-05-19 15:27       ` Michal Hocko
2015-05-19 15:41         ` Mel Gorman
2015-05-19 16:04           ` Mel Gorman
2015-05-19 19:32             ` Mel Gorman
