* [PATCH v3 0/7]  per-cgroup tcp buffer pressure settings
From: Glauber Costa @ 2011-09-19  0:56 UTC (permalink / raw)
  To: linux-kernel
  Cc: paul, lizf, kamezawa.hiroyu, ebiederm, davem, gthelen, netdev,
	linux-mm, kirill

[[ v3: merge Kirill's suggestions, + a destroy-related bugfix ]]

This patch series introduces per-cgroup tcp buffer limits. It allows
sysadmins to specify a maximum amount of kernel memory that
tcp connections can use at any point in time. TCP is the main interest
of this work, but extending it to other protocols would be easy.

For this to work, I am hooking it into memcg, after the introduction of
an extension for tracking and controlling objects in kernel memory.
Since such objects are usually not allocated at page granularity, and are
fundamentally different from userspace memory (not swappable, can't
overcommit), they need a special place inside the Memory Controller.

Right now, the kmem extension is quite basic: it just lays down the
infrastructure for the ongoing work. It does not yet account allocated
kernel memory; I preferred to keep this series simple and leave that
accounting to the slab-allocation patches when they arrive.

What it does is piggyback on the memory control mechanism already present
in /proc/sys/net/ipv4/tcp_mem. There is a soft limit and a hard limit that
suppress allocation when reached. For each non-root cgroup, however, the
file kmem.tcp_maxmem is used to cap those values.
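
For reference, here is a minimal standalone sketch of the three-threshold
semantics this series builds on, mirroring the prot_mem[0..2] checks that
__sk_mem_schedule performs later in the series; the names and values are
illustrative, not the kernel's actual symbols:

#include <stdio.h>

/* pages: low watermark, soft (pressure) limit, hard limit */
static long prot_mem[3] = { 4096, 8192, 16384 };
static long allocated;
static int memory_pressure;

/* Returns 1 if an allocation of @amt pages may proceed. */
static int may_allocate(long amt)
{
	allocated += amt;
	if (allocated > prot_mem[2]) {		/* over hard limit: suppress */
		allocated -= amt;
		return 0;
	}
	if (allocated <= prot_mem[0])		/* under low limit: relax */
		memory_pressure = 0;
	else if (allocated > prot_mem[1])	/* over soft limit: pressure */
		memory_pressure = 1;
	return 1;
}

int main(void)
{
	int ok;

	ok = may_allocate(9000);
	printf("first alloc:  ok=%d pressure=%d\n", ok, memory_pressure);
	ok = may_allocate(9000);
	printf("second alloc: ok=%d pressure=%d\n", ok, memory_pressure);
	return 0;
}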

The usage I have in mind here is containers. Each container will
define its own values for soft and hard limits, but none of them can
possibly be bigger than the value the box's sysadmin specified from
the outside.

To test for any performance impact of this patch set, I used netperf's
TCP_RR benchmark on localhost, so both recv and snd are in action.
For this iteration, I am using the 1% confidence interval as suggested by Rick.

The command line used was:

  ./src/netperf -t TCP_RR -H localhost -i 30,3 -I 99,1

and the results were:

Without the patch
=================

Local /Remote
Socket Size   Request  Resp.   Elapsed  Trans.
Send   Recv   Size     Size    Time     Rate
bytes  Bytes  bytes    bytes   secs.    per sec

16384  87380  1        1       10.00    35367.21
16384  87380


With the patch
==============

Local /Remote
Socket Size   Request  Resp.   Elapsed  Trans.
Send   Recv   Size     Size    Time     Rate
bytes  Bytes  bytes    bytes   secs.    per sec

16384  87380  1        1       10.00    35477.10
16384  87380

The difference is less than 0.5%.

A simple test with 1000-level nesting yields more or less the same
difference:

1000-level nesting
==================

Local /Remote
Socket Size   Request  Resp.   Elapsed  Trans.
Send   Recv   Size     Size    Time     Rate         
bytes  Bytes  bytes    bytes   secs.    per sec   

16384  87380  1        1       10.00    35305.21   
16384  87380 


Glauber Costa (7):
  Basic kernel memory functionality for the Memory Controller
  socket: initial cgroup code.
  foundations of per-cgroup memory pressure controlling.
  per-cgroup tcp buffers control
  per-netns ipv4 sysctl_tcp_mem
  tcp buffer limitation: per-cgroup limit
  Display current tcp memory allocation in kmem cgroup

 Documentation/cgroups/memory.txt |   32 +++-
 crypto/af_alg.c                  |    7 +-
 include/linux/memcontrol.h       |   58 ++++++
 include/net/netns/ipv4.h         |    1 +
 include/net/sock.h               |  127 ++++++++++++-
 include/net/tcp.h                |   14 +-
 include/net/udp.h                |    3 +-
 include/trace/events/sock.h      |   10 +-
 init/Kconfig                     |   11 +
 mm/memcontrol.c                  |  406 ++++++++++++++++++++++++++++++++++++--
 net/core/sock.c                  |  103 +++++++---
 net/decnet/af_decnet.c           |   21 ++-
 net/ipv4/proc.c                  |    7 +-
 net/ipv4/sysctl_net_ipv4.c       |   71 ++++++-
 net/ipv4/tcp.c                   |   58 ++++---
 net/ipv4/tcp_input.c             |   12 +-
 net/ipv4/tcp_ipv4.c              |   18 +-
 net/ipv4/tcp_output.c            |    2 +-
 net/ipv4/tcp_timer.c             |    2 +-
 net/ipv4/udp.c                   |   20 ++-
 net/ipv6/tcp_ipv6.c              |   15 +-
 net/ipv6/udp.c                   |    4 +-
 net/sctp/socket.c                |   35 +++-
 23 files changed, 904 insertions(+), 133 deletions(-)

-- 
1.7.6



* [PATCH v3 1/7] Basic kernel memory functionality for the Memory Controller
From: Glauber Costa @ 2011-09-19  0:56 UTC (permalink / raw)
  To: linux-kernel
  Cc: paul, lizf, kamezawa.hiroyu, ebiederm, davem, gthelen, netdev,
	linux-mm, kirill, Glauber Costa

This patch lays down the foundation for the kernel memory component
of the Memory Controller.

As of today, I am only laying down the following files:

 * memory.independent_kmem_limit
 * memory.kmem.limit_in_bytes (currently ignored)
 * memory.kmem.usage_in_bytes (always zero)
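
As a hedged usage sketch, the snippet below toggles independent kmem
accounting for a group and reads back the (at this stage always zero)
usage counter through the cgroup filesystem. The mount point and group
name are assumptions; the memcg hierarchy may be mounted anywhere:

#include <stdio.h>

int main(void)
{
	/* assumed mount point and group; adjust to your setup */
	const char *base = "/cgroups/memory/mygroup";
	char path[256], buf[64];
	FILE *f;

	/* make kmem limits independent of the user memory limit */
	snprintf(path, sizeof(path), "%s/memory.independent_kmem_limit", base);
	f = fopen(path, "w");
	if (f) {
		fputs("1\n", f);
		fclose(f);
	}

	/* read current kmem usage (always zero until later patches) */
	snprintf(path, sizeof(path), "%s/memory.kmem.usage_in_bytes", base);
	f = fopen(path, "r");
	if (f) {
		if (fgets(buf, sizeof(buf), f))
			printf("kmem usage: %s", buf);
		fclose(f);
	}
	return 0;
}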

Signed-off-by: Glauber Costa <glommer@parallels.com>
CC: Paul Menage <paul@paulmenage.org>
CC: Greg Thelen <gthelen@google.com>
---
 Documentation/cgroups/memory.txt |   30 +++++++++-
 init/Kconfig                     |   11 ++++
 mm/memcontrol.c                  |  115 ++++++++++++++++++++++++++++++++++++--
 3 files changed, 148 insertions(+), 8 deletions(-)

diff --git a/Documentation/cgroups/memory.txt b/Documentation/cgroups/memory.txt
index 6f3c598..6f1954a 100644
--- a/Documentation/cgroups/memory.txt
+++ b/Documentation/cgroups/memory.txt
@@ -44,8 +44,9 @@ Features:
  - oom-killer disable knob and oom-notifier
  - Root cgroup has no limit controls.
 
- Kernel memory and Hugepages are not under control yet. We just manage
- pages on LRU. To add more controls, we have to take care of performance.
+ Hugepages are not under control yet. We just manage pages on LRU. To add more
+ controls, we have to take care of performance. Kernel memory support is work
+ in progress, and the current version provides basic functionality.
 
 Brief summary of control files.
 
@@ -56,8 +57,11 @@ Brief summary of control files.
 				 (See 5.5 for details)
  memory.memsw.usage_in_bytes	 # show current res_counter usage for memory+Swap
 				 (See 5.5 for details)
+ memory.kmem.usage_in_bytes	 # show current res_counter usage for kmem only.
+				 (See 2.7 for details)
  memory.limit_in_bytes		 # set/show limit of memory usage
  memory.memsw.limit_in_bytes	 # set/show limit of memory+Swap usage
+ memory.kmem.limit_in_bytes	 # if allowed, set/show limit of kernel memory
  memory.failcnt			 # show the number of memory usage hits limits
  memory.memsw.failcnt		 # show the number of memory+Swap hits limits
  memory.max_usage_in_bytes	 # show max memory usage recorded
@@ -72,6 +76,9 @@ Brief summary of control files.
  memory.oom_control		 # set/show oom controls.
  memory.numa_stat		 # show the number of memory usage per numa node
 
+ memory.independent_kmem_limit	 # select whether or not kernel memory limits are
+				   independent of user limits
+
 1. History
 
 The memory controller has a long history. A request for comments for the memory
@@ -255,6 +262,25 @@ When oom event notifier is registered, event will be delivered.
   per-zone-per-cgroup LRU (cgroup's private LRU) is just guarded by
   zone->lru_lock, it has no lock of its own.
 
+2.7 Kernel Memory Extension (CONFIG_CGROUP_MEM_RES_CTLR_KMEM)
+
+ With the Kernel memory extension, the Memory Controller is able to limit
+the amount of kernel memory used by the system. Kernel memory is fundamentally
+different from user memory, since it can't be swapped out, which makes it
+possible to DoS the system by consuming too much of this precious resource.
+Kernel memory limits are not imposed for the root cgroup.
+
+Memory limits as specified by the standard Memory Controller may or may not
+take kernel memory into consideration. This is achieved through the file
+memory.independent_kmem_limit. A value different from 0 will allow for kernel
+memory to be controlled separately.
+
+When kernel memory limits are not independent, the limit values set in
+memory.kmem files are ignored.
+
+Currently no soft limit is implemented for kernel memory. It is future work
+to trigger slab reclaim when those limits are reached.
+
 3. User Interface
 
 0. Configuration
diff --git a/init/Kconfig b/init/Kconfig
index d627783..49e5839 100644
--- a/init/Kconfig
+++ b/init/Kconfig
@@ -689,6 +689,17 @@ config CGROUP_MEM_RES_CTLR_SWAP_ENABLED
 	  For those who want to have the feature enabled by default should
 	  select this option (if, for some reason, they need to disable it
 	  then swapaccount=0 does the trick).
+config CGROUP_MEM_RES_CTLR_KMEM
+	bool "Memory Resource Controller Kernel Memory accounting"
+	depends on CGROUP_MEM_RES_CTLR
+	default y
+	help
+	  The Kernel Memory extension for Memory Resource Controller can limit
+	  the amount of memory used by kernel objects in the system. Those are
+	  fundamentally different from the entities handled by the standard
+	  Memory Controller, which are page-based, and can be swapped. Users of
+	  the kmem extension can use it to guarantee that no group of processes
+	  will ever exhaust kernel resources alone.
 
 config CGROUP_PERF
 	bool "Enable perf_event per-cpu per-container group (cgroup) monitoring"
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index ebd1e86..d32e931 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -73,7 +73,11 @@ static int really_do_swap_account __initdata = 0;
 #define do_swap_account		(0)
 #endif
 
-
+#ifdef CONFIG_CGROUP_MEM_RES_CTLR_KMEM
+int do_kmem_account __read_mostly = 1;
+#else
+#define do_kmem_account		0
+#endif
 /*
  * Statistics for memory cgroup.
  */
@@ -270,6 +274,10 @@ struct mem_cgroup {
 	 */
 	struct res_counter memsw;
 	/*
+	 * the counter to account for kmem usage.
+	 */
+	struct res_counter kmem;
+	/*
 	 * Per cgroup active and inactive list, similar to the
 	 * per zone LRU lists.
 	 */
@@ -321,6 +329,11 @@ struct mem_cgroup {
 	 */
 	unsigned long 	move_charge_at_immigrate;
 	/*
+	 * Should kernel memory limits be established independently
+	 * from user memory?
+	 */
+	int		kmem_independent;
+	/*
 	 * percpu counter.
 	 */
 	struct mem_cgroup_stat_cpu *stat;
@@ -388,9 +401,14 @@ enum charge_type {
 };
 
 /* for encoding cft->private value on file */
-#define _MEM			(0)
-#define _MEMSWAP		(1)
-#define _OOM_TYPE		(2)
+
+enum mem_type {
+	_MEM = 0,
+	_MEMSWAP,
+	_OOM_TYPE,
+	_KMEM,
+};
+
 #define MEMFILE_PRIVATE(x, val)	(((x) << 16) | (val))
 #define MEMFILE_TYPE(val)	(((val) >> 16) & 0xffff)
 #define MEMFILE_ATTR(val)	((val) & 0xffff)
@@ -3943,10 +3961,15 @@ static inline u64 mem_cgroup_usage(struct mem_cgroup *mem, bool swap)
 	u64 val;
 
 	if (!mem_cgroup_is_root(mem)) {
+		val = 0;
+		if (!mem->kmem_independent)
+			val = res_counter_read_u64(&mem->kmem, RES_USAGE);
 		if (!swap)
-			return res_counter_read_u64(&mem->res, RES_USAGE);
+			val += res_counter_read_u64(&mem->res, RES_USAGE);
 		else
-			return res_counter_read_u64(&mem->memsw, RES_USAGE);
+			val += res_counter_read_u64(&mem->memsw, RES_USAGE);
+
+		return val;
 	}
 
 	val = mem_cgroup_recursive_stat(mem, MEM_CGROUP_STAT_CACHE);
@@ -3979,6 +4002,10 @@ static u64 mem_cgroup_read(struct cgroup *cont, struct cftype *cft)
 		else
 			val = res_counter_read_u64(&mem->memsw, name);
 		break;
+	case _KMEM:
+		val = res_counter_read_u64(&mem->kmem, name);
+		break;
+
 	default:
 		BUG();
 		break;
@@ -4756,6 +4783,21 @@ static int mem_cgroup_reset_vmscan_stat(struct cgroup *cgrp,
 	return 0;
 }
 
+#ifdef CONFIG_CGROUP_MEM_RES_CTLR_KMEM
+static u64 kmem_limit_independent_read(struct cgroup *cont, struct cftype *cft)
+{
+	return mem_cgroup_from_cont(cont)->kmem_independent;
+}
+
+static int kmem_limit_independent_write(struct cgroup *cont, struct cftype *cft,
+					u64 val)
+{
+	cgroup_lock();
+	mem_cgroup_from_cont(cont)->kmem_independent = !!val;
+	cgroup_unlock();
+	return 0;
+}
+#endif
 
 static struct cftype mem_cgroup_files[] = {
 	{
@@ -4877,6 +4919,47 @@ static int register_memsw_files(struct cgroup *cont, struct cgroup_subsys *ss)
 }
 #endif
 
+
+#ifdef CONFIG_CGROUP_MEM_RES_CTLR_KMEM
+static struct cftype kmem_cgroup_files[] = {
+	{
+		.name = "independent_kmem_limit",
+		.read_u64 = kmem_limit_independent_read,
+		.write_u64 = kmem_limit_independent_write,
+	},
+	{
+		.name = "kmem.usage_in_bytes",
+		.private = MEMFILE_PRIVATE(_KMEM, RES_USAGE),
+		.read_u64 = mem_cgroup_read,
+	},
+	{
+		.name = "kmem.limit_in_bytes",
+		.private = MEMFILE_PRIVATE(_KMEM, RES_LIMIT),
+		.read_u64 = mem_cgroup_read,
+	},
+};
+
+static int register_kmem_files(struct cgroup *cont, struct cgroup_subsys *ss)
+{
+	struct mem_cgroup *mem = mem_cgroup_from_cont(cont);
+	int ret = 0;
+
+	if (!do_kmem_account)
+		return 0;
+
+	if (!mem_cgroup_is_root(mem))
+		ret = cgroup_add_files(cont, ss, kmem_cgroup_files,
+					ARRAY_SIZE(kmem_cgroup_files));
+	return ret;
+};
+
+#else
+static int register_kmem_files(struct cgroup *cont, struct cgroup_subsys *ss)
+{
+	return 0;
+}
+#endif
+
 static int alloc_mem_cgroup_per_zone_info(struct mem_cgroup *mem, int node)
 {
 	struct mem_cgroup_per_node *pn;
@@ -5075,6 +5158,7 @@ mem_cgroup_create(struct cgroup_subsys *ss, struct cgroup *cont)
 	if (parent && parent->use_hierarchy) {
 		res_counter_init(&mem->res, &parent->res);
 		res_counter_init(&mem->memsw, &parent->memsw);
+		res_counter_init(&mem->kmem, &parent->kmem);
 		/*
 		 * We increment refcnt of the parent to ensure that we can
 		 * safely access it on res_counter_charge/uncharge.
@@ -5085,6 +5169,7 @@ mem_cgroup_create(struct cgroup_subsys *ss, struct cgroup *cont)
 	} else {
 		res_counter_init(&mem->res, NULL);
 		res_counter_init(&mem->memsw, NULL);
+		res_counter_init(&mem->kmem, NULL);
 	}
 	mem->last_scanned_child = 0;
 	mem->last_scanned_node = MAX_NUMNODES;
@@ -5129,6 +5214,10 @@ static int mem_cgroup_populate(struct cgroup_subsys *ss,
 
 	if (!ret)
 		ret = register_memsw_files(cont, ss);
+
+	if (!ret)
+		ret = register_kmem_files(cont, ss);
+
 	return ret;
 }
 
@@ -5665,3 +5754,17 @@ static int __init enable_swap_account(char *s)
 __setup("swapaccount=", enable_swap_account);
 
 #endif
+
+#ifdef CONFIG_CGROUP_MEM_RES_CTLR_KMEM
+static int __init disable_kmem_account(char *s)
+{
+	/* consider enabled if no parameter or 1 is given */
+	if (!strcmp(s, "1"))
+		do_kmem_account = 1;
+	else if (!strcmp(s, "0"))
+		do_kmem_account = 0;
+	return 1;
+}
+__setup("kmemaccount=", disable_kmem_account);
+
+#endif
-- 
1.7.6



* [PATCH v3 2/7] socket: initial cgroup code.
From: Glauber Costa @ 2011-09-19  0:56 UTC (permalink / raw)
  To: linux-kernel
  Cc: paul, lizf, kamezawa.hiroyu, ebiederm, davem, gthelen, netdev,
	linux-mm, kirill, Glauber Costa

We aim to control the amount of kernel memory pinned at any
time by tcp sockets. To lay the foundations for this work,
this patch adds a kmem cgroup pointer to the socket
structure.
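
A minimal sketch of the "stamp at creation" pattern used here: the
owning group is resolved once when the socket is created and cached in
the object, so later charging never needs to look at the (possibly
migrated) task again. Types and names below are illustrative, not the
patch's exact declarations:

#include <stdio.h>

struct group {
	const char *name;
};

struct sock_sketch {
	struct group *sk_cgrp;	/* cached owner, set once at creation */
};

/* stand-in for mem_cgroup_from_task(current) */
static struct group current_group = { "container0" };

/* analogous in spirit to sock_update_memcg() below */
static void sock_update_group(struct sock_sketch *sk)
{
	sk->sk_cgrp = &current_group;
}

int main(void)
{
	struct sock_sketch sk = { 0 };

	sock_update_group(&sk);
	/* even if the task later moves groups, the socket keeps this one */
	printf("socket charged to: %s\n", sk.sk_cgrp->name);
	return 0;
}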

Signed-off-by: Glauber Costa <glommer@parallels.com>
CC: David S. Miller <davem@davemloft.net>
CC: Hiroyuki Kamezawa <kamezawa.hiroyu@jp.fujitsu.com>
CC: Eric W. Biederman <ebiederm@xmission.com>
---
 include/linux/memcontrol.h |   15 +++++++++++++++
 include/net/sock.h         |    2 ++
 mm/memcontrol.c            |   33 +++++++++++++++++++++++++++++++++
 net/core/sock.c            |    3 +++
 4 files changed, 53 insertions(+), 0 deletions(-)

diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index 3b535db..2cb9226 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -395,5 +395,20 @@ mem_cgroup_print_bad_page(struct page *page)
 }
 #endif
 
+#ifdef CONFIG_INET
+struct sock;
+#ifdef CONFIG_CGROUP_MEM_RES_CTLR_KMEM
+void sock_update_memcg(struct sock *sk);
+void sock_release_memcg(struct sock *sk);
+
+#else
+static inline void sock_update_memcg(struct sock *sk)
+{
+}
+static inline void sock_release_memcg(struct sock *sk)
+{
+}
+#endif /* CONFIG_CGROUP_MEM_RES_CTLR_KMEM */
+#endif /* CONFIG_INET */
 #endif /* _LINUX_MEMCONTROL_H */
 
diff --git a/include/net/sock.h b/include/net/sock.h
index 8e4062f..afe1467 100644
--- a/include/net/sock.h
+++ b/include/net/sock.h
@@ -228,6 +228,7 @@ struct sock_common {
   *	@sk_security: used by security modules
   *	@sk_mark: generic packet mark
   *	@sk_classid: this socket's cgroup classid
+  *	@sk_cgrp: this socket's kernel memory (kmem) cgroup
   *	@sk_write_pending: a write to stream socket waits to start
   *	@sk_state_change: callback to indicate change in the state of the sock
   *	@sk_data_ready: callback to indicate there is data to be processed
@@ -339,6 +340,7 @@ struct sock {
 #endif
 	__u32			sk_mark;
 	u32			sk_classid;
+	struct mem_cgroup	*sk_cgrp;
 	void			(*sk_state_change)(struct sock *sk);
 	void			(*sk_data_ready)(struct sock *sk, int bytes);
 	void			(*sk_write_space)(struct sock *sk);
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index d32e931..0a7d335 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -345,6 +345,39 @@ struct mem_cgroup {
 	spinlock_t pcp_counter_lock;
 };
 
+/* Writing them here to avoid exposing memcg's inner layout */
+#ifdef CONFIG_CGROUP_MEM_RES_CTLR_KMEM
+#ifdef CONFIG_INET
+#include <net/sock.h>
+
+void sock_update_memcg(struct sock *sk)
+{
+	/* right now a socket spends its whole life in the same cgroup */
+	BUG_ON(sk->sk_cgrp);
+
+	rcu_read_lock();
+	sk->sk_cgrp = mem_cgroup_from_task(current);
+
+	/*
+	 * We don't need to protect against anything task-related, because
+	 * we are basically stuck with the sock pointer that won't change,
+	 * even if the task that originated the socket changes cgroups.
+	 *
+	 * What we do have to guarantee is that the chain leading us to
+	 * the top level won't change under our noses. Incrementing the
+	 * reference count via cgroup_exclude_rmdir guarantees that.
+	 */
+	cgroup_exclude_rmdir(mem_cgroup_css(sk->sk_cgrp));
+	rcu_read_unlock();
+}
+
+void sock_release_memcg(struct sock *sk)
+{
+	cgroup_release_and_wakeup_rmdir(mem_cgroup_css(sk->sk_cgrp));
+}
+#endif /* CONFIG_INET */
+#endif /* CONFIG_CGROUP_MEM_RES_CTLR_KMEM */
+
 /* Stuffs for move charges at task migration. */
 /*
  * Types of charges to be moved. "move_charge_at_immitgrate" is treated as a
diff --git a/net/core/sock.c b/net/core/sock.c
index 3449df8..54ec8ac 100644
--- a/net/core/sock.c
+++ b/net/core/sock.c
@@ -125,6 +125,7 @@
 #include <net/xfrm.h>
 #include <linux/ipsec.h>
 #include <net/cls_cgroup.h>
+#include <linux/memcontrol.h>
 
 #include <linux/filter.h>
 
@@ -1139,6 +1140,7 @@ struct sock *sk_alloc(struct net *net, int family, gfp_t priority,
 		atomic_set(&sk->sk_wmem_alloc, 1);
 
 		sock_update_classid(sk);
+		sock_update_memcg(sk);
 	}
 
 	return sk;
@@ -1170,6 +1172,7 @@ static void __sk_free(struct sock *sk)
 		put_cred(sk->sk_peer_cred);
 	put_pid(sk->sk_peer_pid);
 	put_net(sock_net(sk));
+	sock_release_memcg(sk);
 	sk_prot_free(sk->sk_prot_creator, sk);
 }
 
-- 
1.7.6



* [PATCH v3 3/7] foundations of per-cgroup memory pressure controlling.
From: Glauber Costa @ 2011-09-19  0:56 UTC (permalink / raw)
  To: linux-kernel
  Cc: paul, lizf, kamezawa.hiroyu, ebiederm, davem, gthelen, netdev,
	linux-mm, kirill, Glauber Costa

This patch converts struct sock fields memory_pressure,
memory_allocated, sockets_allocated, and sysctl_mem (now prot_mem)
to function pointers, receiving a struct mem_cgroup parameter.

enter_memory_pressure is kept the same, since all its callers
have a socket context, and the kmem cgroup can be derived from
the socket itself.

To keep things working, the patch converts all users of those fields
to use accessor functions.
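
A minimal sketch of the accessor pattern being introduced: each
protocol publishes its counters through function pointers that take the
group as a parameter, so a protocol without per-group tracking can
simply keep returning one global, as the af_alg conversion in the diff
below does. Names here are illustrative, not the patch's exact
declarations:

#include <stdio.h>

struct group;			/* stand-in for struct mem_cgroup */

static long global_allocated;	/* the old single global counter */

struct proto_sketch {
	/* returns the allocation counter to use for group @g */
	long *(*memory_allocated)(struct group *g);
};

/* protocols that don't do per-group tracking just ignore @g */
static long *memory_allocated_global(struct group *g)
{
	(void)g;
	return &global_allocated;
}

static struct proto_sketch alg_like = {
	.memory_allocated = memory_allocated_global,
};

int main(void)
{
	*alg_like.memory_allocated(NULL) += 42;
	printf("allocated: %ld\n", global_allocated);
	return 0;
}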

In my benchmarks I didn't see a significant performance difference
with this patch applied compared to a baseline (around 1% diff, thus
inside the error margin).

Signed-off-by: Glauber Costa <glommer@parallels.com>
CC: David S. Miller <davem@davemloft.net>
CC: Hiroyuki Kamezawa <kamezawa.hiroyu@jp.fujitsu.com>
CC: Eric W. Biederman <ebiederm@xmission.com>
---
 crypto/af_alg.c             |    7 ++-
 include/linux/memcontrol.h  |   29 +++++++++++-
 include/net/sock.h          |  112 +++++++++++++++++++++++++++++++++++++++++--
 include/net/tcp.h           |   11 +++--
 include/net/udp.h           |    3 +-
 include/trace/events/sock.h |   10 ++--
 mm/memcontrol.c             |   45 +++++++++++++++++-
 net/core/sock.c             |   62 ++++++++++++++----------
 net/decnet/af_decnet.c      |   21 +++++++-
 net/ipv4/proc.c             |    7 ++-
 net/ipv4/tcp.c              |   27 +++++++++-
 net/ipv4/tcp_input.c        |   12 ++--
 net/ipv4/tcp_ipv4.c         |   12 ++--
 net/ipv4/tcp_output.c       |    2 +-
 net/ipv4/tcp_timer.c        |    2 +-
 net/ipv4/udp.c              |   20 ++++++--
 net/ipv6/tcp_ipv6.c         |   10 ++--
 net/ipv6/udp.c              |    4 +-
 net/sctp/socket.c           |   35 ++++++++++---
 19 files changed, 345 insertions(+), 86 deletions(-)

diff --git a/crypto/af_alg.c b/crypto/af_alg.c
index ac33d5f..c21351c 100644
--- a/crypto/af_alg.c
+++ b/crypto/af_alg.c
@@ -29,10 +29,15 @@ struct alg_type_list {
 
 static atomic_long_t alg_memory_allocated;
 
+static atomic_long_t *memory_allocated_alg(struct mem_cgroup *sg)
+{
+	return &alg_memory_allocated;
+}
+
 static struct proto alg_proto = {
 	.name			= "ALG",
 	.owner			= THIS_MODULE,
-	.memory_allocated	= &alg_memory_allocated,
+	.memory_allocated	= memory_allocated_alg,
 	.obj_size		= sizeof(struct alg_sock),
 };
 
diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index 2cb9226..1744ae8 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -380,6 +380,10 @@ static inline
 void mem_cgroup_count_vm_event(struct mm_struct *mm, enum vm_event_item idx)
 {
 }
+static inline struct mem_cgroup *mem_cgroup_from_task(struct task_struct *p)
+{
+	return NULL;
+}
 #endif /* CONFIG_CGROUP_MEM_CONT */
 
 #if !defined(CONFIG_CGROUP_MEM_RES_CTLR) || !defined(CONFIG_DEBUG_VM)
@@ -397,11 +401,34 @@ mem_cgroup_print_bad_page(struct page *page)
 
 #ifdef CONFIG_INET
 struct sock;
+struct proto;
 #ifdef CONFIG_CGROUP_MEM_RES_CTLR_KMEM
 void sock_update_memcg(struct sock *sk);
 void sock_release_memcg(struct sock *sk);
-
+void memcg_sock_mem_alloc(struct mem_cgroup *mem, struct proto *prot,
+			  int amt, int *parent_failure);
+void memcg_sock_mem_free(struct mem_cgroup *mem, struct proto *prot, int amt);
+void memcg_sockets_allocated_dec(struct mem_cgroup *mem, struct proto *prot);
+void memcg_sockets_allocated_inc(struct mem_cgroup *mem, struct proto *prot);
 #else
+/* memcontrol includes sockets.h, that includes memcontrol.h ... */
+static inline void memcg_sock_mem_alloc(struct mem_cgroup *mem,
+					struct proto *prot, int amt,
+					int *parent_failure)
+{
+}
+static inline void memcg_sock_mem_free(struct mem_cgroup *mem,
+				       struct proto *prot, int amt)
+{
+}
+static inline void memcg_sockets_allocated_dec(struct mem_cgroup *mem,
+					       struct proto *prot)
+{
+}
+static inline void memcg_sockets_allocated_inc(struct mem_cgroup *mem,
+					       struct proto *prot)
+{
+}
 static inline void sock_update_memcg(struct sock *sk)
 {
 }
diff --git a/include/net/sock.h b/include/net/sock.h
index afe1467..78832f9 100644
--- a/include/net/sock.h
+++ b/include/net/sock.h
@@ -54,6 +54,7 @@
 #include <linux/security.h>
 #include <linux/slab.h>
 #include <linux/uaccess.h>
+#include <linux/cgroup.h>
 
 #include <linux/filter.h>
 #include <linux/rculist_nulls.h>
@@ -168,6 +169,8 @@ struct sock_common {
 	/* public: */
 };
 
+struct mem_cgroup;
+
 /**
   *	struct sock - network layer representation of sockets
   *	@__sk_common: shared layout with inet_timewait_sock
@@ -786,18 +789,32 @@ struct proto {
 	unsigned int		inuse_idx;
 #endif
 
+	/*
+	 * per-cgroup memory tracking:
+	 *
+	 * The following functions track memory consumption of network buffers
+	 * by cgroup (kmem_cgroup) for the current protocol. As with the rest
+	 * of the fields in this structure, not all protocols are required
+	 * to implement them. Protocols that don't want to do per-cgroup
+	 * memory pressure management, can just assume the root cgroup is used.
+	 *
+	 */
 	/* Memory pressure */
 	void			(*enter_memory_pressure)(struct sock *sk);
-	atomic_long_t		*memory_allocated;	/* Current allocated memory. */
-	struct percpu_counter	*sockets_allocated;	/* Current number of sockets. */
+	/* Pointer to the current memory allocation of this cgroup. */
+	atomic_long_t		*(*memory_allocated)(struct mem_cgroup *sg);
+	/* Pointer to the current number of sockets in this cgroup. */
+	struct percpu_counter	*(*sockets_allocated)(struct mem_cgroup *sg);
 	/*
-	 * Pressure flag: try to collapse.
+	 * Per cgroup pointer to the pressure flag: try to collapse.
 	 * Technical note: it is used by multiple contexts non atomically.
 	 * All the __sk_mem_schedule() is of this nature: accounting
 	 * is strict, actions are advisory and have some latency.
 	 */
-	int			*memory_pressure;
-	long			*sysctl_mem;
+	int			*(*memory_pressure)(struct mem_cgroup *sg);
+	/* Pointer to the per-cgroup version of the sysctl_mem field */
+	long			*(*prot_mem)(struct mem_cgroup *sg);
+
 	int			*sysctl_wmem;
 	int			*sysctl_rmem;
 	int			max_header;
@@ -856,6 +873,91 @@ static inline void sk_refcnt_debug_release(const struct sock *sk)
 #define sk_refcnt_debug_release(sk) do { } while (0)
 #endif /* SOCK_REFCNT_DEBUG */
 
+#include <linux/memcontrol.h>
+static inline int *sk_memory_pressure(struct sock *sk)
+{
+	int *ret = NULL;
+	if (sk->sk_prot->memory_pressure)
+		ret = sk->sk_prot->memory_pressure(sk->sk_cgrp);
+	return ret;
+}
+
+static inline long sk_prot_mem(struct sock *sk, int index)
+{
+	long *prot = sk->sk_prot->prot_mem(sk->sk_cgrp);
+	return prot[index];
+}
+
+static inline long
+sk_memory_allocated(struct sock *sk)
+{
+	struct proto *prot = sk->sk_prot;
+	struct mem_cgroup *cg = sk->sk_cgrp;
+
+	return atomic_long_read(prot->memory_allocated(cg));
+}
+
+static inline long
+sk_memory_allocated_add(struct sock *sk, int amt, int *parent_failure)
+{
+	struct proto *prot = sk->sk_prot;
+	struct mem_cgroup *cg = sk->sk_cgrp;
+	long allocated;
+
+	allocated = atomic_long_add_return(amt, prot->memory_allocated(cg));
+	memcg_sock_mem_alloc(cg, prot, amt, parent_failure);
+	return allocated;
+}
+
+static inline void
+sk_memory_allocated_sub(struct sock *sk, int amt)
+{
+	struct proto *prot = sk->sk_prot;
+	struct mem_cgroup *cg = sk->sk_cgrp;
+
+	atomic_long_sub(amt, prot->memory_allocated(cg));
+	memcg_sock_mem_free(cg, prot, amt);
+}
+
+static inline void sk_sockets_allocated_dec(struct sock *sk)
+{
+	struct proto *prot = sk->sk_prot;
+	struct mem_cgroup *cg = sk->sk_cgrp;
+
+	percpu_counter_dec(prot->sockets_allocated(cg));
+	memcg_sockets_allocated_dec(cg, prot);
+}
+
+static inline void sk_sockets_allocated_inc(struct sock *sk)
+{
+	struct proto *prot = sk->sk_prot;
+	struct mem_cgroup *cg = sk->sk_cgrp;
+
+	percpu_counter_inc(prot->sockets_allocated(cg));
+	memcg_sockets_allocated_inc(cg, prot);
+}
+
+static inline int
+sk_sockets_allocated_read_positive(struct sock *sk)
+{
+	struct proto *prot = sk->sk_prot;
+	struct mem_cgroup *cg = sk->sk_cgrp;
+
+	return percpu_counter_sum_positive(prot->sockets_allocated(cg));
+}
+
+static inline int
+kcg_sockets_allocated_sum_positive(struct proto *prot, struct mem_cgroup *cg)
+{
+	return percpu_counter_sum_positive(prot->sockets_allocated(cg));
+}
+
+static inline long
+kcg_memory_allocated(struct proto *prot, struct mem_cgroup *cg)
+{
+	return atomic_long_read(prot->memory_allocated(cg));
+}
+
 
 #ifdef CONFIG_PROC_FS
 /* Called with local bh disabled */
diff --git a/include/net/tcp.h b/include/net/tcp.h
index 149a415..c835ae3 100644
--- a/include/net/tcp.h
+++ b/include/net/tcp.h
@@ -45,6 +45,7 @@
 #include <net/dst.h>
 
 #include <linux/seq_file.h>
+#include <linux/memcontrol.h>
 
 extern struct inet_hashinfo tcp_hashinfo;
 
@@ -253,9 +254,11 @@ extern int sysctl_tcp_cookie_size;
 extern int sysctl_tcp_thin_linear_timeouts;
 extern int sysctl_tcp_thin_dupack;
 
-extern atomic_long_t tcp_memory_allocated;
-extern struct percpu_counter tcp_sockets_allocated;
-extern int tcp_memory_pressure;
+struct mem_cgroup;
+extern long *tcp_sysctl_mem(struct mem_cgroup *sg);
+struct percpu_counter *sockets_allocated_tcp(struct mem_cgroup *sg);
+int *memory_pressure_tcp(struct mem_cgroup *sg);
+atomic_long_t *memory_allocated_tcp(struct mem_cgroup *sg);
 
 /*
  * The next routines deal with comparing 32 bit unsigned ints
@@ -286,7 +289,7 @@ static inline bool tcp_too_many_orphans(struct sock *sk, int shift)
 	}
 
 	if (sk->sk_wmem_queued > SOCK_MIN_SNDBUF &&
-	    atomic_long_read(&tcp_memory_allocated) > sysctl_tcp_mem[2])
+	    sk_memory_allocated(sk) > sk_prot_mem(sk, 2))
 		return true;
 	return false;
 }
diff --git a/include/net/udp.h b/include/net/udp.h
index 67ea6fc..6ed4cc6 100644
--- a/include/net/udp.h
+++ b/include/net/udp.h
@@ -105,7 +105,8 @@ static inline struct udp_hslot *udp_hashslot2(struct udp_table *table,
 
 extern struct proto udp_prot;
 
-extern atomic_long_t udp_memory_allocated;
+atomic_long_t *memory_allocated_udp(struct mem_cgroup *sg);
+long *udp_sysctl_mem(struct mem_cgroup *sg);
 
 /* sysctl variables for udp */
 extern long sysctl_udp_mem[3];
diff --git a/include/trace/events/sock.h b/include/trace/events/sock.h
index 779abb9..12a6083 100644
--- a/include/trace/events/sock.h
+++ b/include/trace/events/sock.h
@@ -37,7 +37,7 @@ TRACE_EVENT(sock_exceed_buf_limit,
 
 	TP_STRUCT__entry(
 		__array(char, name, 32)
-		__field(long *, sysctl_mem)
+		__field(long *, prot_mem)
 		__field(long, allocated)
 		__field(int, sysctl_rmem)
 		__field(int, rmem_alloc)
@@ -45,7 +45,7 @@ TRACE_EVENT(sock_exceed_buf_limit,
 
 	TP_fast_assign(
 		strncpy(__entry->name, prot->name, 32);
-		__entry->sysctl_mem = prot->sysctl_mem;
+		__entry->prot_mem = sk->sk_prot->prot_mem(sk->sk_cgrp);
 		__entry->allocated = allocated;
 		__entry->sysctl_rmem = prot->sysctl_rmem[0];
 		__entry->rmem_alloc = atomic_read(&sk->sk_rmem_alloc);
@@ -54,9 +54,9 @@ TRACE_EVENT(sock_exceed_buf_limit,
 	TP_printk("proto:%s sysctl_mem=%ld,%ld,%ld allocated=%ld "
 		"sysctl_rmem=%d rmem_alloc=%d",
 		__entry->name,
-		__entry->sysctl_mem[0],
-		__entry->sysctl_mem[1],
-		__entry->sysctl_mem[2],
+		__entry->prot_mem[0],
+		__entry->prot_mem[1],
+		__entry->prot_mem[2],
 		__entry->allocated,
 		__entry->sysctl_rmem,
 		__entry->rmem_alloc)
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 0a7d335..03d6d61 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -345,6 +345,7 @@ struct mem_cgroup {
 	spinlock_t pcp_counter_lock;
 };
 
+static struct mem_cgroup *parent_mem_cgroup(struct mem_cgroup *mem);
 /* Writing them here to avoid exposing memcg's inner layout */
 #ifdef CONFIG_CGROUP_MEM_RES_CTLR_KMEM
 #ifdef CONFIG_INET
@@ -375,6 +376,49 @@ void sock_release_memcg(struct sock *sk)
 {
 	cgroup_release_and_wakeup_rmdir(mem_cgroup_css(sk->sk_cgrp));
 }
+
+void memcg_sock_mem_alloc(struct mem_cgroup *mem, struct proto *prot,
+			  int amt, int *parent_failure)
+{
+	mem = parent_mem_cgroup(mem);
+	for (; mem != NULL; mem = parent_mem_cgroup(mem)) {
+		long alloc;
+		long *prot_mem = prot->prot_mem(mem);
+		/*
+		 * Large nestings are not the common case, and stopping in the
+		 * middle would be complicated enough, that we bill it all the
+		 * way through the root, and if needed, unbill everything later
+		 */
+		alloc = atomic_long_add_return(amt,
+					       prot->memory_allocated(mem));
+		*parent_failure |= (alloc > prot_mem[2]);
+	}
+}
+EXPORT_SYMBOL(memcg_sock_mem_alloc);
+
+void memcg_sock_mem_free(struct mem_cgroup *mem, struct proto *prot, int amt)
+{
+	mem = parent_mem_cgroup(mem);
+	for (; mem != NULL; mem = parent_mem_cgroup(mem))
+		atomic_long_sub(amt, prot->memory_allocated(mem));
+}
+EXPORT_SYMBOL(memcg_sock_mem_free);
+
+void memcg_sockets_allocated_dec(struct mem_cgroup *mem, struct proto *prot)
+{
+	mem = parent_mem_cgroup(mem);
+	for (; mem; mem = parent_mem_cgroup(mem))
+		percpu_counter_dec(prot->sockets_allocated(mem));
+}
+EXPORT_SYMBOL(memcg_sockets_allocated_dec);
+
+void memcg_sockets_allocated_inc(struct mem_cgroup *mem, struct proto *prot)
+{
+	mem = parent_mem_cgroup(mem);
+	for (; mem; mem = parent_mem_cgroup(mem))
+		percpu_counter_inc(prot->sockets_allocated(mem));
+}
+EXPORT_SYMBOL(memcg_sockets_allocated_inc);
 #endif /* CONFIG_INET */
 #endif /* CONFIG_CGROUP_MEM_RES_CTLR_KMEM */
 
@@ -460,7 +504,6 @@ enum mem_type {
 
 static void mem_cgroup_get(struct mem_cgroup *mem);
 static void mem_cgroup_put(struct mem_cgroup *mem);
-static struct mem_cgroup *parent_mem_cgroup(struct mem_cgroup *mem);
 static void drain_all_stock_async(struct mem_cgroup *mem);
 
 static struct mem_cgroup_per_zone *
diff --git a/net/core/sock.c b/net/core/sock.c
index 54ec8ac..338d572 100644
--- a/net/core/sock.c
+++ b/net/core/sock.c
@@ -1291,7 +1291,7 @@ struct sock *sk_clone(const struct sock *sk, const gfp_t priority)
 		newsk->sk_wq = NULL;
 
 		if (newsk->sk_prot->sockets_allocated)
-			percpu_counter_inc(newsk->sk_prot->sockets_allocated);
+			sk_sockets_allocated_inc(newsk);
 
 		if (sock_flag(newsk, SOCK_TIMESTAMP) ||
 		    sock_flag(newsk, SOCK_TIMESTAMPING_RX_SOFTWARE))
@@ -1682,30 +1682,33 @@ int __sk_mem_schedule(struct sock *sk, int size, int kind)
 	struct proto *prot = sk->sk_prot;
 	int amt = sk_mem_pages(size);
 	long allocated;
+	int *memory_pressure;
+	int parent_failure = 0;
 
 	sk->sk_forward_alloc += amt * SK_MEM_QUANTUM;
-	allocated = atomic_long_add_return(amt, prot->memory_allocated);
+
+	memory_pressure = sk_memory_pressure(sk);
+	allocated = sk_memory_allocated_add(sk, amt, &parent_failure);
+
+	/* Over hard limit (we, or our parents) */
+	if (parent_failure || (allocated > sk_prot_mem(sk, 2)))
+		goto suppress_allocation;
 
 	/* Under limit. */
-	if (allocated <= prot->sysctl_mem[0]) {
-		if (prot->memory_pressure && *prot->memory_pressure)
-			*prot->memory_pressure = 0;
-		return 1;
-	}
+	if (allocated <= sk_prot_mem(sk, 0))
+		if (memory_pressure && *memory_pressure)
+			*memory_pressure = 0;
 
 	/* Under pressure. */
-	if (allocated > prot->sysctl_mem[1])
+	if (allocated > sk_prot_mem(sk, 1))
 		if (prot->enter_memory_pressure)
 			prot->enter_memory_pressure(sk);
 
-	/* Over hard limit. */
-	if (allocated > prot->sysctl_mem[2])
-		goto suppress_allocation;
-
 	/* guarantee minimum buffer size under pressure */
 	if (kind == SK_MEM_RECV) {
 		if (atomic_read(&sk->sk_rmem_alloc) < prot->sysctl_rmem[0])
 			return 1;
+
 	} else { /* SK_MEM_SEND */
 		if (sk->sk_type == SOCK_STREAM) {
 			if (sk->sk_wmem_queued < prot->sysctl_wmem[0])
@@ -1715,13 +1718,13 @@ int __sk_mem_schedule(struct sock *sk, int size, int kind)
 				return 1;
 	}
 
-	if (prot->memory_pressure) {
+	if (memory_pressure) {
 		int alloc;
 
-		if (!*prot->memory_pressure)
+		if (!*memory_pressure)
 			return 1;
-		alloc = percpu_counter_read_positive(prot->sockets_allocated);
-		if (prot->sysctl_mem[2] > alloc *
+		alloc = sk_sockets_allocated_read_positive(sk);
+		if (sk_prot_mem(sk, 2) > alloc *
 		    sk_mem_pages(sk->sk_wmem_queued +
 				 atomic_read(&sk->sk_rmem_alloc) +
 				 sk->sk_forward_alloc))
@@ -1744,7 +1747,9 @@ suppress_allocation:
 
 	/* Alas. Undo changes. */
 	sk->sk_forward_alloc -= amt * SK_MEM_QUANTUM;
-	atomic_long_sub(amt, prot->memory_allocated);
+
+	sk_memory_allocated_sub(sk, amt);
+
 	return 0;
 }
 EXPORT_SYMBOL(__sk_mem_schedule);
@@ -1755,15 +1760,15 @@ EXPORT_SYMBOL(__sk_mem_schedule);
  */
 void __sk_mem_reclaim(struct sock *sk)
 {
-	struct proto *prot = sk->sk_prot;
+	int *memory_pressure = sk_memory_pressure(sk);
 
-	atomic_long_sub(sk->sk_forward_alloc >> SK_MEM_QUANTUM_SHIFT,
-		   prot->memory_allocated);
+	sk_memory_allocated_sub(sk,
+				sk->sk_forward_alloc >> SK_MEM_QUANTUM_SHIFT);
 	sk->sk_forward_alloc &= SK_MEM_QUANTUM - 1;
 
-	if (prot->memory_pressure && *prot->memory_pressure &&
-	    (atomic_long_read(prot->memory_allocated) < prot->sysctl_mem[0]))
-		*prot->memory_pressure = 0;
+	if (memory_pressure && *memory_pressure &&
+	    (sk_memory_allocated(sk) < sk_prot_mem(sk, 0)))
+		*memory_pressure = 0;
 }
 EXPORT_SYMBOL(__sk_mem_reclaim);
 
@@ -2482,13 +2487,20 @@ static char proto_method_implemented(const void *method)
 
 static void proto_seq_printf(struct seq_file *seq, struct proto *proto)
 {
+	struct mem_cgroup *cg = mem_cgroup_from_task(current);
+	int *memory_pressure = NULL;
+
+	if (proto->memory_pressure)
+		memory_pressure = proto->memory_pressure(cg);
+
 	seq_printf(seq, "%-9s %4u %6d  %6ld   %-3s %6u   %-3s  %-10s "
 			"%2c %2c %2c %2c %2c %2c %2c %2c %2c %2c %2c %2c %2c %2c %2c %2c %2c %2c %2c\n",
 		   proto->name,
 		   proto->obj_size,
 		   sock_prot_inuse_get(seq_file_net(seq), proto),
-		   proto->memory_allocated != NULL ? atomic_long_read(proto->memory_allocated) : -1L,
-		   proto->memory_pressure != NULL ? *proto->memory_pressure ? "yes" : "no" : "NI",
+		   proto->memory_allocated != NULL ?
+			kcg_memory_allocated(proto, cg) : -1L,
+		   memory_pressure != NULL ? *memory_pressure ? "yes" : "no" : "NI",
 		   proto->max_header,
 		   proto->slab == NULL ? "no" : "yes",
 		   module_name(proto->owner),
diff --git a/net/decnet/af_decnet.c b/net/decnet/af_decnet.c
index 19acd00..4752118 100644
--- a/net/decnet/af_decnet.c
+++ b/net/decnet/af_decnet.c
@@ -458,13 +458,28 @@ static void dn_enter_memory_pressure(struct sock *sk)
 	}
 }
 
+static atomic_long_t *memory_allocated_dn(struct mem_cgroup *sg)
+{
+	return &decnet_memory_allocated;
+}
+
+static int *memory_pressure_dn(struct mem_cgroup *sg)
+{
+	return &dn_memory_pressure;
+}
+
+static long *dn_sysctl_mem(struct mem_cgroup *sg)
+{
+	return sysctl_decnet_mem;
+}
+
 static struct proto dn_proto = {
 	.name			= "NSP",
 	.owner			= THIS_MODULE,
 	.enter_memory_pressure	= dn_enter_memory_pressure,
-	.memory_pressure	= &dn_memory_pressure,
-	.memory_allocated	= &decnet_memory_allocated,
-	.sysctl_mem		= sysctl_decnet_mem,
+	.memory_pressure	= memory_pressure_dn,
+	.memory_allocated	= memory_allocated_dn,
+	.prot_mem		= dn_sysctl_mem,
 	.sysctl_wmem		= sysctl_decnet_wmem,
 	.sysctl_rmem		= sysctl_decnet_rmem,
 	.max_header		= DN_MAX_NSP_DATA_HEADER + 64,
diff --git a/net/ipv4/proc.c b/net/ipv4/proc.c
index b14ec7d..ba56702 100644
--- a/net/ipv4/proc.c
+++ b/net/ipv4/proc.c
@@ -52,20 +52,21 @@ static int sockstat_seq_show(struct seq_file *seq, void *v)
 {
 	struct net *net = seq->private;
 	int orphans, sockets;
+	struct mem_cgroup *cg = mem_cgroup_from_task(current);
 
 	local_bh_disable();
 	orphans = percpu_counter_sum_positive(&tcp_orphan_count);
-	sockets = percpu_counter_sum_positive(&tcp_sockets_allocated);
+	sockets = kcg_sockets_allocated_sum_positive(&tcp_prot, cg);
 	local_bh_enable();
 
 	socket_seq_show(seq);
 	seq_printf(seq, "TCP: inuse %d orphan %d tw %d alloc %d mem %ld\n",
 		   sock_prot_inuse_get(net, &tcp_prot), orphans,
 		   tcp_death_row.tw_count, sockets,
-		   atomic_long_read(&tcp_memory_allocated));
+		   kcg_memory_allocated(&tcp_prot, cg));
 	seq_printf(seq, "UDP: inuse %d mem %ld\n",
 		   sock_prot_inuse_get(net, &udp_prot),
-		   atomic_long_read(&udp_memory_allocated));
+		   kcg_memory_allocated(&udp_prot, cg));
 	seq_printf(seq, "UDPLITE: inuse %d\n",
 		   sock_prot_inuse_get(net, &udplite_prot));
 	seq_printf(seq, "RAW: inuse %d\n",
diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
index 46febca..452245f 100644
--- a/net/ipv4/tcp.c
+++ b/net/ipv4/tcp.c
@@ -291,13 +291,11 @@ EXPORT_SYMBOL(sysctl_tcp_rmem);
 EXPORT_SYMBOL(sysctl_tcp_wmem);
 
 atomic_long_t tcp_memory_allocated;	/* Current allocated memory. */
-EXPORT_SYMBOL(tcp_memory_allocated);
 
 /*
  * Current number of TCP sockets.
  */
 struct percpu_counter tcp_sockets_allocated;
-EXPORT_SYMBOL(tcp_sockets_allocated);
 
 /*
  * TCP splice context
@@ -315,7 +313,18 @@ struct tcp_splice_state {
  * is strict, actions are advisory and have some latency.
  */
 int tcp_memory_pressure __read_mostly;
-EXPORT_SYMBOL(tcp_memory_pressure);
+
+int *memory_pressure_tcp(struct mem_cgroup *sg)
+{
+	return &tcp_memory_pressure;
+}
+EXPORT_SYMBOL(memory_pressure_tcp);
+
+struct percpu_counter *sockets_allocated_tcp(struct mem_cgroup *sg)
+{
+	return &tcp_sockets_allocated;
+}
+EXPORT_SYMBOL(sockets_allocated_tcp);
 
 void tcp_enter_memory_pressure(struct sock *sk)
 {
@@ -326,6 +335,18 @@ void tcp_enter_memory_pressure(struct sock *sk)
 }
 EXPORT_SYMBOL(tcp_enter_memory_pressure);
 
+long *tcp_sysctl_mem(struct mem_cgroup *sg)
+{
+	return sysctl_tcp_mem;
+}
+EXPORT_SYMBOL(tcp_sysctl_mem);
+
+atomic_long_t *memory_allocated_tcp(struct mem_cgroup *sg)
+{
+	return &tcp_memory_allocated;
+}
+EXPORT_SYMBOL(memory_allocated_tcp);
+
 /* Convert seconds to retransmits based on initial and max timeout */
 static u8 secs_to_retrans(int seconds, int timeout, int rto_max)
 {
diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
index ea0d218..3f17423 100644
--- a/net/ipv4/tcp_input.c
+++ b/net/ipv4/tcp_input.c
@@ -316,7 +316,7 @@ static void tcp_grow_window(struct sock *sk, struct sk_buff *skb)
 	/* Check #1 */
 	if (tp->rcv_ssthresh < tp->window_clamp &&
 	    (int)tp->rcv_ssthresh < tcp_space(sk) &&
-	    !tcp_memory_pressure) {
+	    !sk_memory_pressure(sk)) {
 		int incr;
 
 		/* Check #2. Increase window, if skb with such overhead
@@ -398,8 +398,8 @@ static void tcp_clamp_window(struct sock *sk)
 
 	if (sk->sk_rcvbuf < sysctl_tcp_rmem[2] &&
 	    !(sk->sk_userlocks & SOCK_RCVBUF_LOCK) &&
-	    !tcp_memory_pressure &&
-	    atomic_long_read(&tcp_memory_allocated) < sysctl_tcp_mem[0]) {
+	    !sk_memory_pressure(sk) &&
+	    sk_memory_allocated(sk) < sk_prot_mem(sk, 0)) {
 		sk->sk_rcvbuf = min(atomic_read(&sk->sk_rmem_alloc),
 				    sysctl_tcp_rmem[2]);
 	}
@@ -4806,7 +4806,7 @@ static int tcp_prune_queue(struct sock *sk)
 
 	if (atomic_read(&sk->sk_rmem_alloc) >= sk->sk_rcvbuf)
 		tcp_clamp_window(sk);
-	else if (tcp_memory_pressure)
+	else if (sk_memory_pressure(sk))
 		tp->rcv_ssthresh = min(tp->rcv_ssthresh, 4U * tp->advmss);
 
 	tcp_collapse_ofo_queue(sk);
@@ -4872,11 +4872,11 @@ static int tcp_should_expand_sndbuf(struct sock *sk)
 		return 0;
 
 	/* If we are under global TCP memory pressure, do not expand.  */
-	if (tcp_memory_pressure)
+	if (sk_memory_pressure(sk))
 		return 0;
 
 	/* If we are under soft global TCP memory pressure, do not expand.  */
-	if (atomic_long_read(&tcp_memory_allocated) >= sysctl_tcp_mem[0])
+	if (sk_memory_allocated(sk) >= sk_prot_mem(sk, 0))
 		return 0;
 
 	/* If we filled the congestion window, do not expand.  */
diff --git a/net/ipv4/tcp_ipv4.c b/net/ipv4/tcp_ipv4.c
index 1c12b8e..cbb0d5e 100644
--- a/net/ipv4/tcp_ipv4.c
+++ b/net/ipv4/tcp_ipv4.c
@@ -1901,7 +1901,7 @@ static int tcp_v4_init_sock(struct sock *sk)
 	sk->sk_rcvbuf = sysctl_tcp_rmem[1];
 
 	local_bh_disable();
-	percpu_counter_inc(&tcp_sockets_allocated);
+	sk_sockets_allocated_inc(sk);
 	local_bh_enable();
 
 	return 0;
@@ -1957,7 +1957,7 @@ void tcp_v4_destroy_sock(struct sock *sk)
 		tp->cookie_values = NULL;
 	}
 
-	percpu_counter_dec(&tcp_sockets_allocated);
+	sk_sockets_allocated_dec(sk);
 }
 EXPORT_SYMBOL(tcp_v4_destroy_sock);
 
@@ -2598,11 +2598,11 @@ struct proto tcp_prot = {
 	.unhash			= inet_unhash,
 	.get_port		= inet_csk_get_port,
 	.enter_memory_pressure	= tcp_enter_memory_pressure,
-	.sockets_allocated	= &tcp_sockets_allocated,
+	.memory_pressure	= memory_pressure_tcp,
+	.sockets_allocated	= sockets_allocated_tcp,
 	.orphan_count		= &tcp_orphan_count,
-	.memory_allocated	= &tcp_memory_allocated,
-	.memory_pressure	= &tcp_memory_pressure,
-	.sysctl_mem		= sysctl_tcp_mem,
+	.memory_allocated	= memory_allocated_tcp,
+	.prot_mem		= tcp_sysctl_mem,
 	.sysctl_wmem		= sysctl_tcp_wmem,
 	.sysctl_rmem		= sysctl_tcp_rmem,
 	.max_header		= MAX_TCP_HEADER,
diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c
index 882e0b0..06aeb31 100644
--- a/net/ipv4/tcp_output.c
+++ b/net/ipv4/tcp_output.c
@@ -1912,7 +1912,7 @@ u32 __tcp_select_window(struct sock *sk)
 	if (free_space < (full_space >> 1)) {
 		icsk->icsk_ack.quick = 0;
 
-		if (tcp_memory_pressure)
+		if (sk_memory_pressure(sk))
 			tp->rcv_ssthresh = min(tp->rcv_ssthresh,
 					       4U * tp->advmss);
 
diff --git a/net/ipv4/tcp_timer.c b/net/ipv4/tcp_timer.c
index ecd44b0..2c67617 100644
--- a/net/ipv4/tcp_timer.c
+++ b/net/ipv4/tcp_timer.c
@@ -261,7 +261,7 @@ static void tcp_delack_timer(unsigned long data)
 	}
 
 out:
-	if (tcp_memory_pressure)
+	if (sk_memory_pressure(sk))
 		sk_mem_reclaim(sk);
 out_unlock:
 	bh_unlock_sock(sk);
diff --git a/net/ipv4/udp.c b/net/ipv4/udp.c
index 1b5a193..cc7627b 100644
--- a/net/ipv4/udp.c
+++ b/net/ipv4/udp.c
@@ -120,9 +120,6 @@ EXPORT_SYMBOL(sysctl_udp_rmem_min);
 int sysctl_udp_wmem_min __read_mostly;
 EXPORT_SYMBOL(sysctl_udp_wmem_min);
 
-atomic_long_t udp_memory_allocated;
-EXPORT_SYMBOL(udp_memory_allocated);
-
 #define MAX_UDP_PORTS 65536
 #define PORTS_PER_CHAIN (MAX_UDP_PORTS / UDP_HTABLE_SIZE_MIN)
 
@@ -1918,6 +1915,19 @@ unsigned int udp_poll(struct file *file, struct socket *sock, poll_table *wait)
 }
 EXPORT_SYMBOL(udp_poll);
 
+static atomic_long_t udp_memory_allocated;
+atomic_long_t *memory_allocated_udp(struct mem_cgroup *sg)
+{
+	return &udp_memory_allocated;
+}
+EXPORT_SYMBOL(memory_allocated_udp);
+
+long *udp_sysctl_mem(struct mem_cgroup *sg)
+{
+	return sysctl_udp_mem;
+}
+EXPORT_SYMBOL(udp_sysctl_mem);
+
 struct proto udp_prot = {
 	.name		   = "UDP",
 	.owner		   = THIS_MODULE,
@@ -1936,8 +1946,8 @@ struct proto udp_prot = {
 	.unhash		   = udp_lib_unhash,
 	.rehash		   = udp_v4_rehash,
 	.get_port	   = udp_v4_get_port,
-	.memory_allocated  = &udp_memory_allocated,
-	.sysctl_mem	   = sysctl_udp_mem,
+	.memory_allocated  = memory_allocated_udp,
+	.prot_mem	   = udp_sysctl_mem,
 	.sysctl_wmem	   = &sysctl_udp_wmem_min,
 	.sysctl_rmem	   = &sysctl_udp_rmem_min,
 	.obj_size	   = sizeof(struct udp_sock),
diff --git a/net/ipv6/tcp_ipv6.c b/net/ipv6/tcp_ipv6.c
index d1fb63f..807797a 100644
--- a/net/ipv6/tcp_ipv6.c
+++ b/net/ipv6/tcp_ipv6.c
@@ -2012,7 +2012,7 @@ static int tcp_v6_init_sock(struct sock *sk)
 	sk->sk_rcvbuf = sysctl_tcp_rmem[1];
 
 	local_bh_disable();
-	percpu_counter_inc(&tcp_sockets_allocated);
+	sk_sockets_allocated_inc(sk);
 	local_bh_enable();
 
 	return 0;
@@ -2221,11 +2221,11 @@ struct proto tcpv6_prot = {
 	.unhash			= inet_unhash,
 	.get_port		= inet_csk_get_port,
 	.enter_memory_pressure	= tcp_enter_memory_pressure,
-	.sockets_allocated	= &tcp_sockets_allocated,
-	.memory_allocated	= &tcp_memory_allocated,
-	.memory_pressure	= &tcp_memory_pressure,
+	.sockets_allocated	= sockets_allocated_tcp,
+	.memory_allocated	= memory_allocated_tcp,
+	.memory_pressure	= memory_pressure_tcp,
 	.orphan_count		= &tcp_orphan_count,
-	.sysctl_mem		= sysctl_tcp_mem,
+	.prot_mem		= tcp_sysctl_mem,
 	.sysctl_wmem		= sysctl_tcp_wmem,
 	.sysctl_rmem		= sysctl_tcp_rmem,
 	.max_header		= MAX_TCP_HEADER,
diff --git a/net/ipv6/udp.c b/net/ipv6/udp.c
index 29213b5..ef4b5b3 100644
--- a/net/ipv6/udp.c
+++ b/net/ipv6/udp.c
@@ -1465,8 +1465,8 @@ struct proto udpv6_prot = {
 	.unhash		   = udp_lib_unhash,
 	.rehash		   = udp_v6_rehash,
 	.get_port	   = udp_v6_get_port,
-	.memory_allocated  = &udp_memory_allocated,
-	.sysctl_mem	   = sysctl_udp_mem,
+	.memory_allocated  = memory_allocated_udp,
+	.prot_mem	   = udp_sysctl_mem,
 	.sysctl_wmem	   = &sysctl_udp_wmem_min,
 	.sysctl_rmem	   = &sysctl_udp_rmem_min,
 	.obj_size	   = sizeof(struct udp6_sock),
diff --git a/net/sctp/socket.c b/net/sctp/socket.c
index 836aa63..0ddf561 100644
--- a/net/sctp/socket.c
+++ b/net/sctp/socket.c
@@ -119,11 +119,30 @@ static int sctp_memory_pressure;
 static atomic_long_t sctp_memory_allocated;
 struct percpu_counter sctp_sockets_allocated;
 
+static long *sctp_sysctl_mem(struct mem_cgroup *sg)
+{
+	return sysctl_sctp_mem;
+}
+
 static void sctp_enter_memory_pressure(struct sock *sk)
 {
 	sctp_memory_pressure = 1;
 }
 
+static int *memory_pressure_sctp(struct mem_cgroup *sg)
+{
+	return &sctp_memory_pressure;
+}
+
+static atomic_long_t *memory_allocated_sctp(struct mem_cgroup *sg)
+{
+	return &sctp_memory_allocated;
+}
+
+static struct percpu_counter *sockets_allocated_sctp(struct mem_cgroup *sg)
+{
+	return &sctp_sockets_allocated;
+}
 
 /* Get the sndbuf space available at the time on the association.  */
 static inline int sctp_wspace(struct sctp_association *asoc)
@@ -6831,13 +6850,13 @@ struct proto sctp_prot = {
 	.unhash      =	sctp_unhash,
 	.get_port    =	sctp_get_port,
 	.obj_size    =  sizeof(struct sctp_sock),
-	.sysctl_mem  =  sysctl_sctp_mem,
+	.prot_mem    =  sctp_sysctl_mem,
 	.sysctl_rmem =  sysctl_sctp_rmem,
 	.sysctl_wmem =  sysctl_sctp_wmem,
-	.memory_pressure = &sctp_memory_pressure,
+	.memory_pressure = memory_pressure_sctp,
 	.enter_memory_pressure = sctp_enter_memory_pressure,
-	.memory_allocated = &sctp_memory_allocated,
-	.sockets_allocated = &sctp_sockets_allocated,
+	.memory_allocated = memory_allocated_sctp,
+	.sockets_allocated = sockets_allocated_sctp,
 };
 
 #if defined(CONFIG_IPV6) || defined(CONFIG_IPV6_MODULE)
@@ -6863,12 +6882,12 @@ struct proto sctpv6_prot = {
 	.unhash		= sctp_unhash,
 	.get_port	= sctp_get_port,
 	.obj_size	= sizeof(struct sctp6_sock),
-	.sysctl_mem	= sysctl_sctp_mem,
+	.prot_mem	= sctp_sysctl_mem,
 	.sysctl_rmem	= sysctl_sctp_rmem,
 	.sysctl_wmem	= sysctl_sctp_wmem,
-	.memory_pressure = &sctp_memory_pressure,
+	.memory_pressure = memory_pressure_sctp,
 	.enter_memory_pressure = sctp_enter_memory_pressure,
-	.memory_allocated = &sctp_memory_allocated,
-	.sockets_allocated = &sctp_sockets_allocated,
+	.memory_allocated = memory_allocated_sctp,
+	.sockets_allocated = sockets_allocated_sctp,
 };
 #endif /* defined(CONFIG_IPV6) || defined(CONFIG_IPV6_MODULE) */
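
For reference, the accessor indirection this patch introduces boils down
to the standalone userspace sketch below. The names mirror the kernel
ones, but the types are simplified stand-ins (a plain long instead of
atomic_long_t, a stub mem_cgroup), not the real kernel definitions:

#include <stdio.h>

struct mem_cgroup { long tcp_memory_allocated; };

struct proto {
	/* was: atomic_long_t *memory_allocated; now a per-cgroup lookup */
	long *(*memory_allocated)(struct mem_cgroup *cg);
};

struct sock {
	struct proto *sk_prot;
	struct mem_cgroup *sk_cgrp;
};

static long root_tcp_memory_allocated;	/* root-cgroup fallback */

/* Per-cgroup variant: each cgroup carries its own counter. */
static long *memory_allocated_tcp(struct mem_cgroup *cg)
{
	return cg ? &cg->tcp_memory_allocated : &root_tcp_memory_allocated;
}

/* Mirrors sk_memory_allocated(): resolve the counter through the socket,
 * so callers never see which cgroup's counter is actually charged. */
static long sk_memory_allocated(struct sock *sk)
{
	return *sk->sk_prot->memory_allocated(sk->sk_cgrp);
}

int main(void)
{
	struct proto tcp_proto = { .memory_allocated = memory_allocated_tcp };
	struct mem_cgroup cg = { .tcp_memory_allocated = 42 };
	struct sock sk = { .sk_prot = &tcp_proto, .sk_cgrp = &cg };

	printf("allocated: %ld\n", sk_memory_allocated(&sk));
	return 0;
}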
-- 
1.7.6


^ permalink raw reply related	[flat|nested] 119+ messages in thread


* [PATCH v3 4/7] per-cgroup tcp buffers control
  2011-09-19  0:56 ` Glauber Costa
@ 2011-09-19  0:56   ` Glauber Costa
  -1 siblings, 0 replies; 119+ messages in thread
From: Glauber Costa @ 2011-09-19  0:56 UTC (permalink / raw)
  To: linux-kernel
  Cc: paul, lizf, kamezawa.hiroyu, ebiederm, davem, gthelen, netdev,
	linux-mm, kirill, Glauber Costa

With all the infrastructure in place, this patch implements
per-cgroup control for tcp memory pressure handling.
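
As an illustration of what the diff below wires up, here is a minimal
userspace sketch of the per-cgroup state. The tcp_prot_mem[] seeding from
the global sysctl_tcp_mem[] and the per-cgroup pressure flag mirror the
patch; everything else (plain ints, no locking, made-up limit values, no
NET_INC_STATS bookkeeping) is a simplified stand-in, not the kernel code:

#include <string.h>
#include <stdio.h>

static long sysctl_tcp_mem[3] = { 4096, 8192, 16384 }; /* min/pressure/max */

struct mem_cgroup {
	long tcp_prot_mem[3];	/* per-cgroup copy of the three thresholds */
	int tcp_memory_pressure;
};

/* Mirrors tcp_init_cgroup(): a fresh cgroup starts from the global limits. */
static void tcp_init_cgroup(struct mem_cgroup *cg)
{
	cg->tcp_memory_pressure = 0;
	memcpy(cg->tcp_prot_mem, sysctl_tcp_mem, sizeof(sysctl_tcp_mem));
}

/* Mirrors the cgroup-aware tcp_enter_memory_pressure(): the flag now lives
 * in the cgroup, so one group entering pressure does not affect the rest. */
static void tcp_enter_memory_pressure(struct mem_cgroup *cg)
{
	if (!cg->tcp_memory_pressure)
		cg->tcp_memory_pressure = 1;
}

int main(void)
{
	struct mem_cgroup a, b;

	tcp_init_cgroup(&a);
	tcp_init_cgroup(&b);
	tcp_enter_memory_pressure(&a);
	printf("a: %d, b: %d\n", a.tcp_memory_pressure, b.tcp_memory_pressure);
	return 0;
}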

Signed-off-by: Glauber Costa <glommer@parallels.com>
CC: David S. Miller <davem@davemloft.net>
CC: Hiroyouki Kamezawa <kamezawa.hiroyu@jp.fujitsu.com>
CC: Eric W. Biederman <ebiederm@xmission.com>
---
 include/linux/memcontrol.h |    6 +++
 include/net/sock.h         |   15 ++++++-
 include/net/tcp.h          |   10 ++--
 mm/memcontrol.c            |  107 ++++++++++++++++++++++++++++++++++++++++++++
 net/core/sock.c            |   38 ++++++++++++++-
 net/ipv4/tcp.c             |   44 +++++++++---------
 net/ipv4/tcp_ipv4.c        |   14 ++++--
 net/ipv6/tcp_ipv6.c        |   13 +++--
 8 files changed, 205 insertions(+), 42 deletions(-)

diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index 1744ae8..6b8c0c0 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -410,6 +410,12 @@ void memcg_sock_mem_alloc(struct mem_cgroup *mem, struct proto *prot,
 void memcg_sock_mem_free(struct mem_cgroup *mem, struct proto *prot, int amt);
 void memcg_sockets_allocated_dec(struct mem_cgroup *mem, struct proto *prot);
 void memcg_sockets_allocated_inc(struct mem_cgroup *mem, struct proto *prot);
+int tcp_init_cgroup(struct proto *prot, struct cgroup *cgrp,
+		    struct cgroup_subsys *ss);
+int tcp_init_cgroup_fill(struct proto *prot, struct cgroup *cgrp,
+			 struct cgroup_subsys *ss);
+void tcp_destroy_cgroup(struct proto *prot, struct cgroup *cgrp,
+			struct cgroup_subsys *ss);
 #else
 /* memcontrol includes sockets.h, that includes memcontrol.h ... */
 static inline void memcg_sock_mem_alloc(struct mem_cgroup *mem,
diff --git a/include/net/sock.h b/include/net/sock.h
index 78832f9..e9ae8a4 100644
--- a/include/net/sock.h
+++ b/include/net/sock.h
@@ -64,6 +64,8 @@
 #include <net/dst.h>
 #include <net/checksum.h>
 
+int sockets_populate(struct cgroup *cgrp, struct cgroup_subsys *ss);
+void sockets_destroy(struct cgroup *cgrp, struct cgroup_subsys *ss);
 /*
  * This structure really needs to be cleaned up.
  * Most of it is for TCP, and not used by any of
@@ -814,7 +816,18 @@ struct proto {
 	int			*(*memory_pressure)(struct mem_cgroup *sg);
 	/* Pointer to the per-cgroup version of the sysctl_mem field */
 	long			*(*prot_mem)(struct mem_cgroup *sg);
-
+	/*
+	 * cgroup-specific init/deinit functions. Called once for every
+	 * protocol that implements them, from the cgroup's populate
+	 * function. init_cgroup has to set up any files the protocol
+	 * wants to appear in the kmem cgroup filesystem.
+	 */
+	int			(*init_cgroup)(struct proto *prot,
+					       struct cgroup *cgrp,
+					       struct cgroup_subsys *ss);
+	void			(*destroy_cgroup)(struct proto *prot,
+						  struct cgroup *cgrp,
+						  struct cgroup_subsys *ss);
 	int			*sysctl_wmem;
 	int			*sysctl_rmem;
 	int			max_header;
diff --git a/include/net/tcp.h b/include/net/tcp.h
index c835ae3..ce3c211 100644
--- a/include/net/tcp.h
+++ b/include/net/tcp.h
@@ -255,10 +255,10 @@ extern int sysctl_tcp_thin_linear_timeouts;
 extern int sysctl_tcp_thin_dupack;
 
 struct mem_cgroup;
-extern long *tcp_sysctl_mem(struct mem_cgroup *sg);
-struct percpu_counter *sockets_allocated_tcp(struct mem_cgroup *sg);
-int *memory_pressure_tcp(struct mem_cgroup *sg);
-atomic_long_t *memory_allocated_tcp(struct mem_cgroup *sg);
+extern long *tcp_sysctl_mem_nocg(struct mem_cgroup *sg);
+struct percpu_counter *sockets_allocated_tcp_nocg(struct mem_cgroup *sg);
+int *memory_pressure_tcp_nocg(struct mem_cgroup *sg);
+atomic_long_t *memory_allocated_tcp_nocg(struct mem_cgroup *sg);
 
 /*
  * The next routines deal with comparing 32 bit unsigned ints
@@ -1002,7 +1002,7 @@ static inline void tcp_openreq_init(struct request_sock *req,
 	ireq->loc_port = tcp_hdr(skb)->dest;
 }
 
-extern void tcp_enter_memory_pressure(struct sock *sk);
+extern void tcp_enter_memory_pressure_nocg(struct sock *sk);
 
 static inline int keepalive_intvl_when(const struct tcp_sock *tp)
 {
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 03d6d61..4bcb052 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -343,6 +343,13 @@ struct mem_cgroup {
 	 */
 	struct mem_cgroup_stat_cpu nocpu_base;
 	spinlock_t pcp_counter_lock;
+
+	/* per-cgroup tcp memory pressure knobs */
+	atomic_long_t tcp_memory_allocated;
+	struct percpu_counter tcp_sockets_allocated;
+	/* those two are read-mostly, leave them at the end */
+	long tcp_prot_mem[3];
+	int tcp_memory_pressure;
 };
 
 static struct mem_cgroup *parent_mem_cgroup(struct mem_cgroup *mem);
@@ -350,6 +357,8 @@ static struct mem_cgroup *parent_mem_cgroup(struct mem_cgroup *mem);
 #ifdef CONFIG_CGROUP_MEM_RES_CTLR_KMEM
 #ifdef CONFIG_INET
 #include <net/sock.h>
+#include <net/tcp.h>
+#include <net/ip.h>
 
 void sock_update_memcg(struct sock *sk)
 {
@@ -419,6 +428,90 @@ void memcg_sockets_allocated_inc(struct mem_cgroup *mem, struct proto *prot)
 		percpu_counter_inc(prot->sockets_allocated(mem));
 }
 EXPORT_SYMBOL(memcg_sockets_allocated_inc);
+
+static struct mem_cgroup *mem_cgroup_from_cont(struct cgroup *cont);
+/*
+ * Pressure flag: try to collapse.
+ * Technical note: it is used by multiple contexts non atomically.
+ * All the __sk_mem_schedule() is of this nature: accounting
+ * is strict, actions are advisory and have some latency.
+ */
+void tcp_enter_memory_pressure(struct sock *sk)
+{
+	struct mem_cgroup *sg = sk->sk_cgrp;
+	if (!sg->tcp_memory_pressure) {
+		NET_INC_STATS(sock_net(sk), LINUX_MIB_TCPMEMORYPRESSURES);
+		sg->tcp_memory_pressure = 1;
+	}
+}
+
+long *tcp_sysctl_mem(struct mem_cgroup *cg)
+{
+	return cg->tcp_prot_mem;
+}
+
+atomic_long_t *memory_allocated_tcp(struct mem_cgroup *cg)
+{
+	return &(cg->tcp_memory_allocated);
+}
+
+int *memory_pressure_tcp(struct mem_cgroup *sg)
+{
+	return &sg->tcp_memory_pressure;
+}
+
+struct percpu_counter *sockets_allocated_tcp(struct mem_cgroup *sg)
+{
+	return &sg->tcp_sockets_allocated;
+}
+
+/*
+ * For ipv6, we only need to fill in the function pointers (can't initialize
+ * things twice). So keep it separate.
+ */
+int tcp_init_cgroup_fill(struct proto *prot, struct cgroup *cgrp,
+			 struct cgroup_subsys *ss)
+{
+	prot->enter_memory_pressure = tcp_enter_memory_pressure;
+	prot->memory_allocated = memory_allocated_tcp;
+	prot->prot_mem = tcp_sysctl_mem;
+	prot->sockets_allocated = sockets_allocated_tcp;
+	prot->memory_pressure = memory_pressure_tcp;
+
+	return 0;
+}
+EXPORT_SYMBOL(tcp_init_cgroup_fill);
+
+int tcp_init_cgroup(struct proto *prot, struct cgroup *cgrp,
+		    struct cgroup_subsys *ss)
+{
+	struct mem_cgroup *cg = mem_cgroup_from_cont(cgrp);
+	unsigned long limit;
+
+	cg->tcp_memory_pressure = 0;
+	atomic_long_set(&cg->tcp_memory_allocated, 0);
+	percpu_counter_init(&cg->tcp_sockets_allocated, 0);
+
+	limit = nr_free_buffer_pages() / 8;
+	limit = max(limit, 128UL);
+
+	cg->tcp_prot_mem[0] = sysctl_tcp_mem[0];
+	cg->tcp_prot_mem[1] = sysctl_tcp_mem[1];
+	cg->tcp_prot_mem[2] = sysctl_tcp_mem[2];
+
+	tcp_init_cgroup_fill(prot, cgrp, ss);
+	return 0;
+}
+EXPORT_SYMBOL(tcp_init_cgroup);
+
+void tcp_destroy_cgroup(struct proto *prot, struct cgroup *cgrp,
+			struct cgroup_subsys *ss)
+{
+	struct mem_cgroup *cg = mem_cgroup_from_cont(cgrp);
+
+	percpu_counter_destroy(&cg->tcp_sockets_allocated);
+}
+EXPORT_SYMBOL(tcp_destroy_cgroup);
 #endif /* CONFIG_INET */
 #endif /* CONFIG_CGROUP_MEM_RES_CTLR_KMEM */
 
@@ -5026,9 +5119,21 @@ static int register_kmem_files(struct cgroup *cont, struct cgroup_subsys *ss)
 	if (!mem_cgroup_is_root(mem))
 		ret = cgroup_add_files(cont, ss, kmem_cgroup_files,
 					ARRAY_SIZE(kmem_cgroup_files));
+
+	if (!ret)
+		ret = sockets_populate(cont, ss);
+
 	return ret;
 };
 
+static void kmem_cgroup_destroy(struct cgroup_subsys *ss,
+				struct cgroup *cont)
+{
+	if (!do_kmem_account)
+		return;
+
+	sockets_destroy(cont, ss);
+}
 #else
 static int register_kmem_files(struct cgroup *cont, struct cgroup_subsys *ss)
 {
@@ -5277,6 +5382,8 @@ static void mem_cgroup_destroy(struct cgroup_subsys *ss,
 {
 	struct mem_cgroup *mem = mem_cgroup_from_cont(cont);
 
+	kmem_cgroup_destroy(ss, cont);
+
 	mem_cgroup_put(mem);
 }
 
diff --git a/net/core/sock.c b/net/core/sock.c
index 338d572..92cf417 100644
--- a/net/core/sock.c
+++ b/net/core/sock.c
@@ -135,6 +135,41 @@
 #include <net/tcp.h>
 #endif
 
+static DEFINE_RWLOCK(proto_list_lock);
+static LIST_HEAD(proto_list);
+
+int sockets_populate(struct cgroup *cgrp, struct cgroup_subsys *ss)
+{
+	struct proto *proto;
+	int ret = 0;
+
+	read_lock(&proto_list_lock);
+	list_for_each_entry(proto, &proto_list, node) {
+		if (proto->init_cgroup)
+			ret |= proto->init_cgroup(proto, cgrp, ss);
+	}
+	if (!ret)
+		goto out;
+
+	list_for_each_entry_continue_reverse(proto, &proto_list, node)
+		if (proto->destroy_cgroup)
+			proto->destroy_cgroup(proto, cgrp, ss);
+
+out:
+	read_unlock(&proto_list_lock);
+	return ret;
+}
+
+void sockets_destroy(struct cgroup *cgrp, struct cgroup_subsys *ss)
+{
+	struct proto *proto;
+	read_lock(&proto_list_lock);
+	list_for_each_entry_reverse(proto, &proto_list, node)
+		if (proto->destroy_cgroup)
+			proto->destroy_cgroup(proto, cgrp, ss);
+	read_unlock(&proto_list_lock);
+}
+
 /*
  * Each address family might have different locking rules, so we have
  * one slock key per address family:
@@ -2260,9 +2295,6 @@ void sk_common_release(struct sock *sk)
 }
 EXPORT_SYMBOL(sk_common_release);
 
-static DEFINE_RWLOCK(proto_list_lock);
-static LIST_HEAD(proto_list);
-
 #ifdef CONFIG_PROC_FS
 #define PROTO_INUSE_NR	64	/* should be enough for the first time */
 struct prot_inuse {
diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
index 452245f..156b836 100644
--- a/net/ipv4/tcp.c
+++ b/net/ipv4/tcp.c
@@ -290,13 +290,6 @@ EXPORT_SYMBOL(sysctl_tcp_mem);
 EXPORT_SYMBOL(sysctl_tcp_rmem);
 EXPORT_SYMBOL(sysctl_tcp_wmem);
 
-atomic_long_t tcp_memory_allocated;	/* Current allocated memory. */
-
-/*
- * Current number of TCP sockets.
- */
-struct percpu_counter tcp_sockets_allocated;
-
 /*
  * TCP splice context
  */
@@ -306,46 +299,49 @@ struct tcp_splice_state {
 	unsigned int flags;
 };
 
-/*
- * Pressure flag: try to collapse.
- * Technical note: it is used by multiple contexts non atomically.
- * All the __sk_mem_schedule() is of this nature: accounting
- * is strict, actions are advisory and have some latency.
- */
+/* Current number of TCP sockets. */
+struct percpu_counter tcp_sockets_allocated;
+atomic_long_t tcp_memory_allocated;	/* Current allocated memory. */
 int tcp_memory_pressure __read_mostly;
 
-int *memory_pressure_tcp(struct mem_cgroup *sg)
+int *memory_pressure_tcp_nocg(struct mem_cgroup *sg)
 {
 	return &tcp_memory_pressure;
 }
-EXPORT_SYMBOL(memory_pressure_tcp);
+EXPORT_SYMBOL(memory_pressure_tcp_nocg);
 
-struct percpu_counter *sockets_allocated_tcp(struct mem_cgroup *sg)
+struct percpu_counter *sockets_allocated_tcp_nocg(struct mem_cgroup *sg)
 {
 	return &tcp_sockets_allocated;
 }
-EXPORT_SYMBOL(sockets_allocated_tcp);
+EXPORT_SYMBOL(sockets_allocated_tcp_nocg);
 
-void tcp_enter_memory_pressure(struct sock *sk)
+/*
+ * Pressure flag: try to collapse.
+ * Technical note: it is used by multiple contexts non atomically.
+ * All the __sk_mem_schedule() is of this nature: accounting
+ * is strict, actions are advisory and have some latency.
+ */
+void tcp_enter_memory_pressure_nocg(struct sock *sk)
 {
 	if (!tcp_memory_pressure) {
 		NET_INC_STATS(sock_net(sk), LINUX_MIB_TCPMEMORYPRESSURES);
 		tcp_memory_pressure = 1;
 	}
 }
-EXPORT_SYMBOL(tcp_enter_memory_pressure);
+EXPORT_SYMBOL(tcp_enter_memory_pressure_nocg);
 
-long *tcp_sysctl_mem(struct mem_cgroup *sg)
+long *tcp_sysctl_mem_nocg(struct mem_cgroup *sg)
 {
 	return sysctl_tcp_mem;
 }
-EXPORT_SYMBOL(tcp_sysctl_mem);
+EXPORT_SYMBOL(tcp_sysctl_mem_nocg);
 
-atomic_long_t *memory_allocated_tcp(struct mem_cgroup *sg)
+atomic_long_t *memory_allocated_tcp_nocg(struct mem_cgroup *sg)
 {
 	return &tcp_memory_allocated;
 }
-EXPORT_SYMBOL(memory_allocated_tcp);
+EXPORT_SYMBOL(memory_allocated_tcp_nocg);
 
 /* Convert seconds to retransmits based on initial and max timeout */
 static u8 secs_to_retrans(int seconds, int timeout, int rto_max)
@@ -3247,7 +3243,9 @@ void __init tcp_init(void)
 
 	BUILD_BUG_ON(sizeof(struct tcp_skb_cb) > sizeof(skb->cb));
 
+#ifndef CONFIG_CGROUP_MEM_RES_CTLR_KMEM
 	percpu_counter_init(&tcp_sockets_allocated, 0);
+#endif
 	percpu_counter_init(&tcp_orphan_count, 0);
 	tcp_hashinfo.bind_bucket_cachep =
 		kmem_cache_create("tcp_bind_bucket",
diff --git a/net/ipv4/tcp_ipv4.c b/net/ipv4/tcp_ipv4.c
index cbb0d5e..c857baf 100644
--- a/net/ipv4/tcp_ipv4.c
+++ b/net/ipv4/tcp_ipv4.c
@@ -2597,12 +2597,16 @@ struct proto tcp_prot = {
 	.hash			= inet_hash,
 	.unhash			= inet_unhash,
 	.get_port		= inet_csk_get_port,
-	.enter_memory_pressure	= tcp_enter_memory_pressure,
-	.memory_pressure	= memory_pressure_tcp,
-	.sockets_allocated	= sockets_allocated_tcp,
+	.enter_memory_pressure	= tcp_enter_memory_pressure_nocg,
+	.memory_pressure	= memory_pressure_tcp_nocg,
+	.sockets_allocated	= sockets_allocated_tcp_nocg,
 	.orphan_count		= &tcp_orphan_count,
-	.memory_allocated	= memory_allocated_tcp,
-	.prot_mem		= tcp_sysctl_mem,
+	.memory_allocated	= memory_allocated_tcp_nocg,
+#ifdef CONFIG_CGROUP_MEM_RES_CTLR_KMEM
+	.init_cgroup		= tcp_init_cgroup,
+	.destroy_cgroup		= tcp_destroy_cgroup,
+#endif
+	.prot_mem		= tcp_sysctl_mem_nocg,
 	.sysctl_wmem		= sysctl_tcp_wmem,
 	.sysctl_rmem		= sysctl_tcp_rmem,
 	.max_header		= MAX_TCP_HEADER,
diff --git a/net/ipv6/tcp_ipv6.c b/net/ipv6/tcp_ipv6.c
index 807797a..5cd13c9 100644
--- a/net/ipv6/tcp_ipv6.c
+++ b/net/ipv6/tcp_ipv6.c
@@ -2220,12 +2220,15 @@ struct proto tcpv6_prot = {
 	.hash			= tcp_v6_hash,
 	.unhash			= inet_unhash,
 	.get_port		= inet_csk_get_port,
-	.enter_memory_pressure	= tcp_enter_memory_pressure,
-	.sockets_allocated	= sockets_allocated_tcp,
-	.memory_allocated	= memory_allocated_tcp,
-	.memory_pressure	= memory_pressure_tcp,
+	.enter_memory_pressure	= tcp_enter_memory_pressure_nocg,
+	.sockets_allocated	= sockets_allocated_tcp_nocg,
+	.memory_allocated	= memory_allocated_tcp_nocg,
+	.memory_pressure	= memory_pressure_tcp_nocg,
 	.orphan_count		= &tcp_orphan_count,
-	.prot_mem		= tcp_sysctl_mem,
+	.prot_mem		= tcp_sysctl_mem_nocg,
+#ifdef CONFIG_CGROUP_MEM_RES_CTLR_KMEM
+	.init_cgroup		= tcp_init_cgroup_fill,
+#endif
 	.sysctl_wmem		= sysctl_tcp_wmem,
 	.sysctl_rmem		= sysctl_tcp_rmem,
 	.max_header		= MAX_TCP_HEADER,
-- 
1.7.6


^ permalink raw reply related	[flat|nested] 119+ messages in thread

* [PATCH v3 5/7] per-netns ipv4 sysctl_tcp_mem
  2011-09-19  0:56 ` Glauber Costa
@ 2011-09-19  0:56   ` Glauber Costa
  -1 siblings, 0 replies; 119+ messages in thread
From: Glauber Costa @ 2011-09-19  0:56 UTC (permalink / raw)
  To: linux-kernel
  Cc: paul, lizf, kamezawa.hiroyu, ebiederm, davem, gthelen, netdev,
	linux-mm, kirill, Glauber Costa

This patch allows each network namespace to independently set up
its levels for tcp memory pressure thresholds. This patch
alone does not buy much: we also need to make these values
per group of processes somehow. This is achieved in the
patches that follow in this patchset.
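
For illustration only (not part of the patch itself), a sketch of how
the per-netns defaults end up being computed, mirroring the code this
patch adds to ipv4_sysctl_init_net(); the page counts in the comments
assume a hypothetical box where nr_free_buffer_pages() returns
1,000,000 (tcp_mem is measured in pages):

	unsigned long limit;

	limit = nr_free_buffer_pages() / 8;		/* 125,000 pages */
	limit = max(limit, 128UL);
	net->ipv4.sysctl_tcp_mem[0] = limit / 4 * 3;	/*  93,750 */
	net->ipv4.sysctl_tcp_mem[1] = limit;		/* 125,000 */
	net->ipv4.sysctl_tcp_mem[2] =
		net->ipv4.sysctl_tcp_mem[0] * 2;	/* 187,500 */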

Signed-off-by: Glauber Costa <glommer@parallels.com>
CC: David S. Miller <davem@davemloft.net>
CC: Hiroyuki Kamezawa <kamezawa.hiroyu@jp.fujitsu.com>
CC: Eric W. Biederman <ebiederm@xmission.com>
---
 include/net/netns/ipv4.h   |    1 +
 include/net/tcp.h          |    1 -
 mm/memcontrol.c            |    8 ++++--
 net/ipv4/sysctl_net_ipv4.c |   51 +++++++++++++++++++++++++++++++++++++------
 net/ipv4/tcp.c             |   13 ++--------
 5 files changed, 53 insertions(+), 21 deletions(-)

diff --git a/include/net/netns/ipv4.h b/include/net/netns/ipv4.h
index d786b4f..bbd023a 100644
--- a/include/net/netns/ipv4.h
+++ b/include/net/netns/ipv4.h
@@ -55,6 +55,7 @@ struct netns_ipv4 {
 	int current_rt_cache_rebuild_count;
 
 	unsigned int sysctl_ping_group_range[2];
+	long sysctl_tcp_mem[3];
 
 	atomic_t rt_genid;
 	atomic_t dev_addr_genid;
diff --git a/include/net/tcp.h b/include/net/tcp.h
index ce3c211..257e1f9 100644
--- a/include/net/tcp.h
+++ b/include/net/tcp.h
@@ -231,7 +231,6 @@ extern int sysctl_tcp_fack;
 extern int sysctl_tcp_reordering;
 extern int sysctl_tcp_ecn;
 extern int sysctl_tcp_dsack;
-extern long sysctl_tcp_mem[3];
 extern int sysctl_tcp_wmem[3];
 extern int sysctl_tcp_rmem[3];
 extern int sysctl_tcp_app_win;
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 4bcb052..5e9b2c7 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -359,6 +359,7 @@ static struct mem_cgroup *parent_mem_cgroup(struct mem_cgroup *mem);
 #include <net/sock.h>
 #include <net/tcp.h>
 #include <net/ip.h>
+#include <linux/nsproxy.h>
 
 void sock_update_memcg(struct sock *sk)
 {
@@ -487,6 +488,7 @@ int tcp_init_cgroup(struct proto *prot, struct cgroup *cgrp,
 {
 	struct mem_cgroup *cg = mem_cgroup_from_cont(cgrp);
 	unsigned long limit;
+	struct net *net = current->nsproxy->net_ns;
 
 	cg->tcp_memory_pressure = 0;
 	atomic_long_set(&cg->tcp_memory_allocated, 0);
@@ -495,9 +497,9 @@ int tcp_init_cgroup(struct proto *prot, struct cgroup *cgrp,
 	limit = nr_free_buffer_pages() / 8;
 	limit = max(limit, 128UL);
 
-	cg->tcp_prot_mem[0] = sysctl_tcp_mem[0];
-	cg->tcp_prot_mem[1] = sysctl_tcp_mem[1];
-	cg->tcp_prot_mem[2] = sysctl_tcp_mem[2];
+	cg->tcp_prot_mem[0] = net->ipv4.sysctl_tcp_mem[0];
+	cg->tcp_prot_mem[1] = net->ipv4.sysctl_tcp_mem[1];
+	cg->tcp_prot_mem[2] = net->ipv4.sysctl_tcp_mem[2];
 
 	tcp_init_cgroup_fill(prot, cgrp, ss);
 	return 0;
diff --git a/net/ipv4/sysctl_net_ipv4.c b/net/ipv4/sysctl_net_ipv4.c
index 69fd720..bbd67ab 100644
--- a/net/ipv4/sysctl_net_ipv4.c
+++ b/net/ipv4/sysctl_net_ipv4.c
@@ -14,6 +14,7 @@
 #include <linux/init.h>
 #include <linux/slab.h>
 #include <linux/nsproxy.h>
+#include <linux/swap.h>
 #include <net/snmp.h>
 #include <net/icmp.h>
 #include <net/ip.h>
@@ -174,6 +175,36 @@ static int proc_allowed_congestion_control(ctl_table *ctl,
 	return ret;
 }
 
+static int ipv4_tcp_mem(ctl_table *ctl, int write,
+			   void __user *buffer, size_t *lenp,
+			   loff_t *ppos)
+{
+	int ret;
+	unsigned long vec[3];
+	struct net *net = current->nsproxy->net_ns;
+
+	ctl_table tmp = {
+		.data = &vec,
+		.maxlen = sizeof(vec),
+		.mode = ctl->mode,
+	};
+
+	if (!write) {
+		ctl->data = &net->ipv4.sysctl_tcp_mem;
+		return proc_doulongvec_minmax(ctl, write, buffer, lenp, ppos);
+	}
+
+	ret = proc_doulongvec_minmax(&tmp, write, buffer, lenp, ppos);
+	if (ret)
+		return ret;
+
+	net->ipv4.sysctl_tcp_mem[0] = vec[0];
+	net->ipv4.sysctl_tcp_mem[1] = vec[1];
+	net->ipv4.sysctl_tcp_mem[2] = vec[2];
+
+	return 0;
+}
+
 static struct ctl_table ipv4_table[] = {
 	{
 		.procname	= "tcp_timestamps",
@@ -433,13 +464,6 @@ static struct ctl_table ipv4_table[] = {
 		.proc_handler	= proc_dointvec
 	},
 	{
-		.procname	= "tcp_mem",
-		.data		= &sysctl_tcp_mem,
-		.maxlen		= sizeof(sysctl_tcp_mem),
-		.mode		= 0644,
-		.proc_handler	= proc_doulongvec_minmax
-	},
-	{
 		.procname	= "tcp_wmem",
 		.data		= &sysctl_tcp_wmem,
 		.maxlen		= sizeof(sysctl_tcp_wmem),
@@ -721,6 +745,12 @@ static struct ctl_table ipv4_net_table[] = {
 		.mode		= 0644,
 		.proc_handler	= ipv4_ping_group_range,
 	},
+	{
+		.procname	= "tcp_mem",
+		.maxlen		= sizeof(init_net.ipv4.sysctl_tcp_mem),
+		.mode		= 0644,
+		.proc_handler	= ipv4_tcp_mem,
+	},
 	{ }
 };
 
@@ -734,6 +764,7 @@ EXPORT_SYMBOL_GPL(net_ipv4_ctl_path);
 static __net_init int ipv4_sysctl_init_net(struct net *net)
 {
 	struct ctl_table *table;
+	unsigned long limit;
 
 	table = ipv4_net_table;
 	if (!net_eq(net, &init_net)) {
@@ -769,6 +800,12 @@ static __net_init int ipv4_sysctl_init_net(struct net *net)
 
 	net->ipv4.sysctl_rt_cache_rebuild_count = 4;
 
+	limit = nr_free_buffer_pages() / 8;
+	limit = max(limit, 128UL);
+	net->ipv4.sysctl_tcp_mem[0] = limit / 4 * 3;
+	net->ipv4.sysctl_tcp_mem[1] = limit;
+	net->ipv4.sysctl_tcp_mem[2] = net->ipv4.sysctl_tcp_mem[0] * 2;
+
 	net->ipv4.ipv4_hdr = register_net_sysctl_table(net,
 			net_ipv4_ctl_path, table);
 	if (net->ipv4.ipv4_hdr == NULL)
diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
index 156b836..a94a0f1 100644
--- a/net/ipv4/tcp.c
+++ b/net/ipv4/tcp.c
@@ -282,11 +282,9 @@ int sysctl_tcp_fin_timeout __read_mostly = TCP_FIN_TIMEOUT;
 struct percpu_counter tcp_orphan_count;
 EXPORT_SYMBOL_GPL(tcp_orphan_count);
 
-long sysctl_tcp_mem[3] __read_mostly;
 int sysctl_tcp_wmem[3] __read_mostly;
 int sysctl_tcp_rmem[3] __read_mostly;
 
-EXPORT_SYMBOL(sysctl_tcp_mem);
 EXPORT_SYMBOL(sysctl_tcp_rmem);
 EXPORT_SYMBOL(sysctl_tcp_wmem);
 
@@ -333,7 +331,7 @@ EXPORT_SYMBOL(tcp_enter_memory_pressure_nocg);
 
 long *tcp_sysctl_mem_nocg(struct mem_cgroup *sg)
 {
-	return sysctl_tcp_mem;
+	return init_net.ipv4.sysctl_tcp_mem;
 }
 EXPORT_SYMBOL(tcp_sysctl_mem_nocg);
 
@@ -3296,14 +3294,9 @@ void __init tcp_init(void)
 	sysctl_tcp_max_orphans = cnt / 2;
 	sysctl_max_syn_backlog = max(128, cnt / 256);
 
-	limit = nr_free_buffer_pages() / 8;
-	limit = max(limit, 128UL);
-	sysctl_tcp_mem[0] = limit / 4 * 3;
-	sysctl_tcp_mem[1] = limit;
-	sysctl_tcp_mem[2] = sysctl_tcp_mem[0] * 2;
-
 	/* Set per-socket limits to no more than 1/128 the pressure threshold */
-	limit = ((unsigned long)sysctl_tcp_mem[1]) << (PAGE_SHIFT - 7);
+	limit = ((unsigned long)init_net.ipv4.sysctl_tcp_mem[1])
+		<< (PAGE_SHIFT - 7);
 	max_share = min(4UL*1024*1024, limit);
 
 	sysctl_tcp_wmem[0] = SK_MEM_QUANTUM;
-- 
1.7.6


^ permalink raw reply related	[flat|nested] 119+ messages in thread

* [PATCH v3 6/7] tcp buffer limitation: per-cgroup limit
  2011-09-19  0:56 ` Glauber Costa
@ 2011-09-19  0:56   ` Glauber Costa
  -1 siblings, 0 replies; 119+ messages in thread
From: Glauber Costa @ 2011-09-19  0:56 UTC (permalink / raw)
  To: linux-kernel
  Cc: paul, lizf, kamezawa.hiroyu, ebiederm, davem, gthelen, netdev,
	linux-mm, kirill, Glauber Costa

This patch uses the "tcp_max_mem" field of the kmem_cgroup to
effectively control the amount of kernel memory pinned by a cgroup.

We have to make sure that none of the memory pressure thresholds
specified in the namespace are bigger than the current cgroup.
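
In other words, a write to the new per-cgroup limit file is first
clamped to the parent cgroup's limit, and the result then caps each
of the three per-netns thresholds. A condensed sketch of that rule,
mirroring tcp_write_maxmem() in the patch below:

	if (parent && val > parent->tcp_max_memory)
		val = parent->tcp_max_memory;
	sg->tcp_max_memory = val;

	for (i = 0; i < 3; i++)
		sg->tcp_prot_mem[i] = min_t(long, val,
					    net->ipv4.sysctl_tcp_mem[i]);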

Signed-off-by: Glauber Costa <glommer@parallels.com>
CC: David S. Miller <davem@davemloft.net>
CC: Hiroyuki Kamezawa <kamezawa.hiroyu@jp.fujitsu.com>
CC: Eric W. Biederman <ebiederm@xmission.com>
---
 Documentation/cgroups/memory.txt |    1 +
 include/linux/memcontrol.h       |   10 ++++
 mm/memcontrol.c                  |   89 +++++++++++++++++++++++++++++++++++---
 net/ipv4/sysctl_net_ipv4.c       |   20 ++++++++
 4 files changed, 113 insertions(+), 7 deletions(-)

diff --git a/Documentation/cgroups/memory.txt b/Documentation/cgroups/memory.txt
index 6f1954a..1ffde3e 100644
--- a/Documentation/cgroups/memory.txt
+++ b/Documentation/cgroups/memory.txt
@@ -78,6 +78,7 @@ Brief summary of control files.
 
  memory.independent_kmem_limit	 # select whether or not kernel memory limits are
 				   independent of user limits
+ memory.kmem.tcp.max_memory      # set/show hard limit for tcp buf memory
 
 1. History
 
diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index 6b8c0c0..2df6db8 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -416,6 +416,9 @@ int tcp_init_cgroup_fill(struct proto *prot, struct cgroup *cgrp,
 			 struct cgroup_subsys *ss);
 void tcp_destroy_cgroup(struct proto *prot, struct cgroup *cgrp,
 			struct cgroup_subsys *ss);
+
+unsigned long tcp_max_memory(struct mem_cgroup *cg);
+void tcp_prot_mem(struct mem_cgroup *cg, long val, int idx);
 #else
 /* memcontrol includes sockets.h, that includes memcontrol.h ... */
 static inline void memcg_sock_mem_alloc(struct mem_cgroup *mem,
@@ -441,6 +444,13 @@ static inline void sock_update_memcg(struct sock *sk)
 static inline void sock_release_memcg(struct sock *sk)
 {
 }
+static inline unsigned long tcp_max_memory(struct mem_cgroup *cg)
+{
+	return 0;
+}
+static inline void tcp_prot_mem(struct mem_cgroup *cg, long val, int idx)
+{
+}
 #endif /* CONFIG_CGROUP_MEM_RES_CTLR_KMEM */
 #endif /* CONFIG_INET */
 #endif /* _LINUX_MEMCONTROL_H */
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 5e9b2c7..be5ab89 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -345,6 +345,7 @@ struct mem_cgroup {
 	spinlock_t pcp_counter_lock;
 
 	/* per-cgroup tcp memory pressure knobs */
+	int tcp_max_memory;
 	atomic_long_t tcp_memory_allocated;
 	struct percpu_counter tcp_sockets_allocated;
 	/* those two are read-mostly, leave them at the end */
@@ -352,6 +353,11 @@ struct mem_cgroup {
 	int tcp_memory_pressure;
 };
 
+static inline bool mem_cgroup_is_root(struct mem_cgroup *mem)
+{
+	return (mem == root_mem_cgroup);
+}
+
 static struct mem_cgroup *parent_mem_cgroup(struct mem_cgroup *mem);
 /* Writing them here to avoid exposing memcg's inner layout */
 #ifdef CONFIG_CGROUP_MEM_RES_CTLR_KMEM
@@ -466,6 +472,56 @@ struct percpu_counter *sockets_allocated_tcp(struct mem_cgroup *sg)
 	return &sg->tcp_sockets_allocated;
 }
 
+static int tcp_write_maxmem(struct cgroup *cgrp, struct cftype *cft, u64 val)
+{
+	struct mem_cgroup *sg = mem_cgroup_from_cont(cgrp);
+	struct mem_cgroup *parent = parent_mem_cgroup(sg);
+	struct net *net = current->nsproxy->net_ns;
+	int i;
+
+	if (!cgroup_lock_live_group(cgrp))
+		return -ENODEV;
+
+	/*
+	 * We can't allow more memory than our parents. Since this
+	 * will be tested for all calls, by induction, there is no need
+	 * to test any parent other than our own.
+	 */
+	if (parent && (val > parent->tcp_max_memory))
+		val = parent->tcp_max_memory;
+
+	sg->tcp_max_memory = val;
+
+	for (i = 0; i < 3; i++)
+		sg->tcp_prot_mem[i] = min_t(long, val,
+					     net->ipv4.sysctl_tcp_mem[i]);
+
+	cgroup_unlock();
+
+	return 0;
+}
+
+static u64 tcp_read_maxmem(struct cgroup *cgrp, struct cftype *cft)
+{
+	struct mem_cgroup *sg = mem_cgroup_from_cont(cgrp);
+	u64 ret;
+
+	if (!cgroup_lock_live_group(cgrp))
+		return -ENODEV;
+	ret = sg->tcp_max_memory;
+
+	cgroup_unlock();
+	return ret;
+}
+
+static struct cftype tcp_files[] = {
+	{
+		.name = "kmem.tcp.max_memory",
+		.write_u64 = tcp_write_maxmem,
+		.read_u64 = tcp_read_maxmem,
+	},
+};
+
 /*
  * For ipv6, we only need to fill in the function pointers (can't initialize
  * things twice). So keep it separated
@@ -487,8 +543,10 @@ int tcp_init_cgroup(struct proto *prot, struct cgroup *cgrp,
 		    struct cgroup_subsys *ss)
 {
 	struct mem_cgroup *cg = mem_cgroup_from_cont(cgrp);
+	struct mem_cgroup *parent = parent_mem_cgroup(cg);
 	unsigned long limit;
 	struct net *net = current->nsproxy->net_ns;
+	int ret = 0;
 
 	cg->tcp_memory_pressure = 0;
 	atomic_long_set(&cg->tcp_memory_allocated, 0);
@@ -497,12 +555,25 @@ int tcp_init_cgroup(struct proto *prot, struct cgroup *cgrp,
 	limit = nr_free_buffer_pages() / 8;
 	limit = max(limit, 128UL);
 
+	if (parent)
+		cg->tcp_max_memory = parent->tcp_max_memory;
+	else
+		cg->tcp_max_memory = limit * 2;
+
 	cg->tcp_prot_mem[0] = net->ipv4.sysctl_tcp_mem[0];
 	cg->tcp_prot_mem[1] = net->ipv4.sysctl_tcp_mem[1];
 	cg->tcp_prot_mem[2] = net->ipv4.sysctl_tcp_mem[2];
 
 	tcp_init_cgroup_fill(prot, cgrp, ss);
-	return 0;
+	/*
+	 * For the root cgroup, we still set up all tcp-related variables,
+	 * but to be consistent with the rest of kmem management, we don't
+	 * expose any of the control files.
+	 */
+	if (!mem_cgroup_is_root(cg))
+		ret = cgroup_add_files(cgrp, ss, tcp_files,
+				       ARRAY_SIZE(tcp_files));
+	return ret;
 }
 EXPORT_SYMBOL(tcp_init_cgroup);
 
@@ -514,6 +585,16 @@ void tcp_destroy_cgroup(struct proto *prot, struct cgroup *cgrp,
 	percpu_counter_destroy(&cg->tcp_sockets_allocated);
 }
 EXPORT_SYMBOL(tcp_destroy_cgroup);
+
+unsigned long tcp_max_memory(struct mem_cgroup *cg)
+{
+	return cg->tcp_max_memory;
+}
+
+void tcp_prot_mem(struct mem_cgroup *cg, long val, int idx)
+{
+	cg->tcp_prot_mem[idx] = val;
+}
 #endif /* CONFIG_INET */
 #endif /* CONFIG_CGROUP_MEM_RES_CTLR_KMEM */
 
@@ -1092,12 +1173,6 @@ static struct mem_cgroup *mem_cgroup_get_next(struct mem_cgroup *iter,
 #define for_each_mem_cgroup_all(iter) \
 	for_each_mem_cgroup_tree_cond(iter, NULL, true)
 
-
-static inline bool mem_cgroup_is_root(struct mem_cgroup *mem)
-{
-	return (mem == root_mem_cgroup);
-}
-
 void mem_cgroup_count_vm_event(struct mm_struct *mm, enum vm_event_item idx)
 {
 	struct mem_cgroup *mem;
diff --git a/net/ipv4/sysctl_net_ipv4.c b/net/ipv4/sysctl_net_ipv4.c
index bbd67ab..cdc35f6 100644
--- a/net/ipv4/sysctl_net_ipv4.c
+++ b/net/ipv4/sysctl_net_ipv4.c
@@ -14,6 +14,7 @@
 #include <linux/init.h>
 #include <linux/slab.h>
 #include <linux/nsproxy.h>
+#include <linux/memcontrol.h>
 #include <linux/swap.h>
 #include <net/snmp.h>
 #include <net/icmp.h>
@@ -182,6 +183,10 @@ static int ipv4_tcp_mem(ctl_table *ctl, int write,
 	int ret;
 	unsigned long vec[3];
 	struct net *net = current->nsproxy->net_ns;
+#ifdef CONFIG_CGROUP_MEM_RES_CTLR_KMEM
+	int i;
+	struct mem_cgroup *cg;
+#endif
 
 	ctl_table tmp = {
 		.data = &vec,
@@ -198,6 +203,21 @@ static int ipv4_tcp_mem(ctl_table *ctl, int write,
 	if (ret)
 		return ret;
 
+#ifdef CONFIG_CGROUP_MEM_RES_CTLR_KMEM
+	rcu_read_lock();
+	cg = mem_cgroup_from_task(current);
+	for (i = 0; i < 3; i++)
+		if (vec[i] > tcp_max_memory(cg)) {
+			rcu_read_unlock();
+			return -EINVAL;
+		}
+
+	tcp_prot_mem(cg, vec[0], 0);
+	tcp_prot_mem(cg, vec[1], 1);
+	tcp_prot_mem(cg, vec[2], 2);
+	rcu_read_unlock();
+#endif
+
 	net->ipv4.sysctl_tcp_mem[0] = vec[0];
 	net->ipv4.sysctl_tcp_mem[1] = vec[1];
 	net->ipv4.sysctl_tcp_mem[2] = vec[2];
-- 
1.7.6


^ permalink raw reply related	[flat|nested] 119+ messages in thread

* [PATCH v3 7/7] Display current tcp memory allocation in kmem cgroup
  2011-09-19  0:56 ` Glauber Costa
@ 2011-09-19  0:56   ` Glauber Costa
  -1 siblings, 0 replies; 119+ messages in thread
From: Glauber Costa @ 2011-09-19  0:56 UTC (permalink / raw)
  To: linux-kernel
  Cc: paul, lizf, kamezawa.hiroyu, ebiederm, davem, gthelen, netdev,
	linux-mm, kirill, Glauber Costa

This patch introduces the kmem.tcp.current_memory file, living in the
kmem_cgroup filesystem. It is a simple read-only file that displays the
amount of kernel memory currently consumed by the cgroup.
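
A minimal userspace sketch of reading it (the mount point and cgroup
name here are invented for the example; a freshly created cgroup
reads back 0):

	#include <fcntl.h>
	#include <stdio.h>
	#include <unistd.h>

	int main(void)
	{
		char buf[32] = "";
		int fd = open("/cgroups/memory/g1/"
			      "memory.kmem.tcp.current_memory", O_RDONLY);

		if (fd < 0)
			return 1;
		if (read(fd, buf, sizeof(buf) - 1) > 0)
			printf("tcp buf memory in use: %s", buf);
		close(fd);
		return 0;
	}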

Signed-off-by: Glauber Costa <glommer@parallels.com>
CC: David S. Miller <davem@davemloft.net>
CC: Hiroyuki Kamezawa <kamezawa.hiroyu@jp.fujitsu.com>
CC: Eric W. Biederman <ebiederm@xmission.com>
---
 Documentation/cgroups/memory.txt |    1 +
 mm/memcontrol.c                  |   17 +++++++++++++++++
 2 files changed, 18 insertions(+), 0 deletions(-)

diff --git a/Documentation/cgroups/memory.txt b/Documentation/cgroups/memory.txt
index 1ffde3e..f5a539d 100644
--- a/Documentation/cgroups/memory.txt
+++ b/Documentation/cgroups/memory.txt
@@ -79,6 +79,7 @@ Brief summary of control files.
  memory.independent_kmem_limit	 # select whether or not kernel memory limits are
 				   independent of user limits
  memory.kmem.tcp.max_memory      # set/show hard limit for tcp buf memory
+ memory.kmem.tcp.current_memory  # show current tcp buf memory allocation
 
 1. History
 
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index be5ab89..8c015b0 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -514,12 +514,29 @@ static u64 tcp_read_maxmem(struct cgroup *cgrp, struct cftype *cft)
 	return ret;
 }
 
+static u64 tcp_read_curmem(struct cgroup *cgrp, struct cftype *cft)
+{
+	struct mem_cgroup *sg = mem_cgroup_from_cont(cgrp);
+	u64 ret;
+
+	if (!cgroup_lock_live_group(cgrp))
+		return -ENODEV;
+	ret = atomic_long_read(&sg->tcp_memory_allocated);
+
+	cgroup_unlock();
+	return ret;
+}
+
 static struct cftype tcp_files[] = {
 	{
 		.name = "kmem.tcp.max_memory",
 		.write_u64 = tcp_write_maxmem,
 		.read_u64 = tcp_read_maxmem,
 	},
+	{
+		.name = "kmem.tcp.current_memory",
+		.read_u64 = tcp_read_curmem,
+	},
 };
 
 /*
-- 
1.7.6


^ permalink raw reply related	[flat|nested] 119+ messages in thread

* Re: [PATCH v3 1/7] Basic kernel memory functionality for the Memory Controller
  2011-09-19  0:56   ` Glauber Costa
@ 2011-09-21  2:23     ` Glauber Costa
  -1 siblings, 0 replies; 119+ messages in thread
From: Glauber Costa @ 2011-09-21  2:23 UTC (permalink / raw)
  To: Glauber Costa
  Cc: linux-kernel, paul, lizf, kamezawa.hiroyu, ebiederm, davem,
	gthelen, netdev, linux-mm, kirill

Hi people,

Any insights on this series?
Kame, is it in line with your expectations?

Thank you all

On 09/18/2011 09:56 PM, Glauber Costa wrote:
> This patch lays down the foundation for the kernel memory component
> of the Memory Controller.
>
> As of today, I am only laying down the following files:
>
>   * memory.independent_kmem_limit
>   * memory.kmem.limit_in_bytes (currently ignored)
>   * memory.kmem.usage_in_bytes (always zero)
>
> Signed-off-by: Glauber Costa<glommer@parallels.com>
> CC: Paul Menage<paul@paulmenage.org>
> CC: Greg Thelen<gthelen@google.com>
> ---
>   Documentation/cgroups/memory.txt |   30 +++++++++-
>   init/Kconfig                     |   11 ++++
>   mm/memcontrol.c                  |  115 ++++++++++++++++++++++++++++++++++++--
>   3 files changed, 148 insertions(+), 8 deletions(-)
>
> diff --git a/Documentation/cgroups/memory.txt b/Documentation/cgroups/memory.txt
> index 6f3c598..6f1954a 100644
> --- a/Documentation/cgroups/memory.txt
> +++ b/Documentation/cgroups/memory.txt
> @@ -44,8 +44,9 @@ Features:
>    - oom-killer disable knob and oom-notifier
>    - Root cgroup has no limit controls.
>
> - Kernel memory and Hugepages are not under control yet. We just manage
> - pages on LRU. To add more controls, we have to take care of performance.
> + Hugepages are not under control yet. We just manage pages on LRU. To add more
> + controls, we have to take care of performance. Kernel memory support is work
> + in progress, and the current version provides basic functionality.
>
>   Brief summary of control files.
>
> @@ -56,8 +57,11 @@ Brief summary of control files.
>   				 (See 5.5 for details)
>    memory.memsw.usage_in_bytes	 # show current res_counter usage for memory+Swap
>   				 (See 5.5 for details)
> + memory.kmem.usage_in_bytes	 # show current res_counter usage for kmem only.
> +				 (See 2.7 for details)
>    memory.limit_in_bytes		 # set/show limit of memory usage
>    memory.memsw.limit_in_bytes	 # set/show limit of memory+Swap usage
> + memory.kmem.limit_in_bytes	 # if allowed, set/show limit of kernel memory
>    memory.failcnt			 # show the number of memory usage hits limits
>    memory.memsw.failcnt		 # show the number of memory+Swap hits limits
>    memory.max_usage_in_bytes	 # show max memory usage recorded
> @@ -72,6 +76,9 @@ Brief summary of control files.
>    memory.oom_control		 # set/show oom controls.
>    memory.numa_stat		 # show the number of memory usage per numa node
>
> + memory.independent_kmem_limit	 # select whether or not kernel memory limits are
> +				   independent of user limits
> +
>   1. History
>
>   The memory controller has a long history. A request for comments for the memory
> @@ -255,6 +262,25 @@ When oom event notifier is registered, event will be delivered.
>     per-zone-per-cgroup LRU (cgroup's private LRU) is just guarded by
>     zone->lru_lock, it has no lock of its own.
>
> +2.7 Kernel Memory Extension (CONFIG_CGROUP_MEM_RES_CTLR_KMEM)
> +
> + With the Kernel memory extension, the Memory Controller is able to limit
> +the amount of kernel memory used by the system. Kernel memory is fundamentally
> +different from user memory, since it can't be swapped out, which makes it
> +possible to DoS the system by consuming too much of this precious resource.
> +Kernel memory limits are not imposed for the root cgroup.
> +
> +Memory limits as specified by the standard Memory Controller may or may not
> +take kernel memory into consideration. This is achieved through the file
> +memory.independent_kmem_limit. A value different from 0 allows kernel
> +memory to be controlled separately.
> +
> +When kernel memory limits are not independent, the limit values set in
> +memory.kmem files are ignored.
> +
> +Currently no soft limit is implemented for kernel memory. It is future work
> +to trigger slab reclaim when those limits are reached.
> +
>   3. User Interface
>
>   0. Configuration
> diff --git a/init/Kconfig b/init/Kconfig
> index d627783..49e5839 100644
> --- a/init/Kconfig
> +++ b/init/Kconfig
> @@ -689,6 +689,17 @@ config CGROUP_MEM_RES_CTLR_SWAP_ENABLED
>   	  For those who want to have the feature enabled by default should
>   	  select this option (if, for some reason, they need to disable it
>   	  then swapaccount=0 does the trick).
> +config CGROUP_MEM_RES_CTLR_KMEM
> +	bool "Memory Resource Controller Kernel Memory accounting"
> +	depends on CGROUP_MEM_RES_CTLR
> +	default y
> +	help
> +	  The Kernel Memory extension for Memory Resource Controller can limit
> +	  the amount of memory used by kernel objects in the system. Those are
> +	  fundamentally different from the entities handled by the standard
> +	  Memory Controller, which are page-based, and can be swapped. Users of
> +	  the kmem extension can use it to guarantee that no group of processes
> +	  will ever exhaust kernel resources alone.
>
>   config CGROUP_PERF
>   	bool "Enable perf_event per-cpu per-container group (cgroup) monitoring"
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index ebd1e86..d32e931 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -73,7 +73,11 @@ static int really_do_swap_account __initdata = 0;
>   #define do_swap_account		(0)
>   #endif
>
> -
> +#ifdef CONFIG_CGROUP_MEM_RES_CTLR_KMEM
> +int do_kmem_account __read_mostly = 1;
> +#else
> +#define do_kmem_account		0
> +#endif
>   /*
>    * Statistics for memory cgroup.
>    */
> @@ -270,6 +274,10 @@ struct mem_cgroup {
>   	 */
>   	struct res_counter memsw;
>   	/*
> +	 * the counter to account for kmem usage.
> +	 */
> +	struct res_counter kmem;
> +	/*
>   	 * Per cgroup active and inactive list, similar to the
>   	 * per zone LRU lists.
>   	 */
> @@ -321,6 +329,11 @@ struct mem_cgroup {
>   	 */
>   	unsigned long 	move_charge_at_immigrate;
>   	/*
> +	 * Should kernel memory limits be established independently
> +	 * from user memory?
> +	 */
> +	int		kmem_independent;
> +	/*
>   	 * percpu counter.
>   	 */
>   	struct mem_cgroup_stat_cpu *stat;
> @@ -388,9 +401,14 @@ enum charge_type {
>   };
>
>   /* for encoding cft->private value on file */
> -#define _MEM			(0)
> -#define _MEMSWAP		(1)
> -#define _OOM_TYPE		(2)
> +
> +enum mem_type {
> +	_MEM = 0,
> +	_MEMSWAP,
> +	_OOM_TYPE,
> +	_KMEM,
> +};
> +
>   #define MEMFILE_PRIVATE(x, val)	(((x)<<  16) | (val))
>   #define MEMFILE_TYPE(val)	(((val)>>  16)&  0xffff)
>   #define MEMFILE_ATTR(val)	((val)&  0xffff)
> @@ -3943,10 +3961,15 @@ static inline u64 mem_cgroup_usage(struct mem_cgroup *mem, bool swap)
>   	u64 val;
>
>   	if (!mem_cgroup_is_root(mem)) {
> +		val = 0;
> +		if (!mem->kmem_independent)
> +			val = res_counter_read_u64(&mem->kmem, RES_USAGE);
>   		if (!swap)
> -			return res_counter_read_u64(&mem->res, RES_USAGE);
> +			val += res_counter_read_u64(&mem->res, RES_USAGE);
>   		else
> -			return res_counter_read_u64(&mem->memsw, RES_USAGE);
> +			val += res_counter_read_u64(&mem->memsw, RES_USAGE);
> +
> +		return val;
>   	}
>
>   	val = mem_cgroup_recursive_stat(mem, MEM_CGROUP_STAT_CACHE);
> @@ -3979,6 +4002,10 @@ static u64 mem_cgroup_read(struct cgroup *cont, struct cftype *cft)
>   		else
>   			val = res_counter_read_u64(&mem->memsw, name);
>   		break;
> +	case _KMEM:
> +		val = res_counter_read_u64(&mem->kmem, name);
> +		break;
> +
>   	default:
>   		BUG();
>   		break;
> @@ -4756,6 +4783,21 @@ static int mem_cgroup_reset_vmscan_stat(struct cgroup *cgrp,
>   	return 0;
>   }
>
> +#ifdef CONFIG_CGROUP_MEM_RES_CTLR_KMEM
> +static u64 kmem_limit_independent_read(struct cgroup *cont, struct cftype *cft)
> +{
> +	return mem_cgroup_from_cont(cont)->kmem_independent;
> +}
> +
> +static int kmem_limit_independent_write(struct cgroup *cont, struct cftype *cft,
> +					u64 val)
> +{
> +	cgroup_lock();
> +	mem_cgroup_from_cont(cont)->kmem_independent = !!val;
> +	cgroup_unlock();
> +	return 0;
> +}
> +#endif
>
>   static struct cftype mem_cgroup_files[] = {
>   	{
> @@ -4877,6 +4919,47 @@ static int register_memsw_files(struct cgroup *cont, struct cgroup_subsys *ss)
>   }
>   #endif
>
> +
> +#ifdef CONFIG_CGROUP_MEM_RES_CTLR_KMEM
> +static struct cftype kmem_cgroup_files[] = {
> +	{
> +		.name = "independent_kmem_limit",
> +		.read_u64 = kmem_limit_independent_read,
> +		.write_u64 = kmem_limit_independent_write,
> +	},
> +	{
> +		.name = "kmem.usage_in_bytes",
> +		.private = MEMFILE_PRIVATE(_KMEM, RES_USAGE),
> +		.read_u64 = mem_cgroup_read,
> +	},
> +	{
> +		.name = "kmem.limit_in_bytes",
> +		.private = MEMFILE_PRIVATE(_KMEM, RES_LIMIT),
> +		.read_u64 = mem_cgroup_read,
> +	},
> +};
> +
> +static int register_kmem_files(struct cgroup *cont, struct cgroup_subsys *ss)
> +{
> +	struct mem_cgroup *mem = mem_cgroup_from_cont(cont);
> +	int ret = 0;
> +
> +	if (!do_kmem_account)
> +		return 0;
> +
> +	if (!mem_cgroup_is_root(mem))
> +		ret = cgroup_add_files(cont, ss, kmem_cgroup_files,
> +					ARRAY_SIZE(kmem_cgroup_files));
> +	return ret;
> +};
> +
> +#else
> +static int register_kmem_files(struct cgroup *cont, struct cgroup_subsys *ss)
> +{
> +	return 0;
> +}
> +#endif
> +
>   static int alloc_mem_cgroup_per_zone_info(struct mem_cgroup *mem, int node)
>   {
>   	struct mem_cgroup_per_node *pn;
> @@ -5075,6 +5158,7 @@ mem_cgroup_create(struct cgroup_subsys *ss, struct cgroup *cont)
>   	if (parent&&  parent->use_hierarchy) {
>   		res_counter_init(&mem->res,&parent->res);
>   		res_counter_init(&mem->memsw,&parent->memsw);
> +		res_counter_init(&mem->kmem,&parent->kmem);
>   		/*
>   		 * We increment refcnt of the parent to ensure that we can
>   		 * safely access it on res_counter_charge/uncharge.
> @@ -5085,6 +5169,7 @@ mem_cgroup_create(struct cgroup_subsys *ss, struct cgroup *cont)
>   	} else {
>   		res_counter_init(&mem->res, NULL);
>   		res_counter_init(&mem->memsw, NULL);
> +		res_counter_init(&mem->kmem, NULL);
>   	}
>   	mem->last_scanned_child = 0;
>   	mem->last_scanned_node = MAX_NUMNODES;
> @@ -5129,6 +5214,10 @@ static int mem_cgroup_populate(struct cgroup_subsys *ss,
>
>   	if (!ret)
>   		ret = register_memsw_files(cont, ss);
> +
> +	if (!ret)
> +		ret = register_kmem_files(cont, ss);
> +
>   	return ret;
>   }
>
> @@ -5665,3 +5754,17 @@ static int __init enable_swap_account(char *s)
>   __setup("swapaccount=", enable_swap_account);
>
>   #endif
> +
> +#ifdef CONFIG_CGROUP_MEM_RES_CTLR_KMEM
> +static int __init disable_kmem_account(char *s)
> +{
> +	/* consider enabled if no parameter or 1 is given */
> +	if (!strcmp(s, "1"))
> +		do_kmem_account = 1;
> +	else if (!strcmp(s, "0"))
> +		do_kmem_account = 0;
> +	return 1;
> +}
> +__setup("kmemaccount=", disable_kmem_account);
> +
> +#endif


^ permalink raw reply	[flat|nested] 119+ messages in thread
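
The cft->private encoding the patch relies on is compact but easy to
miss: each control file packs a resource type (_MEM, _MEMSWAP, _KMEM...)
into the high 16 bits and a res_counter attribute into the low 16 bits,
and mem_cgroup_read() unpacks both to pick the right counter. Below is a
minimal user-space sketch of that round trip; the attribute values are
placeholders, not the kernel's actual res_counter enum.

#include <stdio.h>

/* Same packing as the patch: type in the high 16 bits, attribute in
 * the low 16 bits. */
#define MEMFILE_PRIVATE(x, val)	(((x) << 16) | (val))
#define MEMFILE_TYPE(val)	(((val) >> 16) & 0xffff)
#define MEMFILE_ATTR(val)	((val) & 0xffff)

enum mem_type { _MEM = 0, _MEMSWAP, _OOM_TYPE, _KMEM };
enum res_attr { RES_USAGE = 0, RES_LIMIT };	/* illustrative values only */

int main(void)
{
	int priv = MEMFILE_PRIVATE(_KMEM, RES_LIMIT);

	/* mem_cgroup_read() recovers both halves from cft->private */
	printf("type=%d attr=%d\n", MEMFILE_TYPE(priv), MEMFILE_ATTR(priv));
	return 0;
}

This is how one read handler (mem_cgroup_read) can serve
kmem.usage_in_bytes, kmem.limit_in_bytes and the older files alike.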

* Re: [PATCH v3 2/7] socket: initial cgroup code.
  2011-09-19  0:56   ` Glauber Costa
@ 2011-09-21 18:47     ` Greg Thelen
  0 siblings, 0 replies; 119+ messages in thread
From: Greg Thelen @ 2011-09-21 18:47 UTC (permalink / raw)
  To: Glauber Costa
  Cc: linux-kernel, paul, lizf, kamezawa.hiroyu, ebiederm, davem,
	netdev, linux-mm, kirill

On Sun, Sep 18, 2011 at 5:56 PM, Glauber Costa <glommer@parallels.com> wrote:
> We aim to control the amount of kernel memory pinned at any
> time by tcp sockets. To lay the foundations for this work,
> this patch adds a pointer to the kmem_cgroup to the socket
> structure.
>
> Signed-off-by: Glauber Costa <glommer@parallels.com>
> CC: David S. Miller <davem@davemloft.net>
> CC: Hiroyouki Kamezawa <kamezawa.hiroyu@jp.fujitsu.com>
> CC: Eric W. Biederman <ebiederm@xmission.com>
...
> +void sock_update_memcg(struct sock *sk)
> +{
> +       /* right now a socket spends its whole life in the same cgroup */
> +       BUG_ON(sk->sk_cgrp);
> +
> +       rcu_read_lock();
> +       sk->sk_cgrp = mem_cgroup_from_task(current);
> +
> +       /*
> +        * We don't need to protect against anything task-related, because
> +        * we are basically stuck with the sock pointer that won't change,
> +        * even if the task that originated the socket changes cgroups.
> +        *
> +        * What we do have to guarantee, is that the chain leading us to
> +        * the top level won't change under our noses. Incrementing the
> +        * reference count via cgroup_exclude_rmdir guarantees that.
> +        */
> +       cgroup_exclude_rmdir(mem_cgroup_css(sk->sk_cgrp));

This grabs a css_get() reference, which prevents rmdir (will return
-EBUSY).  How long is this reference held?  I wonder about the case
where a process creates a socket in memcg M1 and later is moved into
memcg M2.  At that point an admin would expect to be able to 'rmdir
M1'.  I think this rmdir would return -EBUSY and I suspect it would be
difficult for the admin to understand why the rmdir of M1 failed.  It
seems that to rmdir a memcg, an admin would have to kill all processes
that allocated sockets while in M1.  Such processes may not still be
in M1.

> +       rcu_read_unlock();
> +}

^ permalink raw reply	[flat|nested] 119+ messages in thread
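
Greg's scenario is worth spelling out. Here is a toy user-space model
of the reference counting he describes (all names are illustrative,
not kernel API): each live socket pins the memcg it was created in,
and rmdir keeps failing with -EBUSY until every such reference is
dropped, even after the owning task has moved elsewhere.

#include <errno.h>
#include <stdio.h>

struct toy_memcg { const char *name; int css_refs; };

static void socket_created(struct toy_memcg *cg)   { cg->css_refs++; }
static void socket_destroyed(struct toy_memcg *cg) { cg->css_refs--; }

static int toy_rmdir(struct toy_memcg *cg)
{
	return cg->css_refs ? -EBUSY : 0;
}

int main(void)
{
	struct toy_memcg m1 = { "M1", 0 };

	socket_created(&m1);	/* task opens a socket while in M1 */
	/* ...the task later migrates to M2; the socket still pins M1 */
	printf("rmdir M1 -> %d\n", toy_rmdir(&m1));	/* -EBUSY (-16) */
	socket_destroyed(&m1);	/* socket finally closed */
	printf("rmdir M1 -> %d\n", toy_rmdir(&m1));	/* 0 */
	return 0;
}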

* Re: [PATCH v3 2/7] socket: initial cgroup code.
  2011-09-21 18:47     ` Greg Thelen
@ 2011-09-21 18:59       ` Glauber Costa
  0 siblings, 0 replies; 119+ messages in thread
From: Glauber Costa @ 2011-09-21 18:59 UTC (permalink / raw)
  To: Greg Thelen
  Cc: linux-kernel, paul, lizf, kamezawa.hiroyu, ebiederm, davem,
	netdev, linux-mm, kirill

On 09/21/2011 03:47 PM, Greg Thelen wrote:
> On Sun, Sep 18, 2011 at 5:56 PM, Glauber Costa<glommer@parallels.com>  wrote:
>> We aim to control the amount of kernel memory pinned at any
>> time by tcp sockets. To lay the foundations for this work,
>> this patch adds a pointer to the kmem_cgroup to the socket
>> structure.
>>
>> Signed-off-by: Glauber Costa<glommer@parallels.com>
>> CC: David S. Miller<davem@davemloft.net>
>> CC: Hiroyouki Kamezawa<kamezawa.hiroyu@jp.fujitsu.com>
>> CC: Eric W. Biederman<ebiederm@xmission.com>
> ...
>> +void sock_update_memcg(struct sock *sk)
>> +{
>> +       /* right now a socket spends its whole life in the same cgroup */
>> +       BUG_ON(sk->sk_cgrp);
>> +
>> +       rcu_read_lock();
>> +       sk->sk_cgrp = mem_cgroup_from_task(current);
>> +
>> +       /*
>> +        * We don't need to protect against anything task-related, because
>> +        * we are basically stuck with the sock pointer that won't change,
>> +        * even if the task that originated the socket changes cgroups.
>> +        *
>> +        * What we do have to guarantee, is that the chain leading us to
>> +        * the top level won't change under our noses. Incrementing the
>> +        * reference count via cgroup_exclude_rmdir guarantees that.
>> +        */
>> +       cgroup_exclude_rmdir(mem_cgroup_css(sk->sk_cgrp));
>
> This grabs a css_get() reference, which prevents rmdir (will return
> -EBUSY).
Yes.

>   How long is this reference held?
For the socket lifetime.

> I wonder about the case
> where a process creates a socket in memcg M1 and later is moved into
> memcg M2.  At that point an admin would expect to be able to 'rmdir
> M1'.  I think this rmdir would return -EBUSY and I suspect it would be
> difficult for the admin to understand why the rmdir of M1 failed.  It
> seems that to rmdir a memcg, an admin would have to kill all processes
> that allocated sockets while in M1.  Such processes may not still be
> in M1.
>
>> +       rcu_read_unlock();
>> +}
I agree. But I also don't see many ways around it without
implementing full task migration.

Right now I am working under the assumption that tasks are long-lived
inside the cgroup. Migration potentially introduces some nasty locking 
problems in the mem_schedule path.

Also, unless I am missing something, the memcg already has the policy of
not carrying charges around, probably because of this very same complexity.

True, at least it won't -EBUSY you... But I think this is at least a
way to guarantee that the cgroup under our noses won't disappear in the
middle of our allocations.


^ permalink raw reply	[flat|nested] 119+ messages in thread
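
The "charges stay where they were made" policy Glauber refers to
follows from sock_update_memcg() pinning sk_cgrp once at socket
creation. A toy model of the consequence (all types and names below
are illustrative): after the task migrates, new buffer charges for the
old socket are still billed to the original group.

#include <stdio.h>

struct toy_memcg { const char *name; long charged; };

struct toy_sock { struct toy_memcg *sk_cgrp; };	/* pinned at creation */
struct toy_task { struct toy_memcg *cgroup; };	/* may migrate */

static void charge(struct toy_sock *sk, long bytes)
{
	sk->sk_cgrp->charged += bytes;	/* not the task's current group */
}

int main(void)
{
	struct toy_memcg m1 = { "M1", 0 }, m2 = { "M2", 0 };
	struct toy_task t = { &m1 };
	struct toy_sock sk = { t.cgroup };	/* sock_update_memcg() */

	t.cgroup = &m2;		/* task moves to M2 */
	charge(&sk, 4096);	/* buffer still billed to M1 */
	printf("M1=%ld M2=%ld\n", m1.charged, m2.charged);
	return 0;
}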

* Re: [PATCH v3 1/7] Basic kernel memory functionality for the Memory Controller
  2011-09-21  2:23     ` Glauber Costa
@ 2011-09-22  3:17       ` Balbir Singh
  0 siblings, 0 replies; 119+ messages in thread
From: Balbir Singh @ 2011-09-22  3:17 UTC (permalink / raw)
  To: Glauber Costa
  Cc: linux-kernel, paul, lizf, kamezawa.hiroyu, ebiederm, davem,
	gthelen, netdev, linux-mm, kirill, Ying Han

On Wed, Sep 21, 2011 at 7:53 AM, Glauber Costa <glommer@parallels.com> wrote:
>
> Hi people,
>
> Any insights on this series?
> Kame, is it in line with your expectations?
>
> Thank you all
>
> On 09/18/2011 09:56 PM, Glauber Costa wrote:
>>
>> This patch lays down the foundation for the kernel memory component
>> of the Memory Controller.
>>
>> As of today, I am only laying down the following files:
>>
>>  * memory.independent_kmem_limit
>>  * memory.kmem.limit_in_bytes (currently ignored)
>>  * memory.kmem.usage_in_bytes (always zero)
>>
>> Signed-off-by: Glauber Costa<glommer@parallels.com>
>> CC: Paul Menage<paul@paulmenage.org>
>> CC: Greg Thelen<gthelen@google.com>
>> ---
>>  Documentation/cgroups/memory.txt |   30 +++++++++-
>>  init/Kconfig                     |   11 ++++
>>  mm/memcontrol.c                  |  115 ++++++++++++++++++++++++++++++++++++--
>>  3 files changed, 148 insertions(+), 8 deletions(-)
>>
>> diff --git a/Documentation/cgroups/memory.txt b/Documentation/cgroups/memory.txt
>> index 6f3c598..6f1954a 100644
>> --- a/Documentation/cgroups/memory.txt
>> +++ b/Documentation/cgroups/memory.txt
>> @@ -44,8 +44,9 @@ Features:
>>   - oom-killer disable knob and oom-notifier
>>   - Root cgroup has no limit controls.
>>
>> - Kernel memory and Hugepages are not under control yet. We just manage
>> - pages on LRU. To add more controls, we have to take care of performance.
>> + Hugepages are not under control yet. We just manage pages on LRU. To add more
>> + controls, we have to take care of performance. Kernel memory support is work
>> + in progress, and the current version provides basic functionality.
>>
>>  Brief summary of control files.
>>
>> @@ -56,8 +57,11 @@ Brief summary of control files.
>>                                 (See 5.5 for details)
>>   memory.memsw.usage_in_bytes   # show current res_counter usage for memory+Swap
>>                                 (See 5.5 for details)
>> + memory.kmem.usage_in_bytes     # show current res_counter usage for kmem only.
>> +                                (See 2.7 for details)
>>   memory.limit_in_bytes                 # set/show limit of memory usage
>>   memory.memsw.limit_in_bytes   # set/show limit of memory+Swap usage
>> + memory.kmem.limit_in_bytes     # if allowed, set/show limit of kernel memory
>>   memory.failcnt                        # show the number of memory usage hits limits
>>   memory.memsw.failcnt          # show the number of memory+Swap hits limits
>>   memory.max_usage_in_bytes     # show max memory usage recorded
>> @@ -72,6 +76,9 @@ Brief summary of control files.
>>   memory.oom_control            # set/show oom controls.
>>   memory.numa_stat              # show the number of memory usage per numa node
>>
>> + memory.independent_kmem_limit  # select whether or not kernel memory limits are
>> +                                  independent of user limits
>> +
>>  1. History
>>
>>  The memory controller has a long history. A request for comments for the memory
>> @@ -255,6 +262,25 @@ When oom event notifier is registered, event will be delivered.
>>    per-zone-per-cgroup LRU (cgroup's private LRU) is just guarded by
>>    zone->lru_lock, it has no lock of its own.
>>
>> +2.7 Kernel Memory Extension (CONFIG_CGROUP_MEM_RES_CTLR_KMEM)
>> +
>> + With the Kernel memory extension, the Memory Controller is able to limit
>> +the amount of kernel memory used by the system. Kernel memory is fundamentally
>> +different from user memory, since it can't be swapped out, which makes it
>> +possible to DoS the system by consuming too much of this precious resource.
>> +Kernel memory limits are not imposed for the root cgroup.
>> +
>> +Memory limits as specified by the standard Memory Controller may or may not
>> +take kernel memory into consideration. This is achieved through the file
>> +memory.independent_kmem_limit. A value different from 0 allows kernel
>> +memory to be controlled separately.
>> +
>> +When kernel memory limits are not independent, the limit values set in
>> +memory.kmem files are ignored.
>> +
>> +Currently no soft limit is implemented for kernel memory. It is future work
>> +to trigger slab reclaim when those limits are reached.
>> +

Ying Han was also looking into this (cc'ing her)

>>  3. User Interface
>>
>>  0. Configuration
>> diff --git a/init/Kconfig b/init/Kconfig
>> index d627783..49e5839 100644
>> --- a/init/Kconfig
>> +++ b/init/Kconfig
>> @@ -689,6 +689,17 @@ config CGROUP_MEM_RES_CTLR_SWAP_ENABLED
>>          For those who want to have the feature enabled by default should
>>          select this option (if, for some reason, they need to disable it
>>          then swapaccount=0 does the trick).
>> +config CGROUP_MEM_RES_CTLR_KMEM
>> +       bool "Memory Resource Controller Kernel Memory accounting"
>> +       depends on CGROUP_MEM_RES_CTLR
>> +       default y
>> +       help
>> +         The Kernel Memory extension for Memory Resource Controller can limit
>> +         the amount of memory used by kernel objects in the system. Those are
>> +         fundamentally different from the entities handled by the standard
>> +         Memory Controller, which are page-based, and can be swapped. Users of
>> +         the kmem extension can use it to guarantee that no group of processes
>> +         will ever exhaust kernel resources alone.
>>
>>  config CGROUP_PERF
>>        bool "Enable perf_event per-cpu per-container group (cgroup) monitoring"
>> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
>> index ebd1e86..d32e931 100644
>> --- a/mm/memcontrol.c
>> +++ b/mm/memcontrol.c
>> @@ -73,7 +73,11 @@ static int really_do_swap_account __initdata = 0;
>>  #define do_swap_account               (0)
>>  #endif
>>
>> -
>> +#ifdef CONFIG_CGROUP_MEM_RES_CTLR_KMEM
>> +int do_kmem_account __read_mostly = 1;
>> +#else
>> +#define do_kmem_account                0
>> +#endif
>>  /*
>>   * Statistics for memory cgroup.
>>   */
>> @@ -270,6 +274,10 @@ struct mem_cgroup {
>>         */
>>        struct res_counter memsw;
>>        /*
>> +        * the counter to account for kmem usage.
>> +        */
>> +       struct res_counter kmem;
>> +       /*
>>         * Per cgroup active and inactive list, similar to the
>>         * per zone LRU lists.
>>         */
>> @@ -321,6 +329,11 @@ struct mem_cgroup {
>>         */
>>        unsigned long   move_charge_at_immigrate;
>>        /*
>> +        * Should kernel memory limits be established independently
>> +        * from user memory?
>> +        */
>> +       int             kmem_independent;
>> +       /*
>>         * percpu counter.
>>         */
>>        struct mem_cgroup_stat_cpu *stat;
>> @@ -388,9 +401,14 @@ enum charge_type {
>>  };
>>
>>  /* for encoding cft->private value on file */
>> -#define _MEM                   (0)
>> -#define _MEMSWAP               (1)
>> -#define _OOM_TYPE              (2)
>> +
>> +enum mem_type {
>> +       _MEM = 0,
>> +       _MEMSWAP,
>> +       _OOM_TYPE,
>> +       _KMEM,
>> +};
>> +
>>  #define MEMFILE_PRIVATE(x, val)       (((x)<<  16) | (val))
>>  #define MEMFILE_TYPE(val)     (((val)>>  16)&  0xffff)
>>  #define MEMFILE_ATTR(val)     ((val)&  0xffff)
>> @@ -3943,10 +3961,15 @@ static inline u64 mem_cgroup_usage(struct mem_cgroup *mem, bool swap)
>>        u64 val;
>>
>>        if (!mem_cgroup_is_root(mem)) {
>> +               val = 0;
>> +               if (!mem->kmem_independent)
>> +                       val = res_counter_read_u64(&mem->kmem, RES_USAGE);
>>                if (!swap)
>> -                       return res_counter_read_u64(&mem->res, RES_USAGE);
>> +                       val += res_counter_read_u64(&mem->res, RES_USAGE);
>>                else
>> -                       return res_counter_read_u64(&mem->memsw, RES_USAGE);
>> +                       val += res_counter_read_u64(&mem->memsw, RES_USAGE);
>> +
>> +               return val;
>>        }
>>
>>        val = mem_cgroup_recursive_stat(mem, MEM_CGROUP_STAT_CACHE);
>> @@ -3979,6 +4002,10 @@ static u64 mem_cgroup_read(struct cgroup *cont, struct cftype *cft)
>>                else
>>                        val = res_counter_read_u64(&mem->memsw, name);
>>                break;
>> +       case _KMEM:
>> +               val = res_counter_read_u64(&mem->kmem, name);
>> +               break;
>> +
>>        default:
>>                BUG();
>>                break;
>> @@ -4756,6 +4783,21 @@ static int mem_cgroup_reset_vmscan_stat(struct cgroup *cgrp,
>>        return 0;
>>  }
>>
>> +#ifdef CONFIG_CGROUP_MEM_RES_CTLR_KMEM
>> +static u64 kmem_limit_independent_read(struct cgroup *cont, struct cftype *cft)
>> +{
>> +       return mem_cgroup_from_cont(cont)->kmem_independent;
>> +}
>> +
>> +static int kmem_limit_independent_write(struct cgroup *cont, struct cftype *cft,
>> +                                       u64 val)
>> +{
>> +       cgroup_lock();
>> +       mem_cgroup_from_cont(cont)->kmem_independent = !!val;
>> +       cgroup_unlock();
>> +       return 0;
>> +}

I know we have a lot of pending xxx_from_cont() and struct cgroup
*cont usage; can we move to the memcg notation to be more consistent?
There is a patch to convert the old usage.

>> +#endif
>>
>>  static struct cftype mem_cgroup_files[] = {
>>        {
>> @@ -4877,6 +4919,47 @@ static int register_memsw_files(struct cgroup *cont, struct cgroup_subsys *ss)
>>  }
>>  #endif
>>
>> +
>> +#ifdef CONFIG_CGROUP_MEM_RES_CTLR_KMEM
>> +static struct cftype kmem_cgroup_files[] = {
>> +       {
>> +               .name = "independent_kmem_limit",
>> +               .read_u64 = kmem_limit_independent_read,
>> +               .write_u64 = kmem_limit_independent_write,
>> +       },
>> +       {
>> +               .name = "kmem.usage_in_bytes",
>> +               .private = MEMFILE_PRIVATE(_KMEM, RES_USAGE),
>> +               .read_u64 = mem_cgroup_read,
>> +       },
>> +       {
>> +               .name = "kmem.limit_in_bytes",
>> +               .private = MEMFILE_PRIVATE(_KMEM, RES_LIMIT),
>> +               .read_u64 = mem_cgroup_read,
>> +       },
>> +};
>> +
>> +static int register_kmem_files(struct cgroup *cont, struct cgroup_subsys *ss)
>> +{
>> +       struct mem_cgroup *mem = mem_cgroup_from_cont(cont);
>> +       int ret = 0;
>> +
>> +       if (!do_kmem_account)
>> +               return 0;
>> +
>> +       if (!mem_cgroup_is_root(mem))
>> +               ret = cgroup_add_files(cont, ss, kmem_cgroup_files,
>> +                                       ARRAY_SIZE(kmem_cgroup_files));
>> +       return ret;
>> +};
>> +
>> +#else
>> +static int register_kmem_files(struct cgroup *cont, struct cgroup_subsys *ss)
>> +{
>> +       return 0;
>> +}
>> +#endif
>> +
>>  static int alloc_mem_cgroup_per_zone_info(struct mem_cgroup *mem, int node)
>>  {
>>        struct mem_cgroup_per_node *pn;
>> @@ -5075,6 +5158,7 @@ mem_cgroup_create(struct cgroup_subsys *ss, struct cgroup *cont)
>>        if (parent&&  parent->use_hierarchy) {
>>                res_counter_init(&mem->res,&parent->res);
>>                res_counter_init(&mem->memsw,&parent->memsw);
>> +               res_counter_init(&mem->kmem,&parent->kmem);
>>                /*
>>                 * We increment refcnt of the parent to ensure that we can
>>                 * safely access it on res_counter_charge/uncharge.
>> @@ -5085,6 +5169,7 @@ mem_cgroup_create(struct cgroup_subsys *ss, struct cgroup *cont)
>>        } else {
>>                res_counter_init(&mem->res, NULL);
>>                res_counter_init(&mem->memsw, NULL);
>> +               res_counter_init(&mem->kmem, NULL);
>>        }
>>        mem->last_scanned_child = 0;
>>        mem->last_scanned_node = MAX_NUMNODES;
>> @@ -5129,6 +5214,10 @@ static int mem_cgroup_populate(struct cgroup_subsys *ss,
>>
>>        if (!ret)
>>                ret = register_memsw_files(cont, ss);
>> +
>> +       if (!ret)
>> +               ret = register_kmem_files(cont, ss);
>> +
>>        return ret;
>>  }
>>
>> @@ -5665,3 +5754,17 @@ static int __init enable_swap_account(char *s)
>>  __setup("swapaccount=", enable_swap_account);
>>
>>  #endif
>> +
>> +#ifdef CONFIG_CGROUP_MEM_RES_CTLR_KMEM
>> +static int __init disable_kmem_account(char *s)
>> +{
>> +       /* consider enabled if no parameter or 1 is given */
>> +       if (!strcmp(s, "1"))
>> +               do_kmem_account = 1;
>> +       else if (!strcmp(s, "0"))
>> +               do_kmem_account = 0;
>> +       return 1;
>> +}
>> +__setup("kmemaccount=", disable_kmem_account);
>> +
>> +#endif

The infrastructure looks OK, but we need better integration with
statistics for kmem usage.

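For example (purely illustrative, not in this series), kmem usage could be
folded into the memory.stat callback along these lines:

	/* hypothetical extra memory.stat field, read from the existing kmem counter */
	cb->fill(cb, "kmem_usage", res_counter_read_u64(&mem->kmem, RES_USAGE));
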
Balbir Singh

^ permalink raw reply	[flat|nested] 119+ messages in thread

* Re: [PATCH v3 1/7] Basic kernel memory functionality for the Memory Controller
  2011-09-22  3:17       ` Balbir Singh
  (?)
@ 2011-09-22  3:19         ` Glauber Costa
  -1 siblings, 0 replies; 119+ messages in thread
From: Glauber Costa @ 2011-09-22  3:19 UTC (permalink / raw)
  To: Balbir Singh
  Cc: linux-kernel, paul, lizf, kamezawa.hiroyu, ebiederm, davem,
	gthelen, netdev, linux-mm, kirill, Ying Han

On 09/22/2011 12:17 AM, Balbir Singh wrote:
On Wed, Sep 21, 2011 at 7:53 AM, Glauber Costa <glommer@parallels.com> wrote:
>>
>> Hi people,
>>
>> Any insights on this series?
>> Kame, is it inline with your expectations ?
>>
>> Thank you all
>>
>> On 09/18/2011 09:56 PM, Glauber Costa wrote:
>>>
>>> This patch lays down the foundation for the kernel memory component
>>> of the Memory Controller.
>>>
>>> As of today, I am only laying down the following files:
>>>
>>>   * memory.independent_kmem_limit
>>>   * memory.kmem.limit_in_bytes (currently ignored)
>>>   * memory.kmem.usage_in_bytes (always zero)
>>>
>>> Signed-off-by: Glauber Costa <glommer@parallels.com>
>>> CC: Paul Menage <paul@paulmenage.org>
>>> CC: Greg Thelen <gthelen@google.com>
>>> ---
>>>   Documentation/cgroups/memory.txt |   30 +++++++++-
>>>   init/Kconfig                     |   11 ++++
>>>   mm/memcontrol.c                  |  115 ++++++++++++++++++++++++++++++++++++--
>>>   3 files changed, 148 insertions(+), 8 deletions(-)
>>>
>>> diff --git a/Documentation/cgroups/memory.txt b/Documentation/cgroups/memory.txt
>>> index 6f3c598..6f1954a 100644
>>> --- a/Documentation/cgroups/memory.txt
>>> +++ b/Documentation/cgroups/memory.txt
>>> @@ -44,8 +44,9 @@ Features:
>>>    - oom-killer disable knob and oom-notifier
>>>    - Root cgroup has no limit controls.
>>>
>>> - Kernel memory and Hugepages are not under control yet. We just manage
>>> - pages on LRU. To add more controls, we have to take care of performance.
>>> + Hugepages are not under control yet. We just manage pages on LRU. To add more
>>> + controls, we have to take care of performance. Kernel memory support is work
>>> + in progress, and the current version provides basic functionality.
>>>
>>>   Brief summary of control files.
>>>
>>> @@ -56,8 +57,11 @@ Brief summary of control files.
>>>                                  (See 5.5 for details)
>>>    memory.memsw.usage_in_bytes   # show current res_counter usage for memory+Swap
>>>                                  (See 5.5 for details)
>>> + memory.kmem.usage_in_bytes     # show current res_counter usage for kmem only.
>>> +                                (See 2.7 for details)
>>>    memory.limit_in_bytes                 # set/show limit of memory usage
>>>    memory.memsw.limit_in_bytes   # set/show limit of memory+Swap usage
>>> + memory.kmem.limit_in_bytes     # if allowed, set/show limit of kernel memory
>>>    memory.failcnt                        # show the number of memory usage hits limits
>>>    memory.memsw.failcnt          # show the number of memory+Swap hits limits
>>>    memory.max_usage_in_bytes     # show max memory usage recorded
>>> @@ -72,6 +76,9 @@ Brief summary of control files.
>>>    memory.oom_control            # set/show oom controls.
>>>    memory.numa_stat              # show the number of memory usage per numa node
>>>
>>> + memory.independent_kmem_limit  # select whether or not kernel memory limits are
>>> +                                  independent of user limits
>>> +
>>>   1. History
>>>
>>>   The memory controller has a long history. A request for comments for the memory
>>> @@ -255,6 +262,25 @@ When oom event notifier is registered, event will be delivered.
>>>     per-zone-per-cgroup LRU (cgroup's private LRU) is just guarded by
>>>     zone->lru_lock, it has no lock of its own.
>>>
>>> +2.7 Kernel Memory Extension (CONFIG_CGROUP_MEM_RES_CTLR_KMEM)
>>> +
>>> + With the Kernel memory extension, the Memory Controller is able to limit
>>> +the amount of kernel memory used by the system. Kernel memory is fundamentally
>>> +different from user memory, since it can't be swapped out, which makes it
>>> +possible to DoS the system by consuming too much of this precious resource.
>>> +Kernel memory limits are not imposed for the root cgroup.
>>> +
>>> +Memory limits as specified by the standard Memory Controller may or may not
>>> +take kernel memory into consideration. This is achieved through the file
>>> +memory.independent_kmem_limit. A value different from 0 will allow for kernel
>>> +memory to be controlled separately.
>>> +
>>> +When kernel memory limits are not independent, the limit values set in
>>> +memory.kmem files are ignored.
>>> +
>>> +Currently no soft limit is implemented for kernel memory. It is future work
>>> +to trigger slab reclaim when those limits are reached.
>>> +
>
> Ying Han was also looking into this (cc'ing her)
>
>>>   3. User Interface
>>>
>>>   0. Configuration
>>> diff --git a/init/Kconfig b/init/Kconfig
>>> index d627783..49e5839 100644
>>> --- a/init/Kconfig
>>> +++ b/init/Kconfig
>>> @@ -689,6 +689,17 @@ config CGROUP_MEM_RES_CTLR_SWAP_ENABLED
>>>           For those who want to have the feature enabled by default should
>>>           select this option (if, for some reason, they need to disable it
>>>           then swapaccount=0 does the trick).
>>> +config CGROUP_MEM_RES_CTLR_KMEM
>>> +       bool "Memory Resource Controller Kernel Memory accounting"
>>> +       depends on CGROUP_MEM_RES_CTLR
>>> +       default y
>>> +       help
>>> +         The Kernel Memory extension for the Memory Resource Controller can limit
>>> +         the amount of memory used by kernel objects in the system. Those are
>>> +         fundamentally different from the entities handled by the standard
>>> +         Memory Controller, which are page-based, and can be swapped. Users of
>>> +         the kmem extension can use it to guarantee that no group of processes
>>> +         will ever exhaust kernel resources alone.
>>>
>>>   config CGROUP_PERF
>>>         bool "Enable perf_event per-cpu per-container group (cgroup) monitoring"
>>> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
>>> index ebd1e86..d32e931 100644
>>> --- a/mm/memcontrol.c
>>> +++ b/mm/memcontrol.c
>>> @@ -73,7 +73,11 @@ static int really_do_swap_account __initdata = 0;
>>>   #define do_swap_account               (0)
>>>   #endif
>>>
>>> -
>>> +#ifdef CONFIG_CGROUP_MEM_RES_CTLR_KMEM
>>> +int do_kmem_account __read_mostly = 1;
>>> +#else
>>> +#define do_kmem_account                0
>>> +#endif
>>>   /*
>>>    * Statistics for memory cgroup.
>>>    */
>>> @@ -270,6 +274,10 @@ struct mem_cgroup {
>>>          */
>>>         struct res_counter memsw;
>>>         /*
>>> +        * the counter to account for kmem usage.
>>> +        */
>>> +       struct res_counter kmem;
>>> +       /*
>>>          * Per cgroup active and inactive list, similar to the
>>>          * per zone LRU lists.
>>>          */
>>> @@ -321,6 +329,11 @@ struct mem_cgroup {
>>>          */
>>>         unsigned long   move_charge_at_immigrate;
>>>         /*
>>> +        * Should kernel memory limits be established independently
>>> +        * from user memory?
>>> +        */
>>> +       int             kmem_independent;
>>> +       /*
>>>          * percpu counter.
>>>          */
>>>         struct mem_cgroup_stat_cpu *stat;
>>> @@ -388,9 +401,14 @@ enum charge_type {
>>>   };
>>>
>>>   /* for encoding cft->private value on file */
>>> -#define _MEM                   (0)
>>> -#define _MEMSWAP               (1)
>>> -#define _OOM_TYPE              (2)
>>> +
>>> +enum mem_type {
>>> +       _MEM = 0,
>>> +       _MEMSWAP,
>>> +       _OOM_TYPE,
>>> +       _KMEM,
>>> +};
>>> +
>>>   #define MEMFILE_PRIVATE(x, val)       (((x) << 16) | (val))
>>>   #define MEMFILE_TYPE(val)     (((val) >> 16) & 0xffff)
>>>   #define MEMFILE_ATTR(val)     ((val) & 0xffff)
>>> @@ -3943,10 +3961,15 @@ static inline u64 mem_cgroup_usage(struct mem_cgroup *mem, bool swap)
>>>         u64 val;
>>>
>>>         if (!mem_cgroup_is_root(mem)) {
>>> +               val = 0;
>>> +               if (!mem->kmem_independent)
>>> +                       val = res_counter_read_u64(&mem->kmem, RES_USAGE);
>>>                 if (!swap)
>>> -                       return res_counter_read_u64(&mem->res, RES_USAGE);
>>> +                       val += res_counter_read_u64(&mem->res, RES_USAGE);
>>>                 else
>>> -                       return res_counter_read_u64(&mem->memsw, RES_USAGE);
>>> +                       val += res_counter_read_u64(&mem->memsw, RES_USAGE);
>>> +
>>> +               return val;
>>>         }
>>>
>>>         val = mem_cgroup_recursive_stat(mem, MEM_CGROUP_STAT_CACHE);
>>> @@ -3979,6 +4002,10 @@ static u64 mem_cgroup_read(struct cgroup *cont, struct cftype *cft)
>>>                 else
>>>                         val = res_counter_read_u64(&mem->memsw, name);
>>>                 break;
>>> +       case _KMEM:
>>> +               val = res_counter_read_u64(&mem->kmem, name);
>>> +               break;
>>> +
>>>         default:
>>>                 BUG();
>>>                 break;
>>> @@ -4756,6 +4783,21 @@ static int mem_cgroup_reset_vmscan_stat(struct cgroup *cgrp,
>>>         return 0;
>>>   }
>>>
>>> +#ifdef CONFIG_CGROUP_MEM_RES_CTLR_KMEM
>>> +static u64 kmem_limit_independent_read(struct cgroup *cont, struct cftype *cft)
>>> +{
>>> +       return mem_cgroup_from_cont(cont)->kmem_independent;
>>> +}
>>> +
>>> +static int kmem_limit_independent_write(struct cgroup *cont, struct cftype *cft,
>>> +                                       u64 val)
>>> +{
>>> +       cgroup_lock();
>>> +       mem_cgroup_from_cont(cont)->kmem_independent = !!val;
>>> +       cgroup_unlock();
>>> +       return 0;
>>> +}
>
> I know we have a lot of pending xxx_from_cont() and struct cgroup
> *cont usage; can we move this to the memcg notation, to be more
> consistent with our current usage? There is a patch to convert the old usage.
>
>>> +#endif
>>>
>>>   static struct cftype mem_cgroup_files[] = {
>>>         {
>>> @@ -4877,6 +4919,47 @@ static int register_memsw_files(struct cgroup *cont, struct cgroup_subsys *ss)
>>>   }
>>>   #endif
>>>
>>> +
>>> +#ifdef CONFIG_CGROUP_MEM_RES_CTLR_KMEM
>>> +static struct cftype kmem_cgroup_files[] = {
>>> +       {
>>> +               .name = "independent_kmem_limit",
>>> +               .read_u64 = kmem_limit_independent_read,
>>> +               .write_u64 = kmem_limit_independent_write,
>>> +       },
>>> +       {
>>> +               .name = "kmem.usage_in_bytes",
>>> +               .private = MEMFILE_PRIVATE(_KMEM, RES_USAGE),
>>> +               .read_u64 = mem_cgroup_read,
>>> +       },
>>> +       {
>>> +               .name = "kmem.limit_in_bytes",
>>> +               .private = MEMFILE_PRIVATE(_KMEM, RES_LIMIT),
>>> +               .read_u64 = mem_cgroup_read,
>>> +       },
>>> +};
>>> +
>>> +static int register_kmem_files(struct cgroup *cont, struct cgroup_subsys *ss)
>>> +{
>>> +       struct mem_cgroup *mem = mem_cgroup_from_cont(cont);
>>> +       int ret = 0;
>>> +
>>> +       if (!do_kmem_account)
>>> +               return 0;
>>> +
>>> +       if (!mem_cgroup_is_root(mem))
>>> +               ret = cgroup_add_files(cont, ss, kmem_cgroup_files,
>>> +                                       ARRAY_SIZE(kmem_cgroup_files));
>>> +       return ret;
>>> +}
>>> +
>>> +#else
>>> +static int register_kmem_files(struct cgroup *cont, struct cgroup_subsys *ss)
>>> +{
>>> +       return 0;
>>> +}
>>> +#endif
>>> +
>>>   static int alloc_mem_cgroup_per_zone_info(struct mem_cgroup *mem, int node)
>>>   {
>>>         struct mem_cgroup_per_node *pn;
>>> @@ -5075,6 +5158,7 @@ mem_cgroup_create(struct cgroup_subsys *ss, struct cgroup *cont)
>>>         if (parent && parent->use_hierarchy) {
>>>                 res_counter_init(&mem->res, &parent->res);
>>>                 res_counter_init(&mem->memsw, &parent->memsw);
>>> +               res_counter_init(&mem->kmem, &parent->kmem);
>>>                 /*
>>>                  * We increment refcnt of the parent to ensure that we can
>>>                  * safely access it on res_counter_charge/uncharge.
>>> @@ -5085,6 +5169,7 @@ mem_cgroup_create(struct cgroup_subsys *ss, struct cgroup *cont)
>>>         } else {
>>>                 res_counter_init(&mem->res, NULL);
>>>                 res_counter_init(&mem->memsw, NULL);
>>> +               res_counter_init(&mem->kmem, NULL);
>>>         }
>>>         mem->last_scanned_child = 0;
>>>         mem->last_scanned_node = MAX_NUMNODES;
>>> @@ -5129,6 +5214,10 @@ static int mem_cgroup_populate(struct cgroup_subsys *ss,
>>>
>>>         if (!ret)
>>>                 ret = register_memsw_files(cont, ss);
>>> +
>>> +       if (!ret)
>>> +               ret = register_kmem_files(cont, ss);
>>> +
>>>         return ret;
>>>   }
>>>
>>> @@ -5665,3 +5754,17 @@ static int __init enable_swap_account(char *s)
>>>   __setup("swapaccount=", enable_swap_account);
>>>
>>>   #endif
>>> +
>>> +#ifdef CONFIG_CGROUP_MEM_RES_CTLR_KMEM
>>> +static int __init disable_kmem_account(char *s)
>>> +{
>>> +       /* consider enabled if no parameter or 1 is given */
>>> +       if (!strcmp(s, "1"))
>>> +               do_kmem_account = 1;
>>> +       else if (!strcmp(s, "0"))
>>> +               do_kmem_account = 0;
>>> +       return 1;
>>> +}
>>> +__setup("kmemaccount=", disable_kmem_account);
>>> +
>>> +#endif
>
> The infrastructure looks OK, but we need better integration with
> statistics for kmem usage.
>
> Balbir Singh
Hello Balbir,

Thank you for your comments.
I agree here. With this patch, however, I am only trying to lay down the
foundations needed for the rest of the patches, which touch tcp memory
pressure conditions.



^ permalink raw reply	[flat|nested] 119+ messages in thread

* Re: [PATCH v3 1/7] Basic kernel memory functionality for the Memory Controller
  2011-09-19  0:56   ` Glauber Costa
@ 2011-09-22  5:58     ` Greg Thelen
  -1 siblings, 0 replies; 119+ messages in thread
From: Greg Thelen @ 2011-09-22  5:58 UTC (permalink / raw)
  To: Glauber Costa
  Cc: linux-kernel, paul, lizf, kamezawa.hiroyu, ebiederm, davem,
	netdev, linux-mm, kirill

On Sun, Sep 18, 2011 at 5:56 PM, Glauber Costa <glommer@parallels.com> wrote:
> @@ -270,6 +274,10 @@ struct mem_cgroup {
>         */
>        struct res_counter memsw;
>        /*
> +        * the counter to account for kmem usage.
> +        */
> +       struct res_counter kmem;
> +       /*

I don't see this charged; is it used in a later patch in this series?

> @@ -5665,3 +5754,17 @@ static int __init enable_swap_account(char *s)
>  __setup("swapaccount=", enable_swap_account);
>
>  #endif
> +
> +#ifdef CONFIG_CGROUP_MEM_RES_CTLR_KMEM
> +static int __init disable_kmem_account(char *s)

Minor nit.  To be consistent with the other memcg __setup options, I
think this should be renamed to enable_kmem_account().

> +{
> +       /* consider enabled if no parameter or 1 is given */
> +       if (!strcmp(s, "1"))
> +               do_kmem_account = 1;
> +       else if (!strcmp(s, "0"))
> +               do_kmem_account = 0;
> +       return 1;
> +}
> +__setup("kmemaccount=", disable_kmem_account);
> +
> +#endif

^ permalink raw reply	[flat|nested] 119+ messages in thread

* Re: [PATCH v3 2/7] socket: initial cgroup code.
  2011-09-21 18:59       ` Glauber Costa
@ 2011-09-22  6:00         ` Greg Thelen
  -1 siblings, 0 replies; 119+ messages in thread
From: Greg Thelen @ 2011-09-22  6:00 UTC (permalink / raw)
  To: Glauber Costa
  Cc: linux-kernel, paul, lizf, kamezawa.hiroyu, ebiederm, davem,
	netdev, linux-mm, kirill

On Wed, Sep 21, 2011 at 11:59 AM, Glauber Costa <glommer@parallels.com> wrote:
> Right now I am working under the assumption that tasks are long lived inside
> the cgroup. Migration potentially introduces some nasty locking problems in
> the mem_schedule path.
>
> Also, unless I am missing something, the memcg already has the policy of
> not carrying charges around, probably because of this very same complexity.
>
> True that at least it won't EBUSY you... But I think this is at least a way
> to guarantee that the cgroup under our nose won't disappear in the middle of
> our allocations.

Here's the memcg user page behavior using the same pattern:

1. user page P is allocated by task T in memcg M1
2. T is moved to memcg M2.  The P charge is left behind still charged
to M1 if memory.move_charge_at_immigrate=0; or the charge is moved to
M2 if memory.move_charge_at_immigrate=1.
3. rmdir M1 will try to reclaim P (if P was left in M1).  If unable to
reclaim, then P is recharged to parent(M1).
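
For illustration, step 2's knob can be driven from userspace like this
(a minimal sketch; the cgroupfs mount point /cgroup and the group name
M2 are assumptions, not something from this thread):

#include <stdio.h>

int main(void)
{
        /* enable charge moving for M2 before migrating T into it;
         * writing 0 instead would leave P's charge behind in M1 */
        FILE *f = fopen("/cgroup/M2/memory.move_charge_at_immigrate", "w");

        if (!f)
                return 1;
        fputs("1", f);
        return fclose(f) ? 1 : 0;
}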

^ permalink raw reply	[flat|nested] 119+ messages in thread

* Re: [PATCH v3 6/7] tcp buffer limitation: per-cgroup limit
  2011-09-19  0:56   ` Glauber Costa
@ 2011-09-22  6:01     ` Greg Thelen
  -1 siblings, 0 replies; 119+ messages in thread
From: Greg Thelen @ 2011-09-22  6:01 UTC (permalink / raw)
  To: Glauber Costa
  Cc: linux-kernel, paul, lizf, kamezawa.hiroyu, ebiederm, davem,
	netdev, linux-mm, kirill

On Sun, Sep 18, 2011 at 5:56 PM, Glauber Costa <glommer@parallels.com> wrote:
> +static inline bool mem_cgroup_is_root(struct mem_cgroup *mem)
> +{
> +       return (mem == root_mem_cgroup);
> +}
> +

Why are you adding a copy of mem_cgroup_is_root()?  I see one already
in v3.0.  Was it deleted in a previous patch?

> +static int tcp_write_maxmem(struct cgroup *cgrp, struct cftype *cft, u64 val)
> +{
> +       struct mem_cgroup *sg = mem_cgroup_from_cont(cgrp);
> +       struct mem_cgroup *parent = parent_mem_cgroup(sg);
> +       struct net *net = current->nsproxy->net_ns;
> +       int i;
> +
> +       if (!cgroup_lock_live_group(cgrp))
> +               return -ENODEV;

Why is cgroup_lock_live_group() needed here?  Does it protect updates
to sg->tcp_prot_mem[*]?

> +static u64 tcp_read_maxmem(struct cgroup *cgrp, struct cftype *cft)
> +{
> +       struct mem_cgroup *sg = mem_cgroup_from_cont(cgrp);
> +       u64 ret;
> +
> +       if (!cgroup_lock_live_group(cgrp))
> +               return -ENODEV;

Why is cgroup_lock_live_group() needed here?  Does it protect updates
to sg->tcp_max_memory?

^ permalink raw reply	[flat|nested] 119+ messages in thread

* Re: [PATCH v3 6/7] tcp buffer limitation: per-cgroup limit
  2011-09-22  6:01     ` Greg Thelen
  (?)
@ 2011-09-22  9:58       ` Kirill A. Shutemov
  -1 siblings, 0 replies; 119+ messages in thread
From: Kirill A. Shutemov @ 2011-09-22  9:58 UTC (permalink / raw)
  To: Greg Thelen
  Cc: Glauber Costa, linux-kernel, paul, lizf, kamezawa.hiroyu,
	ebiederm, davem, netdev, linux-mm

On Wed, Sep 21, 2011 at 11:01:46PM -0700, Greg Thelen wrote:
> On Sun, Sep 18, 2011 at 5:56 PM, Glauber Costa <glommer@parallels.com> wrote:
> > +static inline bool mem_cgroup_is_root(struct mem_cgroup *mem)
> > +{
> > +       return (mem == root_mem_cgroup);
> > +}
> > +
> 
> Why are you adding a copy of mem_cgroup_is_root()?  I see one already
> in v3.0.  Was it deleted in a previous patch?

mem_cgroup_is_root() moved up in the file.

-- 
 Kirill A. Shutemov

^ permalink raw reply	[flat|nested] 119+ messages in thread

* Re: [PATCH v3 2/7] socket: initial cgroup code.
  2011-09-22  6:00         ` Greg Thelen
@ 2011-09-22 15:09           ` Balbir Singh
  -1 siblings, 0 replies; 119+ messages in thread
From: Balbir Singh @ 2011-09-22 15:09 UTC (permalink / raw)
  To: Greg Thelen
  Cc: Glauber Costa, linux-kernel, paul, lizf, kamezawa.hiroyu,
	ebiederm, davem, netdev, linux-mm, kirill

On Thu, Sep 22, 2011 at 11:30 AM, Greg Thelen <gthelen@google.com> wrote:
> On Wed, Sep 21, 2011 at 11:59 AM, Glauber Costa <glommer@parallels.com> wrote:
>> Right now I am working under the assumption that tasks are long lived inside
>> the cgroup. Migration potentially introduces some nasty locking problems in
>> the mem_schedule path.
>>
>> Also, unless I am missing something, the memcg already has the policy of
>> not carrying charges around, probably because of this very same complexity.
>>
>> True that at least it won't EBUSY you... But I think this is at least a way
>> to guarantee that the cgroup under our nose won't disappear in the middle of
>> our allocations.
>
> Here's the memcg user page behavior using the same pattern:
>
>> 1. user page P is allocated by task T in memcg M1
> 2. T is moved to memcg M2.  The P charge is left behind still charged
> to M1 if memory.move_charge_at_immigrate=0; or the charge is moved to
> M2 if memory.move_charge_at_immigrate=1.
> 3. rmdir M1 will try to reclaim P (if P was left in M1).  If unable to
> reclaim, then P is recharged to parent(M1).
>

We also have some magic in page_referenced() to remove pages
referenced from different containers. What we do is try not to
penalize a cgroup if another cgroup is referencing this page and the
page under consideration is being reclaimed from the cgroup that
touched it.
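
The check is roughly this (from the page_referenced() vma walk in that
era's mm/rmap.c; treat the exact spelling as approximate):

        /* during per-memcg reclaim, skip references coming from mms
         * outside the memcg being reclaimed, so a page touched by
         * another cgroup does not stay pinned on our LRU */
        if (mem_cont && !mm_match_cgroup(vma->vm_mm, mem_cont))
                continue;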

Balbir Singh

^ permalink raw reply	[flat|nested] 119+ messages in thread

* Re: [PATCH v3 6/7] tcp buffer limitation: per-cgroup limit
  2011-09-22  9:58       ` Kirill A. Shutemov
@ 2011-09-22 15:44         ` Greg Thelen
  -1 siblings, 0 replies; 119+ messages in thread
From: Greg Thelen @ 2011-09-22 15:44 UTC (permalink / raw)
  To: Kirill A. Shutemov
  Cc: Glauber Costa, linux-kernel, paul, lizf, kamezawa.hiroyu,
	ebiederm, davem, netdev, linux-mm

On Thu, Sep 22, 2011 at 2:58 AM, Kirill A. Shutemov
<kirill@shutemov.name> wrote:
> On Wed, Sep 21, 2011 at 11:01:46PM -0700, Greg Thelen wrote:
>> On Sun, Sep 18, 2011 at 5:56 PM, Glauber Costa <glommer@parallels.com> wrote:
>> > +static inline bool mem_cgroup_is_root(struct mem_cgroup *mem)
>> > +{
>> > +       return (mem == root_mem_cgroup);
>> > +}
>> > +
>>
>> Why are you adding a copy of mem_cgroup_is_root()?  I see one already
>> in v3.0.  Was it deleted in a previous patch?
>
> mem_cgroup_is_root() moved up in the file.

Got it.  Thanks.

^ permalink raw reply	[flat|nested] 119+ messages in thread

* Re: [PATCH v3 6/7] tcp buffer limitation: per-cgroup limit
  2011-09-19  0:56   ` Glauber Costa
@ 2011-09-22 23:08     ` Balbir Singh
  -1 siblings, 0 replies; 119+ messages in thread
From: Balbir Singh @ 2011-09-22 23:08 UTC (permalink / raw)
  To: Glauber Costa
  Cc: linux-kernel, paul, lizf, kamezawa.hiroyu, ebiederm, davem,
	gthelen, netdev, linux-mm, kirill

On Mon, Sep 19, 2011 at 6:26 AM, Glauber Costa <glommer@parallels.com> wrote:
> This patch uses the "tcp_max_mem" field of the kmem_cgroup to
> effectively control the amount of kernel memory pinned by a cgroup.
>
> We have to make sure that none of the memory pressure thresholds
> specified in the namespace are bigger than the current cgroup.
>
> Signed-off-by: Glauber Costa <glommer@parallels.com>
> CC: David S. Miller <davem@davemloft.net>
> CC: Hiroyouki Kamezawa <kamezawa.hiroyu@jp.fujitsu.com>
> CC: Eric W. Biederman <ebiederm@xmission.com>
> ---
>  Documentation/cgroups/memory.txt |    1 +
>  include/linux/memcontrol.h       |   10 ++++
>  mm/memcontrol.c                  |   89 +++++++++++++++++++++++++++++++++++---
>  net/ipv4/sysctl_net_ipv4.c       |   20 ++++++++
>  4 files changed, 113 insertions(+), 7 deletions(-)
>
> diff --git a/Documentation/cgroups/memory.txt b/Documentation/cgroups/memory.txt
> index 6f1954a..1ffde3e 100644
> --- a/Documentation/cgroups/memory.txt
> +++ b/Documentation/cgroups/memory.txt
> @@ -78,6 +78,7 @@ Brief summary of control files.
>
>  memory.independent_kmem_limit  # select whether or not kernel memory limits are
>                                   independent of user limits
> + memory.kmem.tcp.max_memory      # set/show hard limit for tcp buf memory
>
>  1. History
>
> diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
> index 6b8c0c0..2df6db8 100644
> --- a/include/linux/memcontrol.h
> +++ b/include/linux/memcontrol.h
> @@ -416,6 +416,9 @@ int tcp_init_cgroup_fill(struct proto *prot, struct cgroup *cgrp,
>                         struct cgroup_subsys *ss);
>  void tcp_destroy_cgroup(struct proto *prot, struct cgroup *cgrp,
>                        struct cgroup_subsys *ss);
> +
> +unsigned long tcp_max_memory(struct mem_cgroup *cg);
> +void tcp_prot_mem(struct mem_cgroup *cg, long val, int idx);
>  #else
>  /* memcontrol includes sockets.h, that includes memcontrol.h ... */
>  static inline void memcg_sock_mem_alloc(struct mem_cgroup *mem,
> @@ -441,6 +444,13 @@ static inline void sock_update_memcg(struct sock *sk)
>  static inline void sock_release_memcg(struct sock *sk)
>  {
>  }
> +static inline unsigned long tcp_max_memory(struct mem_cgroup *cg)
> +{
> +       return 0;
> +}
> +static inline void tcp_prot_mem(struct mem_cgroup *cg, long val, int idx)
> +{
> +}
>  #endif /* CONFIG_CGROUP_MEM_RES_CTLR_KMEM */
>  #endif /* CONFIG_INET */
>  #endif /* _LINUX_MEMCONTROL_H */
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index 5e9b2c7..be5ab89 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -345,6 +345,7 @@ struct mem_cgroup {
>        spinlock_t pcp_counter_lock;
>
>        /* per-cgroup tcp memory pressure knobs */
> +       int tcp_max_memory;

Aren't we better off abstracting this in a different structure?
Including all the tcp parameters in that abstraction and adding that
structure here?
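
Something along these lines, perhaps (a rough sketch of the suggested
grouping, not code from the series; the field names follow the patch):

        struct tcp_memcontrol {
                int tcp_max_memory;
                long tcp_prot_mem[3];
                atomic_long_t tcp_memory_allocated;
                struct percpu_counter tcp_sockets_allocated;
                int tcp_memory_pressure;
        };

so that struct mem_cgroup carries a single member instead of five
loose tcp_* fields.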

>        atomic_long_t tcp_memory_allocated;
>        struct percpu_counter tcp_sockets_allocated;
>        /* those two are read-mostly, leave them at the end */
> @@ -352,6 +353,11 @@ struct mem_cgroup {
>        int tcp_memory_pressure;
>  };
>
> +static inline bool mem_cgroup_is_root(struct mem_cgroup *mem)
> +{
> +       return (mem == root_mem_cgroup);
> +}
> +
>  static struct mem_cgroup *parent_mem_cgroup(struct mem_cgroup *mem);
>  /* Writing them here to avoid exposing memcg's inner layout */
>  #ifdef CONFIG_CGROUP_MEM_RES_CTLR_KMEM
> @@ -466,6 +472,56 @@ struct percpu_counter *sockets_allocated_tcp(struct mem_cgroup *sg)
>        return &sg->tcp_sockets_allocated;
>  }
>
> +static int tcp_write_maxmem(struct cgroup *cgrp, struct cftype *cft, u64 val)
> +{
> +       struct mem_cgroup *sg = mem_cgroup_from_cont(cgrp);

sg? I'd prefer memcg. Does sg stand for socket group?

> +       struct mem_cgroup *parent = parent_mem_cgroup(sg);
> +       struct net *net = current->nsproxy->net_ns;
> +       int i;
> +
> +       if (!cgroup_lock_live_group(cgrp))
> +               return -ENODEV;
> +
> +       /*
> +        * We can't allow more memory than our parents. Since this
> +        * will be tested for all calls, by induction, there is no need
> +        * to test any parent other than our own
> +        * */
> +       if (parent && (val > parent->tcp_max_memory))
> +               val = parent->tcp_max_memory;
> +
> +       sg->tcp_max_memory = val;
> +
> +       for (i = 0; i < 3; i++)
> +               sg->tcp_prot_mem[i]  = min_t(long, val,
> +                                            net->ipv4.sysctl_tcp_mem[i]);
> +
> +       cgroup_unlock();
> +
> +       return 0;
> +}
> +
> +static u64 tcp_read_maxmem(struct cgroup *cgrp, struct cftype *cft)
> +{
> +       struct mem_cgroup *sg = mem_cgroup_from_cont(cgrp);

sg? We generally use memcg as a convention

> +       u64 ret;
> +
> +       if (!cgroup_lock_live_group(cgrp))
> +               return -ENODEV;
> +       ret = sg->tcp_max_memory;
> +
> +       cgroup_unlock();
> +       return ret;
> +}
> +
> +static struct cftype tcp_files[] = {
> +       {
> +               .name = "kmem.tcp.max_memory",
> +               .write_u64 = tcp_write_maxmem,
> +               .read_u64 = tcp_read_maxmem,
> +       },
> +};
> +
>  /*
>  * For ipv6, we only need to fill in the function pointers (can't initialize
>  * things twice). So keep it separated
> @@ -487,8 +543,10 @@ int tcp_init_cgroup(struct proto *prot, struct cgroup *cgrp,
>                    struct cgroup_subsys *ss)
>  {
>        struct mem_cgroup *cg = mem_cgroup_from_cont(cgrp);
> +       struct mem_cgroup *parent = parent_mem_cgroup(cg);
>        unsigned long limit;
>        struct net *net = current->nsproxy->net_ns;
> +       int ret = 0;
>
>        cg->tcp_memory_pressure = 0;
>        atomic_long_set(&cg->tcp_memory_allocated, 0);
> @@ -497,12 +555,25 @@ int tcp_init_cgroup(struct proto *prot, struct cgroup *cgrp,
>        limit = nr_free_buffer_pages() / 8;
>        limit = max(limit, 128UL);
>
> +       if (parent)
> +               cg->tcp_max_memory = parent->tcp_max_memory;
> +       else
> +               cg->tcp_max_memory = limit * 2;
> +
>        cg->tcp_prot_mem[0] = net->ipv4.sysctl_tcp_mem[0];
>        cg->tcp_prot_mem[1] = net->ipv4.sysctl_tcp_mem[1];
>        cg->tcp_prot_mem[2] = net->ipv4.sysctl_tcp_mem[2];
>
>        tcp_init_cgroup_fill(prot, cgrp, ss);
> -       return 0;
> +       /*
> +        * For non-root cgroup, we need to set up all tcp-related variables,
> +        * but to be consistent with the rest of kmem management, we don't
> +        * expose any of the controls
> +        */
> +       if (!mem_cgroup_is_root(cg))
> +               ret = cgroup_add_files(cgrp, ss, tcp_files,
> +                                      ARRAY_SIZE(tcp_files));
> +       return ret;
>  }
>  EXPORT_SYMBOL(tcp_init_cgroup);
>
> @@ -514,6 +585,16 @@ void tcp_destroy_cgroup(struct proto *prot, struct cgroup *cgrp,
>        percpu_counter_destroy(&cg->tcp_sockets_allocated);
>  }
>  EXPORT_SYMBOL(tcp_destroy_cgroup);
> +
> +unsigned long tcp_max_memory(struct mem_cgroup *cg)
> +{
> +       return cg->tcp_max_memory;
> +}
> +
> +void tcp_prot_mem(struct mem_cgroup *cg, long val, int idx)
> +{
> +       cg->tcp_prot_mem[idx] = val;
> +}
>  #endif /* CONFIG_INET */
>  #endif /* CONFIG_CGROUP_MEM_RES_CTLR_KMEM */
>
> @@ -1092,12 +1173,6 @@ static struct mem_cgroup *mem_cgroup_get_next(struct mem_cgroup *iter,
>  #define for_each_mem_cgroup_all(iter) \
>        for_each_mem_cgroup_tree_cond(iter, NULL, true)
>
> -
> -static inline bool mem_cgroup_is_root(struct mem_cgroup *mem)
> -{
> -       return (mem == root_mem_cgroup);
> -}
> -
>  void mem_cgroup_count_vm_event(struct mm_struct *mm, enum vm_event_item idx)
>  {
>        struct mem_cgroup *mem;
> diff --git a/net/ipv4/sysctl_net_ipv4.c b/net/ipv4/sysctl_net_ipv4.c
> index bbd67ab..cdc35f6 100644
> --- a/net/ipv4/sysctl_net_ipv4.c
> +++ b/net/ipv4/sysctl_net_ipv4.c
> @@ -14,6 +14,7 @@
>  #include <linux/init.h>
>  #include <linux/slab.h>
>  #include <linux/nsproxy.h>
> +#include <linux/memcontrol.h>
>  #include <linux/swap.h>
>  #include <net/snmp.h>
>  #include <net/icmp.h>
> @@ -182,6 +183,10 @@ static int ipv4_tcp_mem(ctl_table *ctl, int write,
>        int ret;
>        unsigned long vec[3];
>        struct net *net = current->nsproxy->net_ns;
> +#ifdef CONFIG_CGROUP_MEM_RES_CTLR_KMEM
> +       int i;
> +       struct mem_cgroup *cg;
> +#endif
>
>        ctl_table tmp = {
>                .data = &vec,
> @@ -198,6 +203,21 @@ static int ipv4_tcp_mem(ctl_table *ctl, int write,
>        if (ret)
>                return ret;
>
> +#ifdef CONFIG_CGROUP_MEM_RES_CTLR_KMEM
> +       rcu_read_lock();
> +       cg = mem_cgroup_from_task(current);
> +       for (i = 0; i < 3; i++)
> +               if (vec[i] > tcp_max_memory(cg)) {
> +                       rcu_read_unlock();
> +                       return -EINVAL;
> +               }
> +
> +       tcp_prot_mem(cg, vec[0], 0);
> +       tcp_prot_mem(cg, vec[1], 1);
> +       tcp_prot_mem(cg, vec[2], 2);
> +       rcu_read_unlock();
> +#endif
> +
>        net->ipv4.sysctl_tcp_mem[0] = vec[0];
>        net->ipv4.sysctl_tcp_mem[1] = vec[1];
>        net->ipv4.sysctl_tcp_mem[2] = vec[2];

Balbir Singh

^ permalink raw reply	[flat|nested] 119+ messages in thread

* Re: [PATCH v3 6/7] tcp buffer limitation: per-cgroup limit
  2011-09-22  6:01     ` Greg Thelen
  (?)
@ 2011-09-24 13:30       ` Glauber Costa
  -1 siblings, 0 replies; 119+ messages in thread
From: Glauber Costa @ 2011-09-24 13:30 UTC (permalink / raw)
  To: Greg Thelen
  Cc: linux-kernel, paul, lizf, kamezawa.hiroyu, ebiederm, davem,
	netdev, linux-mm, kirill

On 09/22/2011 03:01 AM, Greg Thelen wrote:
> On Sun, Sep 18, 2011 at 5:56 PM, Glauber Costa <glommer@parallels.com> wrote:
>> +static inline bool mem_cgroup_is_root(struct mem_cgroup *mem)
>> +{
>> +       return (mem == root_mem_cgroup);
>> +}
>> +
>
> Why are you adding a copy of mem_cgroup_is_root()?  I see one already
> in v3.0.  Was it deleted in a previous patch?

Already answered by another good samaritan.

>> +static int tcp_write_maxmem(struct cgroup *cgrp, struct cftype *cft, u64 val)
>> +{
>> +       struct mem_cgroup *sg = mem_cgroup_from_cont(cgrp);
>> +       struct mem_cgroup *parent = parent_mem_cgroup(sg);
>> +       struct net *net = current->nsproxy->net_ns;
>> +       int i;
>> +
>> +       if (!cgroup_lock_live_group(cgrp))
>> +               return -ENODEV;
>
> Why is cgroup_lock_live_group() needed here?  Does it protect updates
> to sg->tcp_prot_mem[*]?
>
>> +static u64 tcp_read_maxmem(struct cgroup *cgrp, struct cftype *cft)
>> +{
>> +       struct mem_cgroup *sg = mem_cgroup_from_cont(cgrp);
>> +       u64 ret;
>> +
>> +       if (!cgroup_lock_live_group(cgrp))
>> +               return -ENODEV;
>
> Why is cgroup_lock_live_group() needed here?  Does it protect updates
> to sg->tcp_max_memory?

No, that is not my understanding. My understanding is that this lock is
needed to protect against the cgroup just disappearing under our nose.

The task reading/writing the file is not necessarily inside the cgroup
(usually it is not...), so the mere fact of having the file open does
not guarantee the cgroup will be kept alive. We could grab the pointer
while the cgroup exists, and by the time we write through it, the
cgroup would be gone.
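
To make the window concrete, this is the pattern (a sketch; the helper
name guarded_write is made up, the calls are the ones the patch uses):

        static int guarded_write(struct cgroup *cgrp, u64 val)
        {
                struct mem_cgroup *memcg = mem_cgroup_from_cont(cgrp);

                /*
                 * Without this, rmdir could complete between the
                 * lookup above and the store below, leaving memcg
                 * pointing at freed memory. cgroup_lock_live_group()
                 * fails once the group is marked dead, and otherwise
                 * holds cgroup_mutex until cgroup_unlock().
                 */
                if (!cgroup_lock_live_group(cgrp))
                        return -ENODEV;
                memcg->tcp_max_memory = val;
                cgroup_unlock();
                return 0;
        }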

Or am I missing something?

^ permalink raw reply	[flat|nested] 119+ messages in thread

* Re: [PATCH v3 2/7] socket: initial cgroup code.
  2011-09-22 15:09           ` Balbir Singh
  (?)
@ 2011-09-24 13:33             ` Glauber Costa
  -1 siblings, 0 replies; 119+ messages in thread
From: Glauber Costa @ 2011-09-24 13:33 UTC (permalink / raw)
  To: Balbir Singh
  Cc: Greg Thelen, linux-kernel, paul, lizf, kamezawa.hiroyu, ebiederm,
	davem, netdev, linux-mm, kirill

On 09/22/2011 12:09 PM, Balbir Singh wrote:
> On Thu, Sep 22, 2011 at 11:30 AM, Greg Thelen <gthelen@google.com> wrote:
>> On Wed, Sep 21, 2011 at 11:59 AM, Glauber Costa <glommer@parallels.com> wrote:
>>> Right now I am working under the assumption that tasks are long lived inside
>>> the cgroup. Migration potentially introduces some nasty locking problems in
>>> the mem_schedule path.
>>>
>>> Also, unless I am missing something, the memcg already has the policy of
>>> not carrying charges around, probably because of this very same complexity.
>>>
>>> True that at least it won't EBUSY you... But I think this is at least a way
>>> to guarantee that the cgroup under our nose won't disappear in the middle of
>>> our allocations.
>>
>> Here's the memcg user page behavior using the same pattern:
>>
>> 1. user page P is allocated by task T in memcg M1
>> 2. T is moved to memcg M2.  The P charge is left behind still charged
>> to M1 if memory.move_charge_at_immigrate=0; or the charge is moved to
>> M2 if memory.move_charge_at_immigrate=1.
>> 3. rmdir M1 will try to reclaim P (if P was left in M1).  If unable to
>> reclaim, then P is recharged to parent(M1).
>>
>
> We also have some magic in page_referenced() to remove pages
> referenced from different containers. What we do is try not to
> penalize a cgroup if another cgroup is referencing this page and the
> page under consideration is being reclaimed from the cgroup that
> touched it.
>
> Balbir Singh
humm... Then we need to keep pointers to: 1) which allocations come
from each socket, and 2) which sockets come from each task. 2 is pretty
easy; 1 may get expensive. I will investigate it now.
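
A sketch of what the bookkeeping for 1) would mean (names are made up;
each charged allocation would need to carry a record like this so the
charge can later follow the task or be reversed):

        struct mem_cgroup;                      /* opaque here */

        struct sock_charge {                    /* one per charged allocation */
                struct mem_cgroup *memcg;       /* cgroup that was billed */
                unsigned long nr_pages;         /* amount billed, in pages */
        };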


^ permalink raw reply	[flat|nested] 119+ messages in thread

* Re: [PATCH v3 6/7] tcp buffer limitation: per-cgroup limit
  2011-09-22 23:08     ` Balbir Singh
  (?)
@ 2011-09-24 13:35       ` Glauber Costa
  -1 siblings, 0 replies; 119+ messages in thread
From: Glauber Costa @ 2011-09-24 13:35 UTC (permalink / raw)
  To: Balbir Singh
  Cc: linux-kernel, paul, lizf, kamezawa.hiroyu, ebiederm, davem,
	gthelen, netdev, linux-mm, kirill

On 09/22/2011 08:08 PM, Balbir Singh wrote:
> On Mon, Sep 19, 2011 at 6:26 AM, Glauber Costa <glommer@parallels.com> wrote:
>> This patch uses the "tcp_max_mem" field of the kmem_cgroup to
>> effectively control the amount of kernel memory pinned by a cgroup.
>>
>> We have to make sure that none of the memory pressure thresholds
>> specified in the namespace are bigger than the current cgroup.
>>
>> Signed-off-by: Glauber Costa <glommer@parallels.com>
>> CC: David S. Miller <davem@davemloft.net>
>> CC: Hiroyouki Kamezawa <kamezawa.hiroyu@jp.fujitsu.com>
>> CC: Eric W. Biederman <ebiederm@xmission.com>
>> ---
>>   Documentation/cgroups/memory.txt |    1 +
>>   include/linux/memcontrol.h       |   10 ++++
>>   mm/memcontrol.c                  |   89 +++++++++++++++++++++++++++++++++++---
>>   net/ipv4/sysctl_net_ipv4.c       |   20 ++++++++
>>   4 files changed, 113 insertions(+), 7 deletions(-)
>>
>> diff --git a/Documentation/cgroups/memory.txt b/Documentation/cgroups/memory.txt
>> index 6f1954a..1ffde3e 100644
>> --- a/Documentation/cgroups/memory.txt
>> +++ b/Documentation/cgroups/memory.txt
>> @@ -78,6 +78,7 @@ Brief summary of control files.
>>
>>   memory.independent_kmem_limit  # select whether or not kernel memory limits are
>>                                    independent of user limits
>> + memory.kmem.tcp.max_memory      # set/show hard limit for tcp buf memory
>>
>>   1. History
>>
>> diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
>> index 6b8c0c0..2df6db8 100644
>> --- a/include/linux/memcontrol.h
>> +++ b/include/linux/memcontrol.h
>> @@ -416,6 +416,9 @@ int tcp_init_cgroup_fill(struct proto *prot, struct cgroup *cgrp,
>>                          struct cgroup_subsys *ss);
>>   void tcp_destroy_cgroup(struct proto *prot, struct cgroup *cgrp,
>>                         struct cgroup_subsys *ss);
>> +
>> +unsigned long tcp_max_memory(struct mem_cgroup *cg);
>> +void tcp_prot_mem(struct mem_cgroup *cg, long val, int idx);
>>   #else
>>   /* memcontrol includes sockets.h, that includes memcontrol.h ... */
>>   static inline void memcg_sock_mem_alloc(struct mem_cgroup *mem,
>> @@ -441,6 +444,13 @@ static inline void sock_update_memcg(struct sock *sk)
>>   static inline void sock_release_memcg(struct sock *sk)
>>   {
>>   }
>> +static inline unsigned long tcp_max_memory(struct mem_cgroup *cg)
>> +{
>> +       return 0;
>> +}
>> +static inline void tcp_prot_mem(struct mem_cgroup *cg, long val, int idx)
>> +{
>> +}
>>   #endif /* CONFIG_CGROUP_MEM_RES_CTLR_KMEM */
>>   #endif /* CONFIG_INET */
>>   #endif /* _LINUX_MEMCONTROL_H */
>> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
>> index 5e9b2c7..be5ab89 100644
>> --- a/mm/memcontrol.c
>> +++ b/mm/memcontrol.c
>> @@ -345,6 +345,7 @@ struct mem_cgroup {
>>         spinlock_t pcp_counter_lock;
>>
>>         /* per-cgroup tcp memory pressure knobs */
>> +       int tcp_max_memory;
>
> Aren't we better off abstracting this in a different structure?
> Including all the tcp parameters in that abstraction and adding that
> structure here?

Humm, I think so, yes.


>>         atomic_long_t tcp_memory_allocated;
>>         struct percpu_counter tcp_sockets_allocated;
>>         /* those two are read-mostly, leave them at the end */
>> @@ -352,6 +353,11 @@ struct mem_cgroup {
>>         int tcp_memory_pressure;
>>   };
>>
>> +static inline bool mem_cgroup_is_root(struct mem_cgroup *mem)
>> +{
>> +       return (mem == root_mem_cgroup);
>> +}
>> +
>>   static struct mem_cgroup *parent_mem_cgroup(struct mem_cgroup *mem);
>>   /* Writing them here to avoid exposing memcg's inner layout */
>>   #ifdef CONFIG_CGROUP_MEM_RES_CTLR_KMEM
>> @@ -466,6 +472,56 @@ struct percpu_counter *sockets_allocated_tcp(struct mem_cgroup *sg)
>>         return &sg->tcp_sockets_allocated;
>>   }
>>
>> +static int tcp_write_maxmem(struct cgroup *cgrp, struct cftype *cft, u64 val)
>> +{
>> +       struct mem_cgroup *sg = mem_cgroup_from_cont(cgrp);
>
> sg? I'd prefer memcg. Does sg stand for socket group?
>
>> +       struct mem_cgroup *parent = parent_mem_cgroup(sg);
>> +       struct net *net = current->nsproxy->net_ns;
>> +       int i;
>> +
>> +       if (!cgroup_lock_live_group(cgrp))
>> +               return -ENODEV;
>> +
>> +       /*
>> +        * We can't allow more memory than our parents. Since this
>> +        * will be tested for all calls, by induction, there is no need
>> +        * to test any parent other than our own
>> +        * */
>> +       if (parent && (val > parent->tcp_max_memory))
>> +               val = parent->tcp_max_memory;
>> +
>> +       sg->tcp_max_memory = val;
>> +
>> +       for (i = 0; i < 3; i++)
>> +               sg->tcp_prot_mem[i]  = min_t(long, val,
>> +                                            net->ipv4.sysctl_tcp_mem[i]);
>> +
>> +       cgroup_unlock();
>> +
>> +       return 0;
>> +}
>> +
>> +static u64 tcp_read_maxmem(struct cgroup *cgrp, struct cftype *cft)
>> +{
>> +       struct mem_cgroup *sg = mem_cgroup_from_cont(cgrp);
>
> sg? We generally use memcg as a convention
>
>> +       u64 ret;
>> +
>> +       if (!cgroup_lock_live_group(cgrp))
>> +               return -ENODEV;
>> +       ret = sg->tcp_max_memory;
>> +
>> +       cgroup_unlock();
>> +       return ret;
>> +}
>> +
>> +static struct cftype tcp_files[] = {
>> +       {
>> +               .name = "kmem.tcp.max_memory",
>> +               .write_u64 = tcp_write_maxmem,
>> +               .read_u64 = tcp_read_maxmem,
>> +       },
>> +};
>> +
>>   /*
>>   * For ipv6, we only need to fill in the function pointers (can't initialize
>>   * things twice). So keep it separated
>> @@ -487,8 +543,10 @@ int tcp_init_cgroup(struct proto *prot, struct cgroup *cgrp,
>>                     struct cgroup_subsys *ss)
>>   {
>>         struct mem_cgroup *cg = mem_cgroup_from_cont(cgrp);
>> +       struct mem_cgroup *parent = parent_mem_cgroup(cg);
>>         unsigned long limit;
>>         struct net *net = current->nsproxy->net_ns;
>> +       int ret = 0;
>>
>>         cg->tcp_memory_pressure = 0;
>>         atomic_long_set(&cg->tcp_memory_allocated, 0);
>> @@ -497,12 +555,25 @@ int tcp_init_cgroup(struct proto *prot, struct cgroup *cgrp,
>>         limit = nr_free_buffer_pages() / 8;
>>         limit = max(limit, 128UL);
>>
>> +       if (parent)
>> +               cg->tcp_max_memory = parent->tcp_max_memory;
>> +       else
>> +               cg->tcp_max_memory = limit * 2;
>> +
>>         cg->tcp_prot_mem[0] = net->ipv4.sysctl_tcp_mem[0];
>>         cg->tcp_prot_mem[1] = net->ipv4.sysctl_tcp_mem[1];
>>         cg->tcp_prot_mem[2] = net->ipv4.sysctl_tcp_mem[2];
>>
>>         tcp_init_cgroup_fill(prot, cgrp, ss);
>> -       return 0;
>> +       /*
>> +        * For non-root cgroup, we need to set up all tcp-related variables,
>> +        * but to be consistent with the rest of kmem management, we don't
>> +        * expose any of the controls
>> +        */
>> +       if (!mem_cgroup_is_root(cg))
>> +               ret = cgroup_add_files(cgrp, ss, tcp_files,
>> +                                      ARRAY_SIZE(tcp_files));
>> +       return ret;
>>   }
>>   EXPORT_SYMBOL(tcp_init_cgroup);
>>
>> @@ -514,6 +585,16 @@ void tcp_destroy_cgroup(struct proto *prot, struct cgroup *cgrp,
>>         percpu_counter_destroy(&cg->tcp_sockets_allocated);
>>   }
>>   EXPORT_SYMBOL(tcp_destroy_cgroup);
>> +
>> +unsigned long tcp_max_memory(struct mem_cgroup *cg)
>> +{
>> +       return cg->tcp_max_memory;
>> +}
>> +
>> +void tcp_prot_mem(struct mem_cgroup *cg, long val, int idx)
>> +{
>> +       cg->tcp_prot_mem[idx] = val;
>> +}
>>   #endif /* CONFIG_INET */
>>   #endif /* CONFIG_CGROUP_MEM_RES_CTLR_KMEM */
>>
>> @@ -1092,12 +1173,6 @@ static struct mem_cgroup *mem_cgroup_get_next(struct mem_cgroup *iter,
>>   #define for_each_mem_cgroup_all(iter) \
>>         for_each_mem_cgroup_tree_cond(iter, NULL, true)
>>
>> -
>> -static inline bool mem_cgroup_is_root(struct mem_cgroup *mem)
>> -{
>> -       return (mem == root_mem_cgroup);
>> -}
>> -
>>   void mem_cgroup_count_vm_event(struct mm_struct *mm, enum vm_event_item idx)
>>   {
>>         struct mem_cgroup *mem;
>> diff --git a/net/ipv4/sysctl_net_ipv4.c b/net/ipv4/sysctl_net_ipv4.c
>> index bbd67ab..cdc35f6 100644
>> --- a/net/ipv4/sysctl_net_ipv4.c
>> +++ b/net/ipv4/sysctl_net_ipv4.c
>> @@ -14,6 +14,7 @@
>>   #include <linux/init.h>
>>   #include <linux/slab.h>
>>   #include <linux/nsproxy.h>
>> +#include <linux/memcontrol.h>
>>   #include <linux/swap.h>
>>   #include <net/snmp.h>
>>   #include <net/icmp.h>
>> @@ -182,6 +183,10 @@ static int ipv4_tcp_mem(ctl_table *ctl, int write,
>>         int ret;
>>         unsigned long vec[3];
>>         struct net *net = current->nsproxy->net_ns;
>> +#ifdef CONFIG_CGROUP_MEM_RES_CTLR_KMEM
>> +       int i;
>> +       struct mem_cgroup *cg;
>> +#endif
>>
>>         ctl_table tmp = {
>>                 .data = &vec,
>> @@ -198,6 +203,21 @@ static int ipv4_tcp_mem(ctl_table *ctl, int write,
>>         if (ret)
>>                 return ret;
>>
>> +#ifdef CONFIG_CGROUP_MEM_RES_CTLR_KMEM
>> +       rcu_read_lock();
>> +       cg = mem_cgroup_from_task(current);
>> +       for (i = 0; i < 3; i++)
>> +               if (vec[i] > tcp_max_memory(cg)) {
>> +                       rcu_read_unlock();
>> +                       return -EINVAL;
>> +               }
>> +
>> +       tcp_prot_mem(cg, vec[0], 0);
>> +       tcp_prot_mem(cg, vec[1], 1);
>> +       tcp_prot_mem(cg, vec[2], 2);
>> +       rcu_read_unlock();
>> +#endif
>> +
>>         net->ipv4.sysctl_tcp_mem[0] = vec[0];
>>         net->ipv4.sysctl_tcp_mem[1] = vec[1];
>>         net->ipv4.sysctl_tcp_mem[2] = vec[2];
>
> Balbir Singh


^ permalink raw reply	[flat|nested] 119+ messages in thread
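
The hierarchy argument in tcp_write_maxmem() above deserves a concrete
illustration: clamping only against the immediate parent is sufficient
because, by induction, every ancestor's value was itself clamped when it
was written. A minimal stand-alone sketch of that invariant (plain
user-space C; the struct and function names here are hypothetical, not
kernel code):

#include <stdio.h>

struct node {
        struct node *parent;
        unsigned long max;
};

/* Mirrors the clamp in tcp_write_maxmem(): consult only the immediate
 * parent. Since every ancestor was set through this same path, its
 * value already respects the rest of the chain. */
static void set_max(struct node *n, unsigned long val)
{
        if (n->parent && val > n->parent->max)
                val = n->parent->max;
        n->max = val;
}

int main(void)
{
        struct node root = { NULL, 100 };
        struct node child = { &root, 0 };

        set_max(&child, 500);           /* asks for 500 ... */
        printf("%lu\n", child.max);     /* ... gets 100 */
        return 0;
}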

* Re: [PATCH v3 2/7] socket: initial cgroup code.
  2011-09-22 15:09           ` Balbir Singh
@ 2011-09-24 13:40             ` Glauber Costa
  -1 siblings, 0 replies; 119+ messages in thread
From: Glauber Costa @ 2011-09-24 13:40 UTC (permalink / raw)
  To: Balbir Singh
  Cc: Greg Thelen, linux-kernel, paul, lizf, kamezawa.hiroyu, ebiederm,
	davem, netdev, linux-mm, kirill

On 09/22/2011 12:09 PM, Balbir Singh wrote:
> On Thu, Sep 22, 2011 at 11:30 AM, Greg Thelen <gthelen@google.com> wrote:
>> On Wed, Sep 21, 2011 at 11:59 AM, Glauber Costa <glommer@parallels.com> wrote:
>>> Right now I am working under the assumption that tasks are long lived inside
>>> the cgroup. Migration potentially introduces some nasty locking problems in
>>> the mem_schedule path.
>>>
>>> Also, unless I am missing something, the memcg already has the policy of
>>> not carrying charges around, probably because of this very same complexity.
>>>
>>> True that at least it won't EBUSY you... But I think this is at least a way
>>> to guarantee that the cgroup under our nose won't disappear in the middle of
>>> our allocations.
>>
>> Here's the memcg user page behavior using the same pattern:
>>
>> 1. user page P is allocated by task T in memcg M1
>> 2. T is moved to memcg M2.  The P charge is left behind still charged
>> to M1 if memory.move_charge_at_immigrate=0; or the charge is moved to
>> M2 if memory.move_charge_at_immigrate=1.
>> 3. rmdir M1 will try to reclaim P (if P was left in M1).  If unable to
>> reclaim, then P is recharged to parent(M1).
>>
>
> We also have some magic in page_referenced() to remove pages
> referenced from different containers. What we do is try not to
> penalize a cgroup if another cgroup is referencing this page and the
> page under consideration is being reclaimed from the cgroup that
> touched it.
>
> Balbir Singh
Btw:

This has the same problem we'll face for any kmem-related memory in the
cgroup: we can't just force reclaim to make the cgroup empty...

^ permalink raw reply	[flat|nested] 119+ messages in thread
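
Greg's three-step sequence above can be driven from user space through
the memcg control files. A hedged sketch in C, assuming a cgroup-v1
memory hierarchy mounted at /sys/fs/cgroup/memory with a group M2
already created (the paths and the pid are illustrative only):

#include <stdio.h>

static int write_str(const char *path, const char *val)
{
        FILE *f = fopen(path, "w");

        if (!f)
                return -1;
        fputs(val, f);
        return fclose(f);
}

int main(void)
{
        /* opt in to moving page charges along with migrating tasks */
        write_str("/sys/fs/cgroup/memory/M2/memory.move_charge_at_immigrate",
                  "1");
        /* migrate task 1234 into M2; its charges follow it */
        write_str("/sys/fs/cgroup/memory/M2/tasks", "1234");
        return 0;
}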

* Re: [PATCH v3 1/7] Basic kernel memory functionality for the Memory Controller
  2011-09-22  3:17       ` Balbir Singh
@ 2011-09-24 14:43         ` Glauber Costa
  -1 siblings, 0 replies; 119+ messages in thread
From: Glauber Costa @ 2011-09-24 14:43 UTC (permalink / raw)
  To: Balbir Singh
  Cc: linux-kernel, paul, lizf, kamezawa.hiroyu, ebiederm, davem,
	gthelen, netdev, linux-mm, kirill, Ying Han

On 09/22/2011 12:17 AM, Balbir Singh wrote:
> On Wed, Sep 21, 2011 at 7:53 AM, Glauber Costa <glommer@parallels.com> wrote:
>>
>> Hi people,
>>
>> Any insights on this series?
>> Kame, is it inline with your expectations ?
>>
>> Thank you all
>>
>> On 09/18/2011 09:56 PM, Glauber Costa wrote:
>>>
>>> This patch lays down the foundation for the kernel memory component
>>> of the Memory Controller.
>>>
>>> As of today, I am only laying down the following files:
>>>
>>>   * memory.independent_kmem_limit
>>>   * memory.kmem.limit_in_bytes (currently ignored)
>>>   * memory.kmem.usage_in_bytes (always zero)
>>>
>>> Signed-off-by: Glauber Costa <glommer@parallels.com>
>>> CC: Paul Menage <paul@paulmenage.org>
>>> CC: Greg Thelen <gthelen@google.com>
>>> ---
>>>   Documentation/cgroups/memory.txt |   30 +++++++++-
>>>   init/Kconfig                     |   11 ++++
>>>   mm/memcontrol.c                  |  115 ++++++++++++++++++++++++++++++++++++--
>>>   3 files changed, 148 insertions(+), 8 deletions(-)
>>>
>>> diff --git a/Documentation/cgroups/memory.txt b/Documentation/cgroups/memory.txt
>>> index 6f3c598..6f1954a 100644
>>> --- a/Documentation/cgroups/memory.txt
>>> +++ b/Documentation/cgroups/memory.txt
>>> @@ -44,8 +44,9 @@ Features:
>>>    - oom-killer disable knob and oom-notifier
>>>    - Root cgroup has no limit controls.
>>>
>>> - Kernel memory and Hugepages are not under control yet. We just manage
>>> - pages on LRU. To add more controls, we have to take care of performance.
>>> + Hugepages are not under control yet. We just manage pages on LRU. To add more
>>> + controls, we have to take care of performance. Kernel memory support is work
>>> + in progress, and the current version provides basic functionality.
>>>
>>>   Brief summary of control files.
>>>
>>> @@ -56,8 +57,11 @@ Brief summary of control files.
>>>                                  (See 5.5 for details)
>>>    memory.memsw.usage_in_bytes   # show current res_counter usage for memory+Swap
>>>                                  (See 5.5 for details)
>>> + memory.kmem.usage_in_bytes     # show current res_counter usage for kmem only.
>>> +                                (See 2.7 for details)
>>>    memory.limit_in_bytes                 # set/show limit of memory usage
>>>    memory.memsw.limit_in_bytes   # set/show limit of memory+Swap usage
>>> + memory.kmem.limit_in_bytes     # if allowed, set/show limit of kernel memory
>>>    memory.failcnt                        # show the number of memory usage hits limits
>>>    memory.memsw.failcnt          # show the number of memory+Swap hits limits
>>>    memory.max_usage_in_bytes     # show max memory usage recorded
>>> @@ -72,6 +76,9 @@ Brief summary of control files.
>>>    memory.oom_control            # set/show oom controls.
>>>    memory.numa_stat              # show the number of memory usage per numa node
>>>
>>> + memory.independent_kmem_limit  # select whether or not kernel memory limits are
>>> +                                  independent of user limits
>>> +
>>>   1. History
>>>
>>>   The memory controller has a long history. A request for comments for the memory
>>> @@ -255,6 +262,25 @@ When oom event notifier is registered, event will be delivered.
>>>     per-zone-per-cgroup LRU (cgroup's private LRU) is just guarded by
>>>     zone->lru_lock, it has no lock of its own.
>>>
>>> +2.7 Kernel Memory Extension (CONFIG_CGROUP_MEM_RES_CTLR_KMEM)
>>> +
>>> + With the Kernel memory extension, the Memory Controller is able to limit
>>> +the amount of kernel memory used by the system. Kernel memory is fundamentally
>>> +different than user memory, since it can't be swapped out, which makes it
>>> +possible to DoS the system by consuming too much of this precious resource.
>>> +Kernel memory limits are not imposed for the root cgroup.
>>> +
>>> +Memory limits as specified by the standard Memory Controller may or may not
>>> +take kernel memory into consideration. This is achieved through the file
>>> +memory.independent_kmem_limit. A value different from 0 will allow for kernel
>>> +memory to be controlled separately.
>>> +
>>> +When kernel memory limits are not independent, the limit values set in
>>> +memory.kmem files are ignored.
>>> +
>>> +Currently no soft limit is implemented for kernel memory. It is future work
>>> +to trigger slab reclaim when those limits are reached.
>>> +
>
> Ying Han was also looking into this (cc'ing her)
>
>>>   3. User Interface
>>>
>>>   0. Configuration
>>> diff --git a/init/Kconfig b/init/Kconfig
>>> index d627783..49e5839 100644
>>> --- a/init/Kconfig
>>> +++ b/init/Kconfig
>>> @@ -689,6 +689,17 @@ config CGROUP_MEM_RES_CTLR_SWAP_ENABLED
>>>           For those who want to have the feature enabled by default should
>>>           select this option (if, for some reason, they need to disable it
>>>           then swapaccount=0 does the trick).
>>> +config CGROUP_MEM_RES_CTLR_KMEM
>>> +       bool "Memory Resource Controller Kernel Memory accounting"
>>> +       depends on CGROUP_MEM_RES_CTLR
>>> +       default y
>>> +       help
>>> +         The Kernel Memory extension for Memory Resource Controller can limit
>>> +         the amount of memory used by kernel objects in the system. Those are
>>> +         fundamentally different from the entities handled by the standard
>>> +         Memory Controller, which are page-based, and can be swapped. Users of
>>> +         the kmem extension can use it to guarantee that no group of processes
>>> +         will ever exhaust kernel resources alone.
>>>
>>>   config CGROUP_PERF
>>>         bool "Enable perf_event per-cpu per-container group (cgroup) monitoring"
>>> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
>>> index ebd1e86..d32e931 100644
>>> --- a/mm/memcontrol.c
>>> +++ b/mm/memcontrol.c
>>> @@ -73,7 +73,11 @@ static int really_do_swap_account __initdata = 0;
>>>   #define do_swap_account               (0)
>>>   #endif
>>>
>>> -
>>> +#ifdef CONFIG_CGROUP_MEM_RES_CTLR_KMEM
>>> +int do_kmem_account __read_mostly = 1;
>>> +#else
>>> +#define do_kmem_account                0
>>> +#endif
>>>   /*
>>>    * Statistics for memory cgroup.
>>>    */
>>> @@ -270,6 +274,10 @@ struct mem_cgroup {
>>>          */
>>>         struct res_counter memsw;
>>>         /*
>>> +        * the counter to account for kmem usage.
>>> +        */
>>> +       struct res_counter kmem;
>>> +       /*
>>>          * Per cgroup active and inactive list, similar to the
>>>          * per zone LRU lists.
>>>          */
>>> @@ -321,6 +329,11 @@ struct mem_cgroup {
>>>          */
>>>         unsigned long   move_charge_at_immigrate;
>>>         /*
>>> +        * Should kernel memory limits be established independently
>>> +        * from user memory ?
>>> +        */
>>> +       int             kmem_independent;
>>> +       /*
>>>          * percpu counter.
>>>          */
>>>         struct mem_cgroup_stat_cpu *stat;
>>> @@ -388,9 +401,14 @@ enum charge_type {
>>>   };
>>>
>>>   /* for encoding cft->private value on file */
>>> -#define _MEM                   (0)
>>> -#define _MEMSWAP               (1)
>>> -#define _OOM_TYPE              (2)
>>> +
>>> +enum mem_type {
>>> +       _MEM = 0,
>>> +       _MEMSWAP,
>>> +       _OOM_TYPE,
>>> +       _KMEM,
>>> +};
>>> +
>>>   #define MEMFILE_PRIVATE(x, val)       (((x) << 16) | (val))
>>>   #define MEMFILE_TYPE(val)     (((val) >> 16) & 0xffff)
>>>   #define MEMFILE_ATTR(val)     ((val) & 0xffff)
>>> @@ -3943,10 +3961,15 @@ static inline u64 mem_cgroup_usage(struct mem_cgroup *mem, bool swap)
>>>         u64 val;
>>>
>>>         if (!mem_cgroup_is_root(mem)) {
>>> +               val = 0;
>>> +               if (!mem->kmem_independent)
>>> +                       val = res_counter_read_u64(&mem->kmem, RES_USAGE);
>>>                 if (!swap)
>>> -                       return res_counter_read_u64(&mem->res, RES_USAGE);
>>> +                       val += res_counter_read_u64(&mem->res, RES_USAGE);
>>>                 else
>>> -                       return res_counter_read_u64(&mem->memsw, RES_USAGE);
>>> +                       val += res_counter_read_u64(&mem->memsw, RES_USAGE);
>>> +
>>> +               return val;
>>>         }
>>>
>>>         val = mem_cgroup_recursive_stat(mem, MEM_CGROUP_STAT_CACHE);
>>> @@ -3979,6 +4002,10 @@ static u64 mem_cgroup_read(struct cgroup *cont, struct cftype *cft)
>>>                 else
>>>                         val = res_counter_read_u64(&mem->memsw, name);
>>>                 break;
>>> +       case _KMEM:
>>> +               val = res_counter_read_u64(&mem->kmem, name);
>>> +               break;
>>> +
>>>         default:
>>>                 BUG();
>>>                 break;
>>> @@ -4756,6 +4783,21 @@ static int mem_cgroup_reset_vmscan_stat(struct cgroup *cgrp,
>>>         return 0;
>>>   }
>>>
>>> +#ifdef CONFIG_CGROUP_MEM_RES_CTLR_KMEM
>>> +static u64 kmem_limit_independent_read(struct cgroup *cont, struct cftype *cft)
>>> +{
>>> +       return mem_cgroup_from_cont(cont)->kmem_independent;
>>> +}
>>> +
>>> +static int kmem_limit_independent_write(struct cgroup *cont, struct cftype *cft,
>>> +                                       u64 val)
>>> +{
>>> +       cgroup_lock();
>>> +       mem_cgroup_from_cont(cont)->kmem_independent = !!val;
>>> +       cgroup_unlock();
>>> +       return 0;
>>> +}
>
> I know we have a lot of pending xxx_from_cont() and struct cgroup
> *cont usage; can we move it to the memcg notation to be more consistent
> with our usage? There is a patch to convert the old usage.
>

Hello Balbir, I missed this comment. What exactly do you propose in this 
patch, since I have to assume that the patch you talk about is not 
applied? Is it just a change to the parameter name that you propose?

Thank you

^ permalink raw reply	[flat|nested] 119+ messages in thread
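
The cft->private encoding quoted above packs a mem_type into the high 16
bits and a RES_* attribute into the low 16, which is how _KMEM can reuse
the generic read path. A stand-alone round-trip sketch (the attribute
value 2 is a stand-in; the real RES_* constants live in res_counter.h):

#include <assert.h>

#define MEMFILE_PRIVATE(x, val) (((x) << 16) | (val))
#define MEMFILE_TYPE(val)       (((val) >> 16) & 0xffff)
#define MEMFILE_ATTR(val)       ((val) & 0xffff)

enum mem_type { _MEM = 0, _MEMSWAP, _OOM_TYPE, _KMEM };

int main(void)
{
        int attr = 2;   /* stand-in for a RES_* attribute value */
        int priv = MEMFILE_PRIVATE(_KMEM, attr);

        assert(MEMFILE_TYPE(priv) == _KMEM);
        assert(MEMFILE_ATTR(priv) == attr);
        return 0;
}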

* Re: [PATCH v3 1/7] Basic kernel memory functionality for the Memory Controller
@ 2011-09-24 14:43         ` Glauber Costa
  0 siblings, 0 replies; 119+ messages in thread
From: Glauber Costa @ 2011-09-24 14:43 UTC (permalink / raw)
  To: Balbir Singh
  Cc: linux-kernel, paul, lizf, kamezawa.hiroyu, ebiederm, davem,
	gthelen, netdev, linux-mm, kirill, Ying Han

On 09/22/2011 12:17 AM, Balbir Singh wrote:
> On Wed, Sep 21, 2011 at 7:53 AM, Glauber Costa<glommer@parallels.com>  wrote:
>>
>> Hi people,
>>
>> Any insights on this series?
>> Kame, is it inline with your expectations ?
>>
>> Thank you all
>>
>> On 09/18/2011 09:56 PM, Glauber Costa wrote:
>>>
>>> This patch lays down the foundation for the kernel memory component
>>> of the Memory Controller.
>>>
>>> As of today, I am only laying down the following files:
>>>
>>>   * memory.independent_kmem_limit
>>>   * memory.kmem.limit_in_bytes (currently ignored)
>>>   * memory.kmem.usage_in_bytes (always zero)
>>>
>>> Signed-off-by: Glauber Costa<glommer@parallels.com>
>>> CC: Paul Menage<paul@paulmenage.org>
>>> CC: Greg Thelen<gthelen@google.com>
>>> ---
>>>   Documentation/cgroups/memory.txt |   30 +++++++++-
>>>   init/Kconfig                     |   11 ++++
>>>   mm/memcontrol.c                  |  115 ++++++++++++++++++++++++++++++++++++--
>>>   3 files changed, 148 insertions(+), 8 deletions(-)
>>>
>>> diff --git a/Documentation/cgroups/memory.txt b/Documentation/cgroups/memory.txt
>>> index 6f3c598..6f1954a 100644
>>> --- a/Documentation/cgroups/memory.txt
>>> +++ b/Documentation/cgroups/memory.txt
>>> @@ -44,8 +44,9 @@ Features:
>>>    - oom-killer disable knob and oom-notifier
>>>    - Root cgroup has no limit controls.
>>>
>>> - Kernel memory and Hugepages are not under control yet. We just manage
>>> - pages on LRU. To add more controls, we have to take care of performance.
>>> + Hugepages is not under control yet. We just manage pages on LRU. To add more
>>> + controls, we have to take care of performance. Kernel memory support is work
>>> + in progress, and the current version provides basically functionality.
>>>
>>>   Brief summary of control files.
>>>
>>> @@ -56,8 +57,11 @@ Brief summary of control files.
>>>                                  (See 5.5 for details)
>>>    memory.memsw.usage_in_bytes   # show current res_counter usage for memory+Swap
>>>                                  (See 5.5 for details)
>>> + memory.kmem.usage_in_bytes     # show current res_counter usage for kmem only.
>>> +                                (See 2.7 for details)
>>>    memory.limit_in_bytes                 # set/show limit of memory usage
>>>    memory.memsw.limit_in_bytes   # set/show limit of memory+Swap usage
>>> + memory.kmem.limit_in_bytes     # if allowed, set/show limit of kernel memory
>>>    memory.failcnt                        # show the number of memory usage hits limits
>>>    memory.memsw.failcnt          # show the number of memory+Swap hits limits
>>>    memory.max_usage_in_bytes     # show max memory usage recorded
>>> @@ -72,6 +76,9 @@ Brief summary of control files.
>>>    memory.oom_control            # set/show oom controls.
>>>    memory.numa_stat              # show the number of memory usage per numa node
>>>
>>> + memory.independent_kmem_limit  # select whether or not kernel memory limits are
>>> +                                  independent of user limits
>>> +
>>>   1. History
>>>
>>>   The memory controller has a long history. A request for comments for the memory
>>> @@ -255,6 +262,25 @@ When oom event notifier is registered, event will be delivered.
>>>     per-zone-per-cgroup LRU (cgroup's private LRU) is just guarded by
>>>     zone->lru_lock, it has no lock of its own.
>>>
>>> +2.7 Kernel Memory Extension (CONFIG_CGROUP_MEM_RES_CTLR_KMEM)
>>> +
>>> + With the Kernel memory extension, the Memory Controller is able to limit
>>> +the amount of kernel memory used by the system. Kernel memory is fundamentally
>>> +different than user memory, since it can't be swapped out, which makes it
>>> +possible to DoS the system by consuming too much of this precious resource.
>>> +Kernel memory limits are not imposed for the root cgroup.
>>> +
>>> +Memory limits as specified by the standard Memory Controller may or may not
>>> +take kernel memory into consideration. This is achieved through the file
>>> +memory.independent_kmem_limit. A Value different than 0 will allow for kernel
>>> +memory to be controlled separately.
>>> +
>>> +When kernel memory limits are not independent, the limit values set in
>>> +memory.kmem files are ignored.
>>> +
>>> +Currently no soft limit is implemented for kernel memory. It is future work
>>> +to trigger slab reclaim when those limits are reached.
>>> +
>
> Ying Han was also looking into this (cc'ing her)
>
>>>   3. User Interface
>>>
>>>   0. Configuration
>>> diff --git a/init/Kconfig b/init/Kconfig
>>> index d627783..49e5839 100644
>>> --- a/init/Kconfig
>>> +++ b/init/Kconfig
>>> @@ -689,6 +689,17 @@ config CGROUP_MEM_RES_CTLR_SWAP_ENABLED
>>>           For those who want to have the feature enabled by default should
>>>           select this option (if, for some reason, they need to disable it
>>>           then swapaccount=0 does the trick).
>>> +config CGROUP_MEM_RES_CTLR_KMEM
>>> +       bool "Memory Resource Controller Kernel Memory accounting"
>>> +       depends on CGROUP_MEM_RES_CTLR
>>> +       default y
>>> +       help
>>> +         The Kernel Memory extension for Memory Resource Controller can limit
>>> +         the amount of memory used by kernel objects in the system. Those are
>>> +         fundamentally different from the entities handled by the standard
>>> +         Memory Controller, which are page-based, and can be swapped. Users of
>>> +         the kmem extension can use it to guarantee that no group of processes
>>> +         will ever exhaust kernel resources alone.
>>>
>>>   config CGROUP_PERF
>>>         bool "Enable perf_event per-cpu per-container group (cgroup) monitoring"
>>> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
>>> index ebd1e86..d32e931 100644
>>> --- a/mm/memcontrol.c
>>> +++ b/mm/memcontrol.c
>>> @@ -73,7 +73,11 @@ static int really_do_swap_account __initdata = 0;
>>>   #define do_swap_account               (0)
>>>   #endif
>>>
>>> -
>>> +#ifdef CONFIG_CGROUP_MEM_RES_CTLR_KMEM
>>> +int do_kmem_account __read_mostly = 1;
>>> +#else
>>> +#define do_kmem_account                0
>>> +#endif
>>>   /*
>>>    * Statistics for memory cgroup.
>>>    */
>>> @@ -270,6 +274,10 @@ struct mem_cgroup {
>>>          */
>>>         struct res_counter memsw;
>>>         /*
>>> +        * the counter to account for kmem usage.
>>> +        */
>>> +       struct res_counter kmem;
>>> +       /*
>>>          * Per cgroup active and inactive list, similar to the
>>>          * per zone LRU lists.
>>>          */
>>> @@ -321,6 +329,11 @@ struct mem_cgroup {
>>>          */
>>>         unsigned long   move_charge_at_immigrate;
>>>         /*
>>> +        * Should kernel memory limits be stabilished independently
>>> +        * from user memory ?
>>> +        */
>>> +       int             kmem_independent;
>>> +       /*
>>>          * percpu counter.
>>>          */
>>>         struct mem_cgroup_stat_cpu *stat;
>>> @@ -388,9 +401,14 @@ enum charge_type {
>>>   };
>>>
>>>   /* for encoding cft->private value on file */
>>> -#define _MEM                   (0)
>>> -#define _MEMSWAP               (1)
>>> -#define _OOM_TYPE              (2)
>>> +
>>> +enum mem_type {
>>> +       _MEM = 0,
>>> +       _MEMSWAP,
>>> +       _OOM_TYPE,
>>> +       _KMEM,
>>> +};
>>> +
>>>   #define MEMFILE_PRIVATE(x, val)       (((x)<<    16) | (val))
>>>   #define MEMFILE_TYPE(val)     (((val)>>    16)&    0xffff)
>>>   #define MEMFILE_ATTR(val)     ((val)&    0xffff)
>>> @@ -3943,10 +3961,15 @@ static inline u64 mem_cgroup_usage(struct mem_cgroup *mem, bool swap)
>>>         u64 val;
>>>
>>>         if (!mem_cgroup_is_root(mem)) {
>>> +               val = 0;
>>> +               if (!mem->kmem_independent)
>>> +                       val = res_counter_read_u64(&mem->kmem, RES_USAGE);
>>>                 if (!swap)
>>> -                       return res_counter_read_u64(&mem->res, RES_USAGE);
>>> +                       val += res_counter_read_u64(&mem->res, RES_USAGE);
>>>                 else
>>> -                       return res_counter_read_u64(&mem->memsw, RES_USAGE);
>>> +                       val += res_counter_read_u64(&mem->memsw, RES_USAGE);
>>> +
>>> +               return val;
>>>         }
>>>
>>>         val = mem_cgroup_recursive_stat(mem, MEM_CGROUP_STAT_CACHE);
>>> @@ -3979,6 +4002,10 @@ static u64 mem_cgroup_read(struct cgroup *cont, struct cftype *cft)
>>>                 else
>>>                         val = res_counter_read_u64(&mem->memsw, name);
>>>                 break;
>>> +       case _KMEM:
>>> +               val = res_counter_read_u64(&mem->kmem, name);
>>> +               break;
>>> +
>>>         default:
>>>                 BUG();
>>>                 break;
>>> @@ -4756,6 +4783,21 @@ static int mem_cgroup_reset_vmscan_stat(struct cgroup *cgrp,
>>>         return 0;
>>>   }
>>>
>>> +#ifdef CONFIG_CGROUP_MEM_RES_CTLR_KMEM
>>> +static u64 kmem_limit_independent_read(struct cgroup *cont, struct cftype *cft)
>>> +{
>>> +       return mem_cgroup_from_cont(cont)->kmem_independent;
>>> +}
>>> +
>>> +static int kmem_limit_independent_write(struct cgroup *cont, struct cftype *cft,
>>> +                                       u64 val)
>>> +{
>>> +       cgroup_lock();
>>> +       mem_cgroup_from_cont(cont)->kmem_independent = !!val;
>>> +       cgroup_unlock();
>>> +       return 0;
>>> +}
>
> I know we have a lot of pending xxx_from_cont() and struct cgroup
> *cont, can we move it to memcg notation to be more consistent with our
> usage. There is a patch to convert old usage
>

Hello Balbir, I missed this comment. What exactly do you propose in this 
patch, since I have to assume that the patch you talk about is not 
applied? Is it just a change to the parameter name that you propose?

Thank you

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Fight unfair telecom internet charges in Canada: sign http://stopthemeter.ca/
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 119+ messages in thread

* Re: [PATCH v3 1/7] Basic kernel memory functionality for the Memory Controller
@ 2011-09-24 14:43         ` Glauber Costa
  0 siblings, 0 replies; 119+ messages in thread
From: Glauber Costa @ 2011-09-24 14:43 UTC (permalink / raw)
  To: Balbir Singh
  Cc: linux-kernel, paul, lizf, kamezawa.hiroyu, ebiederm, davem,
	gthelen, netdev, linux-mm, kirill, Ying Han

On 09/22/2011 12:17 AM, Balbir Singh wrote:
> On Wed, Sep 21, 2011 at 7:53 AM, Glauber Costa<glommer@parallels.com>  wrote:
>>
>> Hi people,
>>
>> Any insights on this series?
>> Kame, is it inline with your expectations ?
>>
>> Thank you all
>>
>> On 09/18/2011 09:56 PM, Glauber Costa wrote:
>>>
>>> This patch lays down the foundation for the kernel memory component
>>> of the Memory Controller.
>>>
>>> As of today, I am only laying down the following files:
>>>
>>>   * memory.independent_kmem_limit
>>>   * memory.kmem.limit_in_bytes (currently ignored)
>>>   * memory.kmem.usage_in_bytes (always zero)
>>>
>>> Signed-off-by: Glauber Costa<glommer@parallels.com>
>>> CC: Paul Menage<paul@paulmenage.org>
>>> CC: Greg Thelen<gthelen@google.com>
>>> ---
>>>   Documentation/cgroups/memory.txt |   30 +++++++++-
>>>   init/Kconfig                     |   11 ++++
>>>   mm/memcontrol.c                  |  115 ++++++++++++++++++++++++++++++++++++--
>>>   3 files changed, 148 insertions(+), 8 deletions(-)
>>>
>>> diff --git a/Documentation/cgroups/memory.txt b/Documentation/cgroups/memory.txt
>>> index 6f3c598..6f1954a 100644
>>> --- a/Documentation/cgroups/memory.txt
>>> +++ b/Documentation/cgroups/memory.txt
>>> @@ -44,8 +44,9 @@ Features:
>>>    - oom-killer disable knob and oom-notifier
>>>    - Root cgroup has no limit controls.
>>>
>>> - Kernel memory and Hugepages are not under control yet. We just manage
>>> - pages on LRU. To add more controls, we have to take care of performance.
>>> + Hugepages is not under control yet. We just manage pages on LRU. To add more
>>> + controls, we have to take care of performance. Kernel memory support is work
>>> + in progress, and the current version provides basically functionality.
>>>
>>>   Brief summary of control files.
>>>
>>> @@ -56,8 +57,11 @@ Brief summary of control files.
>>>                                  (See 5.5 for details)
>>>    memory.memsw.usage_in_bytes   # show current res_counter usage for memory+Swap
>>>                                  (See 5.5 for details)
>>> + memory.kmem.usage_in_bytes     # show current res_counter usage for kmem only.
>>> +                                (See 2.7 for details)
>>>    memory.limit_in_bytes                 # set/show limit of memory usage
>>>    memory.memsw.limit_in_bytes   # set/show limit of memory+Swap usage
>>> + memory.kmem.limit_in_bytes     # if allowed, set/show limit of kernel memory
>>>    memory.failcnt                        # show the number of memory usage hits limits
>>>    memory.memsw.failcnt          # show the number of memory+Swap hits limits
>>>    memory.max_usage_in_bytes     # show max memory usage recorded
>>> @@ -72,6 +76,9 @@ Brief summary of control files.
>>>    memory.oom_control            # set/show oom controls.
>>>    memory.numa_stat              # show the number of memory usage per numa node
>>>
>>> + memory.independent_kmem_limit  # select whether or not kernel memory limits are
>>> +                                  independent of user limits
>>> +
>>>   1. History
>>>
>>>   The memory controller has a long history. A request for comments for the memory
>>> @@ -255,6 +262,25 @@ When oom event notifier is registered, event will be delivered.
>>>     per-zone-per-cgroup LRU (cgroup's private LRU) is just guarded by
>>>     zone->lru_lock, it has no lock of its own.
>>>
>>> +2.7 Kernel Memory Extension (CONFIG_CGROUP_MEM_RES_CTLR_KMEM)
>>> +
>>> + With the Kernel memory extension, the Memory Controller is able to limit
>>> +the amount of kernel memory used by the system. Kernel memory is fundamentally
>>> +different than user memory, since it can't be swapped out, which makes it
>>> +possible to DoS the system by consuming too much of this precious resource.
>>> +Kernel memory limits are not imposed for the root cgroup.
>>> +
>>> +Memory limits as specified by the standard Memory Controller may or may not
>>> +take kernel memory into consideration. This is achieved through the file
>>> +memory.independent_kmem_limit. A value different from 0 will allow for kernel
>>> +memory to be controlled separately.
>>> +
>>> +When kernel memory limits are not independent, the limit values set in
>>> +memory.kmem files are ignored.
>>> +
>>> +Currently no soft limit is implemented for kernel memory. It is future work
>>> +to trigger slab reclaim when those limits are reached.
>>> +
>
> Ying Han was also looking into this (cc'ing her)
>
>>>   3. User Interface
>>>
>>>   0. Configuration
>>> diff --git a/init/Kconfig b/init/Kconfig
>>> index d627783..49e5839 100644
>>> --- a/init/Kconfig
>>> +++ b/init/Kconfig
>>> @@ -689,6 +689,17 @@ config CGROUP_MEM_RES_CTLR_SWAP_ENABLED
>>>           For those who want to have the feature enabled by default should
>>>           select this option (if, for some reason, they need to disable it
>>>           then swapaccount=0 does the trick).
>>> +config CGROUP_MEM_RES_CTLR_KMEM
>>> +       bool "Memory Resource Controller Kernel Memory accounting"
>>> +       depends on CGROUP_MEM_RES_CTLR
>>> +       default y
>>> +       help
>>> +         The Kernel Memory extension for Memory Resource Controller can limit
>>> +         the amount of memory used by kernel objects in the system. Those are
>>> +         fundamentally different from the entities handled by the standard
>>> +         Memory Controller, which are page-based, and can be swapped. Users of
>>> +         the kmem extension can use it to guarantee that no group of processes
>>> +         will ever exhaust kernel resources alone.
>>>
>>>   config CGROUP_PERF
>>>         bool "Enable perf_event per-cpu per-container group (cgroup) monitoring"
>>> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
>>> index ebd1e86..d32e931 100644
>>> --- a/mm/memcontrol.c
>>> +++ b/mm/memcontrol.c
>>> @@ -73,7 +73,11 @@ static int really_do_swap_account __initdata = 0;
>>>   #define do_swap_account               (0)
>>>   #endif
>>>
>>> -
>>> +#ifdef CONFIG_CGROUP_MEM_RES_CTLR_KMEM
>>> +int do_kmem_account __read_mostly = 1;
>>> +#else
>>> +#define do_kmem_account                0
>>> +#endif
>>>   /*
>>>    * Statistics for memory cgroup.
>>>    */
>>> @@ -270,6 +274,10 @@ struct mem_cgroup {
>>>          */
>>>         struct res_counter memsw;
>>>         /*
>>> +        * the counter to account for kmem usage.
>>> +        */
>>> +       struct res_counter kmem;
>>> +       /*
>>>          * Per cgroup active and inactive list, similar to the
>>>          * per zone LRU lists.
>>>          */
>>> @@ -321,6 +329,11 @@ struct mem_cgroup {
>>>          */
>>>         unsigned long   move_charge_at_immigrate;
>>>         /*
>>> +        * Should kernel memory limits be established independently
>>> +        * from user memory?
>>> +        */
>>> +       int             kmem_independent;
>>> +       /*
>>>          * percpu counter.
>>>          */
>>>         struct mem_cgroup_stat_cpu *stat;
>>> @@ -388,9 +401,14 @@ enum charge_type {
>>>   };
>>>
>>>   /* for encoding cft->private value on file */
>>> -#define _MEM                   (0)
>>> -#define _MEMSWAP               (1)
>>> -#define _OOM_TYPE              (2)
>>> +
>>> +enum mem_type {
>>> +       _MEM = 0,
>>> +       _MEMSWAP,
>>> +       _OOM_TYPE,
>>> +       _KMEM,
>>> +};
>>> +
>>>   #define MEMFILE_PRIVATE(x, val)       (((x) << 16) | (val))
>>>   #define MEMFILE_TYPE(val)     (((val) >> 16) & 0xffff)
>>>   #define MEMFILE_ATTR(val)     ((val) & 0xffff)
>>> @@ -3943,10 +3961,15 @@ static inline u64 mem_cgroup_usage(struct mem_cgroup *mem, bool swap)
>>>         u64 val;
>>>
>>>         if (!mem_cgroup_is_root(mem)) {
>>> +               val = 0;
>>> +               if (!mem->kmem_independent)
>>> +                       val = res_counter_read_u64(&mem->kmem, RES_USAGE);
>>>                 if (!swap)
>>> -                       return res_counter_read_u64(&mem->res, RES_USAGE);
>>> +                       val += res_counter_read_u64(&mem->res, RES_USAGE);
>>>                 else
>>> -                       return res_counter_read_u64(&mem->memsw, RES_USAGE);
>>> +                       val += res_counter_read_u64(&mem->memsw, RES_USAGE);
>>> +
>>> +               return val;
>>>         }
>>>
>>>         val = mem_cgroup_recursive_stat(mem, MEM_CGROUP_STAT_CACHE);
>>> @@ -3979,6 +4002,10 @@ static u64 mem_cgroup_read(struct cgroup *cont, struct cftype *cft)
>>>                 else
>>>                         val = res_counter_read_u64(&mem->memsw, name);
>>>                 break;
>>> +       case _KMEM:
>>> +               val = res_counter_read_u64(&mem->kmem, name);
>>> +               break;
>>> +
>>>         default:
>>>                 BUG();
>>>                 break;
>>> @@ -4756,6 +4783,21 @@ static int mem_cgroup_reset_vmscan_stat(struct cgroup *cgrp,
>>>         return 0;
>>>   }
>>>
>>> +#ifdef CONFIG_CGROUP_MEM_RES_CTLR_KMEM
>>> +static u64 kmem_limit_independent_read(struct cgroup *cont, struct cftype *cft)
>>> +{
>>> +       return mem_cgroup_from_cont(cont)->kmem_independent;
>>> +}
>>> +
>>> +static int kmem_limit_independent_write(struct cgroup *cont, struct cftype *cft,
>>> +                                       u64 val)
>>> +{
>>> +       cgroup_lock();
>>> +       mem_cgroup_from_cont(cont)->kmem_independent = !!val;
>>> +       cgroup_unlock();
>>> +       return 0;
>>> +}
>
> I know we have a lot of pending xxx_from_cont() and struct cgroup
> *cont usage; can we move it to the memcg notation to be more consistent
> with our usage? There is a patch to convert the old usage.
>

Hello Balbir, I missed this comment. What exactly do you propose for this
patch, given that I have to assume the patch you mention is not yet
applied? Is it just a rename of the parameter that you propose?

Thank you


^ permalink raw reply	[flat|nested] 119+ messages in thread

* Re: [PATCH v3 2/7] socket: initial cgroup code.
  2011-09-22 15:09           ` Balbir Singh
@ 2011-09-24 14:45             ` Glauber Costa
  -1 siblings, 0 replies; 119+ messages in thread
From: Glauber Costa @ 2011-09-24 14:45 UTC (permalink / raw)
  To: Balbir Singh
  Cc: Greg Thelen, linux-kernel, paul, lizf, kamezawa.hiroyu, ebiederm,
	davem, netdev, linux-mm, kirill

On 09/22/2011 12:09 PM, Balbir Singh wrote:
> On Thu, Sep 22, 2011 at 11:30 AM, Greg Thelen<gthelen@google.com>  wrote:
>> On Wed, Sep 21, 2011 at 11:59 AM, Glauber Costa<glommer@parallels.com>  wrote:
>>> Right now I am working under the assumption that tasks are long lived inside
>>> the cgroup. Migration potentially introduces some nasty locking problems in
>>> the mem_schedule path.
>>>
>>> Also, unless I am missing something, the memcg already has the policy of
>>> not carrying charges around, probably because of this very same complexity.
>>>
>>> True that at least it won't EBUSY you... But I think this is at least a way
>>> to guarantee that the cgroup under our nose won't disappear in the middle of
>>> our allocations.
>>
>> Here's the memcg user page behavior using the same pattern:
>>
>> 1. user page P is allocated by task T in memcg M1
>> 2. T is moved to memcg M2.  The P charge is left behind still charged
>> to M1 if memory.move_charge_at_immigrate=0; or the charge is moved to
>> M2 if memory.move_charge_at_immigrate=1.
>> 3. rmdir M1 will try to reclaim P (if P was left in M1).  If unable to
>> reclaim, then P is recharged to parent(M1).
>>
>
> We also have some magic in page_referenced() to remove pages
> referenced from different containers. What we do is try not to
> penalize a cgroup if another cgroup is referencing this page and the
> page under consideration is being reclaimed from the cgroup that
> touched it.
>
> Balbir Singh
Do you guys see it as a showstopper for this series to be merged, or can
we just TODO it?

I can push a proposal for it, but it would be done in a separate patch
anyway. Also, we may be in a better position to fix this when the slab
part is merged - since it will likely have the same problems...



^ permalink raw reply	[flat|nested] 119+ messages in thread

* Re: [PATCH v3 6/7] tcp buffer limitation: per-cgroup limit
  2011-09-19  0:56   ` Glauber Costa
@ 2011-09-24 16:58     ` Andi Kleen
  -1 siblings, 0 replies; 119+ messages in thread
From: Andi Kleen @ 2011-09-24 16:58 UTC (permalink / raw)
  To: Glauber Costa
  Cc: linux-kernel, paul, lizf, kamezawa.hiroyu, ebiederm, davem,
	gthelen, netdev, linux-mm, kirill

Glauber Costa <glommer@parallels.com> writes:

> This patch uses the "tcp_max_mem" field of the kmem_cgroup to
> effectively control the amount of kernel memory pinned by a cgroup.
>
> We have to make sure that none of the memory pressure thresholds
> specified in the namespace are bigger than the current cgroup.

I noticed that some other OSes known by bash seem to have an rlimit per
process for this. Would that make sense too? Not sure how difficult
your infrastructure would be to extend to that.

-Andi

-- 
ak@linux.intel.com -- Speaking for myself only

^ permalink raw reply	[flat|nested] 119+ messages in thread

* Re: [PATCH v3 6/7] tcp buffer limitation: per-cgroup limit
  2011-09-24 16:58     ` Andi Kleen
@ 2011-09-24 17:27       ` Glauber Costa
  -1 siblings, 0 replies; 119+ messages in thread
From: Glauber Costa @ 2011-09-24 17:27 UTC (permalink / raw)
  To: Andi Kleen
  Cc: linux-kernel, paul, lizf, kamezawa.hiroyu, ebiederm, davem,
	gthelen, netdev, linux-mm, kirill

On 09/24/2011 01:58 PM, Andi Kleen wrote:
> Glauber Costa<glommer@parallels.com>  writes:
>
>> This patch uses the "tcp_max_mem" field of the kmem_cgroup to
>> effectively control the amount of kernel memory pinned by a cgroup.
>>
>> We have to make sure that none of the memory pressure thresholds
>> specified in the namespace are bigger than the current cgroup.
>
> I noticed that some other OS known by bash seem to have a rlimit per
> process for this. Would that make sense too? Not sure how difficult
> your infrastructure would be to extend to that.
>
> -Andi
>
Well, not that hard, I believe.

And given the benchmarks I've run in this iteration, I think it wouldn't
be much of a performance impact either. We just need to account it
to a task whenever we account it to a control group. Now that the
functions where accounting is done are abstracted away, there are
quite few places to touch.
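
Something along these lines, as an untested sketch - RLIMIT_KMEMBUF,
task->kmem_buf_bytes and memcg_charge_kmem() are made-up names for
illustration, not part of this series:

static int kmem_charge(struct task_struct *p, struct mem_cgroup *memcg,
		       unsigned long bytes)
{
	/* hypothetical per-task rlimit, checked next to the cgroup charge */
	if (p->kmem_buf_bytes + bytes > task_rlimit(p, RLIMIT_KMEMBUF))
		return -ENOMEM;

	/* hypothetical wrapper around the cgroup-side accounting */
	if (memcg_charge_kmem(memcg, bytes))
		return -ENOMEM;

	p->kmem_buf_bytes += bytes;
	return 0;
}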


^ permalink raw reply	[flat|nested] 119+ messages in thread

* Re: [PATCH v3 1/7] Basic kernel memory functionality for the Memory Controller
  2011-09-19  0:56   ` Glauber Costa
@ 2011-09-26 10:34     ` KAMEZAWA Hiroyuki
  -1 siblings, 0 replies; 119+ messages in thread
From: KAMEZAWA Hiroyuki @ 2011-09-26 10:34 UTC (permalink / raw)
  To: Glauber Costa
  Cc: linux-kernel, paul, lizf, ebiederm, davem, gthelen, netdev,
	linux-mm, kirill

On Sun, 18 Sep 2011 21:56:39 -0300
Glauber Costa <glommer@parallels.com> wrote:

> This patch lays down the foundation for the kernel memory component
> of the Memory Controller.
> 
> As of today, I am only laying down the following files:
> 
>  * memory.independent_kmem_limit
>  * memory.kmem.limit_in_bytes (currently ignored)
>  * memory.kmem.usage_in_bytes (always zero)
> 
> Signed-off-by: Glauber Costa <glommer@parallels.com>
> CC: Paul Menage <paul@paulmenage.org>
> CC: Greg Thelen <gthelen@google.com>

I'm sorry that my slow review is delaying you.


> ---
>  Documentation/cgroups/memory.txt |   30 +++++++++-
>  init/Kconfig                     |   11 ++++
>  mm/memcontrol.c                  |  115 ++++++++++++++++++++++++++++++++++++--
>  3 files changed, 148 insertions(+), 8 deletions(-)
> 
> diff --git a/Documentation/cgroups/memory.txt b/Documentation/cgroups/memory.txt
> index 6f3c598..6f1954a 100644
> --- a/Documentation/cgroups/memory.txt
> +++ b/Documentation/cgroups/memory.txt
> @@ -44,8 +44,9 @@ Features:
>   - oom-killer disable knob and oom-notifier
>   - Root cgroup has no limit controls.
>  
> - Kernel memory and Hugepages are not under control yet. We just manage
> - pages on LRU. To add more controls, we have to take care of performance.
> + Hugepages are not under control yet. We just manage pages on LRU. To add more
> + controls, we have to take care of performance. Kernel memory support is work
> + in progress, and the current version provides basic functionality.
>  
>  Brief summary of control files.
>  
> @@ -56,8 +57,11 @@ Brief summary of control files.
>  				 (See 5.5 for details)
>   memory.memsw.usage_in_bytes	 # show current res_counter usage for memory+Swap
>  				 (See 5.5 for details)
> + memory.kmem.usage_in_bytes	 # show current res_counter usage for kmem only.
> +				 (See 2.7 for details)
>   memory.limit_in_bytes		 # set/show limit of memory usage
>   memory.memsw.limit_in_bytes	 # set/show limit of memory+Swap usage
> + memory.kmem.limit_in_bytes	 # if allowed, set/show limit of kernel memory
>   memory.failcnt			 # show the number of memory usage hits limits
>   memory.memsw.failcnt		 # show the number of memory+Swap hits limits
>   memory.max_usage_in_bytes	 # show max memory usage recorded
> @@ -72,6 +76,9 @@ Brief summary of control files.
>   memory.oom_control		 # set/show oom controls.
>   memory.numa_stat		 # show the number of memory usage per numa node
>  
> + memory.independent_kmem_limit	 # select whether or not kernel memory limits are
> +				   independent of user limits
> +
>  1. History
>  
>  The memory controller has a long history. A request for comments for the memory
> @@ -255,6 +262,25 @@ When oom event notifier is registered, event will be delivered.
>    per-zone-per-cgroup LRU (cgroup's private LRU) is just guarded by
>    zone->lru_lock, it has no lock of its own.
>  
> +2.7 Kernel Memory Extension (CONFIG_CGROUP_MEM_RES_CTLR_KMEM)
> +
> + With the Kernel memory extension, the Memory Controller is able to limit
> +the amount of kernel memory used by the system. Kernel memory is fundamentally
> +different than user memory, since it can't be swapped out, which makes it
> +possible to DoS the system by consuming too much of this precious resource.
> +Kernel memory limits are not imposed for the root cgroup.
> +
> +Memory limits as specified by the standard Memory Controller may or may not
> +take kernel memory into consideration. This is achieved through the file
> +memory.independent_kmem_limit. A value different from 0 will allow for kernel
> +memory to be controlled separately.
> +
> +When kernel memory limits are not independent, the limit values set in
> +memory.kmem files are ignored.
> +
> +Currently no soft limit is implemented for kernel memory. It is future work
> +to trigger slab reclaim when those limits are reached.
> +
>  3. User Interface
>  
>  0. Configuration
> diff --git a/init/Kconfig b/init/Kconfig
> index d627783..49e5839 100644
> --- a/init/Kconfig
> +++ b/init/Kconfig
> @@ -689,6 +689,17 @@ config CGROUP_MEM_RES_CTLR_SWAP_ENABLED
>  	  For those who want to have the feature enabled by default should
>  	  select this option (if, for some reason, they need to disable it
>  	  then swapaccount=0 does the trick).
> +config CGROUP_MEM_RES_CTLR_KMEM
> +	bool "Memory Resource Controller Kernel Memory accounting"
> +	depends on CGROUP_MEM_RES_CTLR
> +	default y
> +	help
> +	  The Kernel Memory extension for Memory Resource Controller can limit
> +	  the amount of memory used by kernel objects in the system. Those are
> +	  fundamentally different from the entities handled by the standard
> +	  Memory Controller, which are page-based, and can be swapped. Users of
> +	  the kmem extension can use it to guarantee that no group of processes
> +	  will ever exhaust kernel resources alone.
>  
>  config CGROUP_PERF
>  	bool "Enable perf_event per-cpu per-container group (cgroup) monitoring"
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index ebd1e86..d32e931 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -73,7 +73,11 @@ static int really_do_swap_account __initdata = 0;
>  #define do_swap_account		(0)
>  #endif
>  
> -
> +#ifdef CONFIG_CGROUP_MEM_RES_CTLR_KMEM
> +int do_kmem_account __read_mostly = 1;
> +#else
> +#define do_kmem_account		0
> +#endif


Hmm, do we really need this boot option?
From my experience, having the swap-accounting boot option
scares us ;) I think the config option is enough.
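
I.e., drop the runtime toggle and let the config symbol alone decide;
a sketch:

#ifdef CONFIG_CGROUP_MEM_RES_CTLR_KMEM
#define do_kmem_account		1
#else
#define do_kmem_account		0
#endif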




>  /*
>   * Statistics for memory cgroup.
>   */
> @@ -270,6 +274,10 @@ struct mem_cgroup {
>  	 */
>  	struct res_counter memsw;
>  	/*
> +	 * the counter to account for kmem usage.
> +	 */
> +	struct res_counter kmem;
> +	/*
>  	 * Per cgroup active and inactive list, similar to the
>  	 * per zone LRU lists.
>  	 */
> @@ -321,6 +329,11 @@ struct mem_cgroup {
>  	 */
>  	unsigned long 	move_charge_at_immigrate;
>  	/*
> +	 * Should kernel memory limits be established independently
> +	 * from user memory?
> +	 */
> +	int		kmem_independent;
> +	/*
>  	 * percpu counter.
>  	 */
>  	struct mem_cgroup_stat_cpu *stat;
> @@ -388,9 +401,14 @@ enum charge_type {
>  };
>  
>  /* for encoding cft->private value on file */
> -#define _MEM			(0)
> -#define _MEMSWAP		(1)
> -#define _OOM_TYPE		(2)
> +
> +enum mem_type {
> +	_MEM = 0,
> +	_MEMSWAP,
> +	_OOM_TYPE,
> +	_KMEM,
> +};
> +

ok, nice clean up.


>  #define MEMFILE_PRIVATE(x, val)	(((x) << 16) | (val))
>  #define MEMFILE_TYPE(val)	(((val) >> 16) & 0xffff)
>  #define MEMFILE_ATTR(val)	((val) & 0xffff)
> @@ -3943,10 +3961,15 @@ static inline u64 mem_cgroup_usage(struct mem_cgroup *mem, bool swap)
>  	u64 val;
>  
>  	if (!mem_cgroup_is_root(mem)) {
> +		val = 0;
> +		if (!mem->kmem_independent)
> +			val = res_counter_read_u64(&mem->kmem, RES_USAGE);

>  		if (!swap)
> -			return res_counter_read_u64(&mem->res, RES_USAGE);
> +			val += res_counter_read_u64(&mem->res, RES_USAGE);
>  		else
> -			return res_counter_read_u64(&mem->memsw, RES_USAGE);
> +			val += res_counter_read_u64(&mem->memsw, RES_USAGE);
> +
> +		return val;
>  	}
>  
>  	val = mem_cgroup_recursive_stat(mem, MEM_CGROUP_STAT_CACHE);
> @@ -3979,6 +4002,10 @@ static u64 mem_cgroup_read(struct cgroup *cont, struct cftype *cft)
>  		else
>  			val = res_counter_read_u64(&mem->memsw, name);
>  		break;
> +	case _KMEM:
> +		val = res_counter_read_u64(&mem->kmem, name);
> +		break;
> +
>  	default:
>  		BUG();
>  		break;
> @@ -4756,6 +4783,21 @@ static int mem_cgroup_reset_vmscan_stat(struct cgroup *cgrp,
>  	return 0;
>  }
>  
> +#ifdef CONFIG_CGROUP_MEM_RES_CTLR_KMEM
> +static u64 kmem_limit_independent_read(struct cgroup *cont, struct cftype *cft)
> +{
> +	return mem_cgroup_from_cont(cont)->kmem_independent;
> +}
> +
> +static int kmem_limit_independent_write(struct cgroup *cont, struct cftype *cft,
> +					u64 val)
> +{
> +	cgroup_lock();
> +	mem_cgroup_from_cont(cont)->kmem_independent = !!val;
> +	cgroup_unlock();

Hm. This code allows parent and child to have different settings.
Could you add a parent-child check, as in:

"If the parent sets use_hierarchy==1, children must have the same
kmem_independent value as the parent's."

What do you think? I think a hierarchy must have the same config.
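
A minimal sketch of what I mean - untested, assuming parent_mem_cgroup()
(used elsewhere in this series) and the existing use_hierarchy flag:

static int kmem_limit_independent_write(struct cgroup *cont, struct cftype *cft,
					u64 val)
{
	struct mem_cgroup *mem = mem_cgroup_from_cont(cont);
	struct mem_cgroup *parent = parent_mem_cgroup(mem);
	int ret = 0;

	val = !!val;
	cgroup_lock();
	/* a hierarchy must share one setting: reject divergence */
	if (parent && parent->use_hierarchy && parent->kmem_independent != val)
		ret = -EINVAL;
	else
		mem->kmem_independent = val;
	cgroup_unlock();
	return ret;
}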


BTW... I'm not fond of the naming ;)

memory->consolidated/shared/?????_kmem_accounting ?
Or
memory->kmem_independent_accounting ?

Or some better naming?

Thanks,
-Kame


^ permalink raw reply	[flat|nested] 119+ messages in thread

* Re: [PATCH v3 2/7] socket: initial cgroup code.
  2011-09-24 14:45             ` Glauber Costa
@ 2011-09-26 10:52               ` KAMEZAWA Hiroyuki
  -1 siblings, 0 replies; 119+ messages in thread
From: KAMEZAWA Hiroyuki @ 2011-09-26 10:52 UTC (permalink / raw)
  To: Glauber Costa
  Cc: Balbir Singh, Greg Thelen, linux-kernel, paul, lizf, ebiederm,
	davem, netdev, linux-mm, kirill

On Sat, 24 Sep 2011 11:45:04 -0300
Glauber Costa <glommer@parallels.com> wrote:

> On 09/22/2011 12:09 PM, Balbir Singh wrote:
> > On Thu, Sep 22, 2011 at 11:30 AM, Greg Thelen<gthelen@google.com>  wrote:
> >> On Wed, Sep 21, 2011 at 11:59 AM, Glauber Costa<glommer@parallels.com>  wrote:
> >>> Right now I am working under the assumption that tasks are long lived inside
> >>> the cgroup. Migration potentially introduces some nasty locking problems in
> >>> the mem_schedule path.
> >>>
> >>> Also, unless I am missing something, the memcg already has the policy of
> >>> not carrying charges around, probably because of this very same complexity.
> >>>
> >>> True that at least it won't EBUSY you... But I think this is at least a way
> >>> to guarantee that the cgroup under our nose won't disappear in the middle of
> >>> our allocations.
> >>
> >> Here's the memcg user page behavior using the same pattern:
> >>
> >> 1. user page P is allocate by task T in memcg M1
> >> 2. T is moved to memcg M2.  The P charge is left behind still charged
> >> to M1 if memory.move_charge_at_immigrate=0; or the charge is moved to
> >> M2 if memory.move_charge_at_immigrate=1.
> >> 3. rmdir M1 will try to reclaim P (if P was left in M1).  If unable to
> >> reclaim, then P is recharged to parent(M1).
> >>
> >
> > We also have some magic in page_referenced() to remove pages
> > referenced from different containers. What we do is try not to
> > penalize a cgroup if another cgroup is referencing this page and the
> > page under consideration is being reclaimed from the cgroup that
> > touched it.
> >
> > Balbir Singh
> Do you guys see it as a showstopper for this series to be merged, or can 
> we just TODO it ?
> 

In my experience, 'I can't rmdir the cgroup.' is always an important/difficult
problem. The users cannot see where the accounting is leaking beyond
kmem.usage_in_bytes or memory.usage_in_bytes, and can't fix the issue.

Please add EXPERIMENTAL to the Kconfig entry until this is fixed.
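
I.e., something like this against the Kconfig entry quoted above (sketch):

config CGROUP_MEM_RES_CTLR_KMEM
	bool "Memory Resource Controller Kernel Memory accounting (EXPERIMENTAL)"
	depends on CGROUP_MEM_RES_CTLR && EXPERIMENTAL
	default y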

> I can push a proposal for it, but it would be done in a separate patch 
> anyway. Also, we may be in better conditions to fix this when the slab 
> part is merged - since it will likely have the same problems...
> 

Yes. Considering sockets which can be shared between tasks (cgroups),
you'll eventually need
  - an owner task for each socket
  - an account-moving callback

Or disallow task moving once accounted.
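
The 'disallow' variant could look roughly like this - a sketch, where
task_has_accounted_sockets() is a hypothetical helper and the callback
signature is assumed to be the current cgroup can_attach one:

static int mem_cgroup_can_attach(struct cgroup_subsys *ss,
				 struct cgroup *cgrp,
				 struct task_struct *p)
{
	/* hypothetical helper: does this task already pin accounted
	 * socket buffers in its current memcg? */
	if (task_has_accounted_sockets(p))
		return -EBUSY;
	return 0;
}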


Thanks,
-Kame

^ permalink raw reply	[flat|nested] 119+ messages in thread

* Re: [PATCH v3 4/7] per-cgroup tcp buffers control
  2011-09-19  0:56   ` Glauber Costa
@ 2011-09-26 10:59     ` KAMEZAWA Hiroyuki
  -1 siblings, 0 replies; 119+ messages in thread
From: KAMEZAWA Hiroyuki @ 2011-09-26 10:59 UTC (permalink / raw)
  To: Glauber Costa
  Cc: linux-kernel, paul, lizf, ebiederm, davem, gthelen, netdev,
	linux-mm, kirill

On Sun, 18 Sep 2011 21:56:42 -0300
Glauber Costa <glommer@parallels.com> wrote:

> With all the infrastructure in place, this patch implements
> per-cgroup control for tcp memory pressure handling.
> 
> Signed-off-by: Glauber Costa <glommer@parallels.com>
> CC: David S. Miller <davem@davemloft.net>
> CC: Hiroyouki Kamezawa <kamezawa.hiroyu@jp.fujitsu.com>
> CC: Eric W. Biederman <ebiederm@xmission.com>

a comment below.

> +int tcp_init_cgroup(struct proto *prot, struct cgroup *cgrp,
> +		    struct cgroup_subsys *ss)
> +{
> +	struct mem_cgroup *cg = mem_cgroup_from_cont(cgrp);
> +	unsigned long limit;
> +
> +	cg->tcp_memory_pressure = 0;
> +	atomic_long_set(&cg->tcp_memory_allocated, 0);
> +	percpu_counter_init(&cg->tcp_sockets_allocated, 0);
> +
> +	limit = nr_free_buffer_pages() / 8;
> +	limit = max(limit, 128UL);
> +
> +	cg->tcp_prot_mem[0] = sysctl_tcp_mem[0];
> +	cg->tcp_prot_mem[1] = sysctl_tcp_mem[1];
> +	cg->tcp_prot_mem[2] = sysctl_tcp_mem[2];
> +

Then, the parameters don't inherit the parent's ones?

I think sockets_populate should pass 'parent', and I think you
should have a function

    mem_cgroup_should_inherit_parent_settings(parent)

(This is because you made this feature as a part of memcg, so
please provide the expected behavior.)
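
For illustration, a minimal sketch of the inheritance in tcp_init_cgroup() -
untested, and it assumes parent_mem_cgroup() is usable here:

int tcp_init_cgroup(struct proto *prot, struct cgroup *cgrp,
		    struct cgroup_subsys *ss)
{
	struct mem_cgroup *cg = mem_cgroup_from_cont(cgrp);
	struct mem_cgroup *parent = parent_mem_cgroup(cg);
	int i;

	cg->tcp_memory_pressure = 0;
	atomic_long_set(&cg->tcp_memory_allocated, 0);
	percpu_counter_init(&cg->tcp_sockets_allocated, 0);

	/* inherit the parent's pressure thresholds; fall back to the
	 * global sysctl values at the root */
	for (i = 0; i < 3; i++)
		cg->tcp_prot_mem[i] = parent ? parent->tcp_prot_mem[i]
					     : sysctl_tcp_mem[i];
	return 0;
}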

Thanks,
-Kame



^ permalink raw reply	[flat|nested] 119+ messages in thread

* Re: [PATCH v3 6/7] tcp buffer limitation: per-cgroup limit
  2011-09-24 13:30       ` Glauber Costa
@ 2011-09-26 11:02         ` KAMEZAWA Hiroyuki
  -1 siblings, 0 replies; 119+ messages in thread
From: KAMEZAWA Hiroyuki @ 2011-09-26 11:02 UTC (permalink / raw)
  To: Glauber Costa
  Cc: Greg Thelen, linux-kernel, paul, lizf, ebiederm, davem, netdev,
	linux-mm, kirill

On Sat, 24 Sep 2011 10:30:42 -0300
Glauber Costa <glommer@parallels.com> wrote:

> On 09/22/2011 03:01 AM, Greg Thelen wrote:
> > On Sun, Sep 18, 2011 at 5:56 PM, Glauber Costa<glommer@parallels.com>  wrote:
> >> +static inline bool mem_cgroup_is_root(struct mem_cgroup *mem)
> >> +{
> >> +       return (mem == root_mem_cgroup);
> >> +}
> >> +
> >
> > Why are you adding a copy of mem_cgroup_is_root().  I see one already
> > in v3.0.  Was it deleted in a previous patch?
> 
> Already answered by another good samaritan.
> 
> >> +static int tcp_write_maxmem(struct cgroup *cgrp, struct cftype *cft, u64 val)
> >> +{
> >> +       struct mem_cgroup *sg = mem_cgroup_from_cont(cgrp);
> >> +       struct mem_cgroup *parent = parent_mem_cgroup(sg);
> >> +       struct net *net = current->nsproxy->net_ns;
> >> +       int i;
> >> +
> >> +       if (!cgroup_lock_live_group(cgrp))
> >> +               return -ENODEV;
> >
> > Why is cgroup_lock_live_cgroup() needed here?  Does it protect updates
> > to sg->tcp_prot_mem[*]?
> >
> >> +static u64 tcp_read_maxmem(struct cgroup *cgrp, struct cftype *cft)
> >> +{
> >> +       struct mem_cgroup *sg = mem_cgroup_from_cont(cgrp);
> >> +       u64 ret;
> >> +
> >> +       if (!cgroup_lock_live_group(cgrp))
> >> +               return -ENODEV;
> >
> > Why is cgroup_lock_live_cgroup() needed here?  Does it protect updates
> > to sg->tcp_max_memory?
> 
> No, that is not my understanding. My understanding is this lock is 
> needed to protect against the cgroup just disappearing under our nose.
> 

Hm. Isn't the reference count on the cgroup's dentry enough?

Thanks,
-Kame


^ permalink raw reply	[flat|nested] 119+ messages in thread

* Re: [PATCH v3 4/7] per-cgroup tcp buffers control
  2011-09-19  0:56   ` Glauber Costa
@ 2011-09-26 14:39     ` Andrew Vagin
  -1 siblings, 0 replies; 119+ messages in thread
From: Andrew Vagin @ 2011-09-26 14:39 UTC (permalink / raw)
  To: Glauber Costa
  Cc: linux-kernel, paul, lizf, kamezawa.hiroyu, ebiederm, davem,
	gthelen, netdev, linux-mm, kirill

We can't change net.ipv4.tcp_mem if a cgroup with memory controller 
isn't mounted.

[root@dhcp-10-30-20-19 ~]# sysctl -w net.ipv4.tcp_mem="3 2 3"
error: "Invalid argument" setting key "net.ipv4.tcp_mem"

It's because tcp_max_memory is initialized in mem_cgroup_populate:

mem_cgroup_populate->register_kmem_files->sockets_populate->tcp_init_cgroup

> +int sockets_populate(struct cgroup *cgrp, struct cgroup_subsys *ss)
> +{
> +	struct proto *proto;
> +	int ret = 0;
> +
> +	read_lock(&proto_list_lock);
> +	list_for_each_entry(proto, &proto_list, node) {
> +		if (proto->init_cgroup)
> +			ret |= proto->init_cgroup(proto, cgrp, ss);
> +	}
> +	if (!ret)
> +		goto out;
> +
> +	list_for_each_entry_continue_reverse(proto, &proto_list, node)
> +		if (proto->destroy_cgroup)
> +			proto->destroy_cgroup(proto, cgrp, ss);
> +
> +out:
> +	read_unlock(&proto_list_lock);
> +	return ret;
> +}

> @@ -198,6 +203,21 @@ static int ipv4_tcp_mem(ctl_table *ctl, int write,
>   	if (ret)
>   		return ret;
>
> +#ifdef CONFIG_CGROUP_MEM_RES_CTLR_KMEM
> +	rcu_read_lock();
> +	cg = mem_cgroup_from_task(current);
> +	for (i = 0; i < 3; i++)
> +		if (vec[i] > tcp_max_memory(cg)) {
> +			rcu_read_unlock();
> +			return -EINVAL;
> +		}


^ permalink raw reply	[flat|nested] 119+ messages in thread

* Re: [PATCH v3 1/7] Basic kernel memory functionality for the Memory Controller
  2011-09-26 10:34     ` KAMEZAWA Hiroyuki
  (?)
@ 2011-09-26 22:44       ` Glauber Costa
  -1 siblings, 0 replies; 119+ messages in thread
From: Glauber Costa @ 2011-09-26 22:44 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki
  Cc: linux-kernel, paul, lizf, ebiederm, davem, gthelen, netdev,
	linux-mm, kirill

>>   #endif
>>
>> -
>> +#ifdef CONFIG_CGROUP_MEM_RES_CTLR_KMEM
>> +int do_kmem_account __read_mostly = 1;
>> +#else
>> +#define do_kmem_account		0
>> +#endif
>
>
> Hmm, do we really need this boot option ?
>  From my experience to have swap-accounting boot option,
> this scares us ;) I think config is enough.

If no one else wants it, I can remove it. I personally don't need it;
I just wanted to follow the convention laid down by swap here.
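
For reference, what the switch amounts to is just the usual __setup()
hook, modeled on the swapaccount= convention (the kmemaccount= name
below is only illustrative, not something this series defines):

/* illustrative boot switch, mirroring the swapaccount= convention */
static int __init kmem_account_setup(char *s)
{
	if (!strcmp(s, "0"))
		do_kmem_account = 0;
	else if (!strcmp(s, "1"))
		do_kmem_account = 1;
	return 1;
}
__setup("kmemaccount=", kmem_account_setup);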

>
>
>
>>   /*
>>    * Statistics for memory cgroup.
>>    */
>> @@ -270,6 +274,10 @@ struct mem_cgroup {
>>   	 */
>>   	struct res_counter memsw;
>>   	/*
>> +	 * the counter to account for kmem usage.
>> +	 */
>> +	struct res_counter kmem;
>> +	/*
>>   	 * Per cgroup active and inactive list, similar to the
>>   	 * per zone LRU lists.
>>   	 */
>> @@ -321,6 +329,11 @@ struct mem_cgroup {
>>   	 */
>>   	unsigned long 	move_charge_at_immigrate;
>>   	/*
>> +	 * Should kernel memory limits be established independently
>> +	 * from user memory ?
>> +	 */
>> +	int		kmem_independent;
>> +	/*
>>   	 * percpu counter.
>>   	 */
>>   	struct mem_cgroup_stat_cpu *stat;
>> @@ -388,9 +401,14 @@ enum charge_type {
>>   };
>>
>>   /* for encoding cft->private value on file */
>> -#define _MEM			(0)
>> -#define _MEMSWAP		(1)
>> -#define _OOM_TYPE		(2)
>> +
>> +enum mem_type {
>> +	_MEM = 0,
>> +	_MEMSWAP,
>> +	_OOM_TYPE,
>> +	_KMEM,
>> +};
>> +
>
> ok, nice clean up.
>
>
>>   #define MEMFILE_PRIVATE(x, val)	(((x) << 16) | (val))
>>   #define MEMFILE_TYPE(val)	(((val) >> 16) & 0xffff)
>>   #define MEMFILE_ATTR(val)	((val) & 0xffff)
>> @@ -3943,10 +3961,15 @@ static inline u64 mem_cgroup_usage(struct mem_cgroup *mem, bool swap)
>>   	u64 val;
>>
>>   	if (!mem_cgroup_is_root(mem)) {
>> +		val = 0;
>> +		if (!mem->kmem_independent)
>> +			val = res_counter_read_u64(&mem->kmem, RES_USAGE);
>
>>   		if (!swap)
>> -			return res_counter_read_u64(&mem->res, RES_USAGE);
>> +			val += res_counter_read_u64(&mem->res, RES_USAGE);
>>   		else
>> -			return res_counter_read_u64(&mem->memsw, RES_USAGE);
>> +			val += res_counter_read_u64(&mem->memsw, RES_USAGE);
>> +
>> +		return val;
>>   	}
>>
>>   	val = mem_cgroup_recursive_stat(mem, MEM_CGROUP_STAT_CACHE);
>> @@ -3979,6 +4002,10 @@ static u64 mem_cgroup_read(struct cgroup *cont, struct cftype *cft)
>>   		else
>>   			val = res_counter_read_u64(&mem->memsw, name);
>>   		break;
>> +	case _KMEM:
>> +		val = res_counter_read_u64(&mem->kmem, name);
>> +		break;
>> +
>>   	default:
>>   		BUG();
>>   		break;
>> @@ -4756,6 +4783,21 @@ static int mem_cgroup_reset_vmscan_stat(struct cgroup *cgrp,
>>   	return 0;
>>   }
>>
>> +#ifdef CONFIG_CGROUP_MEM_RES_CTLR_KMEM
>> +static u64 kmem_limit_independent_read(struct cgroup *cont, struct cftype *cft)
>> +{
>> +	return mem_cgroup_from_cont(cont)->kmem_independent;
>> +}
>> +
>> +static int kmem_limit_independent_write(struct cgroup *cont, struct cftype *cft,
>> +					u64 val)
>> +{
>> +	cgroup_lock();
>> +	mem_cgroup_from_cont(cont)->kmem_independent = !!val;
>> +	cgroup_unlock();
>
> Hm. This code allows that parent/child can have different settings.
> Could you add parent-child check as..
>
> "If parent sets use_hierarchy==1, children must have the same kmem_independent value
> with parent's one."
Agree.
> How do you think ? I think a hierarchy must have the same config.
Yes, I think this is reasonable.
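
As a sketch, the write side could grow a check along these lines (the
hierarchy test is the new part; whether -EINVAL is the right error is
open to debate):

static int kmem_limit_independent_write(struct cgroup *cont, struct cftype *cft,
					u64 val)
{
	struct mem_cgroup *mem = mem_cgroup_from_cont(cont);
	struct mem_cgroup *parent = parent_mem_cgroup(mem);

	val = !!val;
	cgroup_lock();
	/* in a use_hierarchy tree, children must match the parent */
	if (parent && parent->use_hierarchy &&
	    parent->kmem_independent != val) {
		cgroup_unlock();
		return -EINVAL;
	}
	mem->kmem_independent = val;
	cgroup_unlock();
	return 0;
}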

>
> BTW...I don't like naming a little ;)
>
> memory->consolidated/shared/?????_kmem_accounting ?
> Or
> memory->kmem_independent_accounting ?
>
> or some better naming ?

I can go with kmem_independent_accounting if you like, it is fine
by me.



^ permalink raw reply	[flat|nested] 119+ messages in thread

* Re: [PATCH v3 2/7] socket: initial cgroup code.
  2011-09-26 10:52               ` KAMEZAWA Hiroyuki
  (?)
@ 2011-09-26 22:47                 ` Glauber Costa
  -1 siblings, 0 replies; 119+ messages in thread
From: Glauber Costa @ 2011-09-26 22:47 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki
  Cc: Balbir Singh, Greg Thelen, linux-kernel, paul, lizf, ebiederm,
	davem, netdev, linux-mm, kirill

On 09/26/2011 07:52 AM, KAMEZAWA Hiroyuki wrote:
> On Sat, 24 Sep 2011 11:45:04 -0300
> Glauber Costa<glommer@parallels.com>  wrote:
>
>> On 09/22/2011 12:09 PM, Balbir Singh wrote:
>>> On Thu, Sep 22, 2011 at 11:30 AM, Greg Thelen<gthelen@google.com>   wrote:
>>>> On Wed, Sep 21, 2011 at 11:59 AM, Glauber Costa<glommer@parallels.com>   wrote:
>>>>> Right now I am working under the assumption that tasks are long lived inside
>>>>> the cgroup. Migration potentially introduces some nasty locking problems in
>>>>> the mem_schedule path.
>>>>>
>>>>> Also, unless I am missing something, the memcg already has the policy of
>>>>> not carrying charges around, probably because of this very same complexity.
>>>>>
>>>>> True that at least it won't EBUSY you... But I think this is at least a way
>>>>> to guarantee that the cgroup under our nose won't disappear in the middle of
>>>>> our allocations.
>>>>
>>>> Here's the memcg user page behavior using the same pattern:
>>>>
>>>> 1. user page P is allocated by task T in memcg M1
>>>> 2. T is moved to memcg M2.  The P charge is left behind still charged
>>>> to M1 if memory.move_charge_at_immigrate=0; or the charge is moved to
>>>> M2 if memory.move_charge_at_immigrate=1.
>>>> 3. rmdir M1 will try to reclaim P (if P was left in M1).  If unable to
>>>> reclaim, then P is recharged to parent(M1).
>>>>
>>>
>>> We also have some magic in page_referenced() to remove pages
>>> referenced from different containers. What we do is try not to
>>> penalize a cgroup if another cgroup is referencing this page and the
>>> page under consideration is being reclaimed from the cgroup that
>>> touched it.
>>>
>>> Balbir Singh
>> Do you guys see it as a showstopper for this series to be merged, or can
>> we just TODO it ?
>>
>
> In my experience, 'I can't rmdir cgroup.' is always an important/difficult
> problem. The users cannot know where the accounting is leaking other than
> kmem.usage_in_bytes or memory.usage_in_bytes, and can't fix the issue.
>
> please add EXPERIMENTAL to Kconfig until this is fixed.

I am working on something here that may allow it.
But I think it is independent of the rest, and I can repost the series 
fixing the problems raised here without it, + EXPERIMENTAL.

Btw, using EXPERIMENTAL here is a very good idea. I think that we should
keep EXPERIMENTAL on even if a fix for that exists, for at least a couple
of months until we see how this thing really evolves.

What do you think?

>> I can push a proposal for it, but it would be done in a separate patch
>> anyway. Also, we may be in better conditions to fix this when the slab
>> part is merged - since it will likely have the same problems...
>>
>
> Yes. considering sockets which can be shared between tasks(cgroups)
> you'll finally need
>    - owner task of socket
>    - account moving callback
>
> Or disallow task moving once accounted.

I personally think disallowing task movement once accounted is 
reasonable. At least for starters.

I think I can add at least that to the next proposal. Famous last words:
it should not be that hard...
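
Roughly, I am thinking of something like this sketch, keying off the
counters this series already has. The callback wiring is illustrative
only, and it is coarser than per-socket ownership, but it may be enough
for starters:

static int memcg_sock_can_attach(struct cgroup_subsys *ss,
				 struct cgroup *cgrp,
				 struct task_struct *tsk)
{
	struct mem_cgroup *mem;
	int ret = 0;

	rcu_read_lock();
	mem = mem_cgroup_from_task(tsk);
	/* refuse to move a task out of a memcg that accounts sockets */
	if (mem && percpu_counter_sum(&mem->tcp_sockets_allocated) > 0)
		ret = -EBUSY;
	rcu_read_unlock();
	return ret;
}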


^ permalink raw reply	[flat|nested] 119+ messages in thread

* Re: [PATCH v3 4/7] per-cgroup tcp buffers control
  2011-09-26 10:59     ` KAMEZAWA Hiroyuki
  (?)
@ 2011-09-26 22:48       ` Glauber Costa
  -1 siblings, 0 replies; 119+ messages in thread
From: Glauber Costa @ 2011-09-26 22:48 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki
  Cc: linux-kernel, paul, lizf, ebiederm, davem, gthelen, netdev,
	linux-mm, kirill

On 09/26/2011 07:59 AM, KAMEZAWA Hiroyuki wrote:
> On Sun, 18 Sep 2011 21:56:42 -0300
> Glauber Costa<glommer@parallels.com>  wrote:
>
>> With all the infrastructure in place, this patch implements
>> per-cgroup control for tcp memory pressure handling.
>>
>> Signed-off-by: Glauber Costa<glommer@parallels.com>
>> CC: David S. Miller<davem@davemloft.net>
>> CC: Hiroyouki Kamezawa<kamezawa.hiroyu@jp.fujitsu.com>
>> CC: Eric W. Biederman<ebiederm@xmission.com>
>
> a comment below.
>
>> +int tcp_init_cgroup(struct proto *prot, struct cgroup *cgrp,
>> +		    struct cgroup_subsys *ss)
>> +{
>> +	struct mem_cgroup *cg = mem_cgroup_from_cont(cgrp);
>> +	unsigned long limit;
>> +
>> +	cg->tcp_memory_pressure = 0;
>> +	atomic_long_set(&cg->tcp_memory_allocated, 0);
>> +	percpu_counter_init(&cg->tcp_sockets_allocated, 0);
>> +
>> +	limit = nr_free_buffer_pages() / 8;
>> +	limit = max(limit, 128UL);
>> +
>> +	cg->tcp_prot_mem[0] = sysctl_tcp_mem[0];
>> +	cg->tcp_prot_mem[1] = sysctl_tcp_mem[1];
>> +	cg->tcp_prot_mem[2] = sysctl_tcp_mem[2];
>> +
>
> Then, the parameter doesn't inherit parent's one ?
>
> I think sockets_populate should pass 'parent' and
>
>
> I think you should have a function
>
>      mem_cgroup_should_inherit_parent_settings(parent)
>
> (This is because you made this feature as a part of memcg.
>   please provide expected behavior.)
>
All right Kame, will do.
Thanks.
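
Concretely, I imagine tcp_init_cgroup() doing something like this
sketch, with sockets_populate passing the parent down
(parent_mem_cgroup() is the helper this series already uses):

	struct mem_cgroup *parent = parent_mem_cgroup(cg);
	int i;

	for (i = 0; i < 3; i++) {
		if (parent && parent->use_hierarchy)
			cg->tcp_prot_mem[i] = parent->tcp_prot_mem[i];
		else
			cg->tcp_prot_mem[i] = sysctl_tcp_mem[i];
	}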



^ permalink raw reply	[flat|nested] 119+ messages in thread

* Re: [PATCH v3 6/7] tcp buffer limitation: per-cgroup limit
  2011-09-26 11:02         ` KAMEZAWA Hiroyuki
  (?)
@ 2011-09-26 22:49           ` Glauber Costa
  -1 siblings, 0 replies; 119+ messages in thread
From: Glauber Costa @ 2011-09-26 22:49 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki
  Cc: Greg Thelen, linux-kernel, paul, lizf, ebiederm, davem, netdev,
	linux-mm, kirill

On 09/26/2011 08:02 AM, KAMEZAWA Hiroyuki wrote:
> On Sat, 24 Sep 2011 10:30:42 -0300
> Glauber Costa<glommer@parallels.com>  wrote:
>
>> On 09/22/2011 03:01 AM, Greg Thelen wrote:
>>> On Sun, Sep 18, 2011 at 5:56 PM, Glauber Costa<glommer@parallels.com>   wrote:
>>>> +static inline bool mem_cgroup_is_root(struct mem_cgroup *mem)
>>>> +{
>>>> +       return (mem == root_mem_cgroup);
>>>> +}
>>>> +
>>>
>>> Why are you adding a copy of mem_cgroup_is_root().  I see one already
>>> in v3.0.  Was it deleted in a previous patch?
>>
>> Already answered by another good samaritan.
>>
>>>> +static int tcp_write_maxmem(struct cgroup *cgrp, struct cftype *cft, u64 val)
>>>> +{
>>>> +       struct mem_cgroup *sg = mem_cgroup_from_cont(cgrp);
>>>> +       struct mem_cgroup *parent = parent_mem_cgroup(sg);
>>>> +       struct net *net = current->nsproxy->net_ns;
>>>> +       int i;
>>>> +
>>>> +       if (!cgroup_lock_live_group(cgrp))
>>>> +               return -ENODEV;
>>>
>>> Why is cgroup_lock_live_cgroup() needed here?  Does it protect updates
>>> to sg->tcp_prot_mem[*]?
>>>
>>>> +static u64 tcp_read_maxmem(struct cgroup *cgrp, struct cftype *cft)
>>>> +{
>>>> +       struct mem_cgroup *sg = mem_cgroup_from_cont(cgrp);
>>>> +       u64 ret;
>>>> +
>>>> +       if (!cgroup_lock_live_group(cgrp))
>>>> +               return -ENODEV;
>>>
>>> Why is cgroup_lock_live_cgroup() needed here?  Does it protect updates
>>> to sg->tcp_max_memory?
>>
>> No, that is not my understanding. My understanding is this lock is
>> needed to protect against the cgroup just disappearing under our nose.
>>
>
> Hm. reference count of dentry for cgroup isn't enough ?
>
> Thanks,
> -Kame
>
think think think think think think...
Yeah, I guess it is.
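
That is, with the open control file's dentry pinning the cgroup, the
read side could plausibly drop the lock altogether (sketch):

static u64 tcp_read_maxmem(struct cgroup *cgrp, struct cftype *cft)
{
	struct mem_cgroup *sg = mem_cgroup_from_cont(cgrp);

	/* the dentry reference held by the open file keeps cgrp alive */
	return sg->tcp_max_memory;
}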


^ permalink raw reply	[flat|nested] 119+ messages in thread

* Re: [PATCH v3 4/7] per-cgroup tcp buffers control
  2011-09-26 14:39     ` Andrew Vagin
  (?)
@ 2011-09-26 22:52       ` Glauber Costa
  -1 siblings, 0 replies; 119+ messages in thread
From: Glauber Costa @ 2011-09-26 22:52 UTC (permalink / raw)
  To: avagin
  Cc: linux-kernel, paul, lizf, kamezawa.hiroyu, ebiederm, davem,
	gthelen, netdev, linux-mm, kirill

On 09/26/2011 11:39 AM, Andrew Vagin wrote:
> We can't change net.ipv4.tcp_mem if a cgroup with memory controller
> isn't mounted.
>
> [root@dhcp-10-30-20-19 ~]# sysctl -w net.ipv4.tcp_mem="3 2 3"
> error: "Invalid argument" setting key "net.ipv4.tcp_mem"
>
> It's because tcp_max_memory is initialized in mem_cgroup_populate:
>
> mem_cgroup_populate->register_kmem_files->sockets_populate->tcp_init_cgroup

Thank you, will fix it

>> +int sockets_populate(struct cgroup *cgrp, struct cgroup_subsys *ss)
>> +{
>> +	struct proto *proto;
>> +	int ret = 0;
>> +
>> +	read_lock(&proto_list_lock);
>> +	list_for_each_entry(proto, &proto_list, node) {
>> +		if (proto->init_cgroup)
>> +			ret |= proto->init_cgroup(proto, cgrp, ss);
>> +	}
>> +	if (!ret)
>> +		goto out;
>> +
>> +	list_for_each_entry_continue_reverse(proto, &proto_list, node)
>> +		if (proto->destroy_cgroup)
>> +			proto->destroy_cgroup(proto, cgrp, ss);
>> +
>> +out:
>> +	read_unlock(&proto_list_lock);
>> +	return ret;
>> +}
>
>> @@ -198,6 +203,21 @@ static int ipv4_tcp_mem(ctl_table *ctl, int write,
>>   	if (ret)
>>   		return ret;
>>
>> +#ifdef CONFIG_CGROUP_MEM_RES_CTLR_KMEM
>> +	rcu_read_lock();
>> +	cg = mem_cgroup_from_task(current);
>> +	for (i = 0; i < 3; i++)
>> +		if (vec[i] > tcp_max_memory(cg)) {
>> +			rcu_read_unlock();
>> +			return -EINVAL;
>> +		}
>
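
A minimal sketch of what the fix could look like: skip the per-cgroup
cap until tcp_init_cgroup() has actually run. The
tcp_cgroup_initialized() helper below is illustrative only, not
something the series defines yet:

#ifdef CONFIG_CGROUP_MEM_RES_CTLR_KMEM
	rcu_read_lock();
	cg = mem_cgroup_from_task(current);
	/* no kmem state yet: fall back to the plain sysctl behaviour */
	if (cg && tcp_cgroup_initialized(cg))
		for (i = 0; i < 3; i++)
			if (vec[i] > tcp_max_memory(cg)) {
				rcu_read_unlock();
				return -EINVAL;
			}
	rcu_read_unlock();
#endif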


^ permalink raw reply	[flat|nested] 119+ messages in thread

* Re: [PATCH v3 1/7] Basic kernel memory functionality for the Memory Controller
  2011-09-26 10:34     ` KAMEZAWA Hiroyuki
  (?)
@ 2011-09-26 23:18       ` Glauber Costa
  -1 siblings, 0 replies; 119+ messages in thread
From: Glauber Costa @ 2011-09-26 23:18 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki
  Cc: linux-kernel, paul, lizf, ebiederm, davem, gthelen, netdev,
	linux-mm, kirill

On 09/26/2011 07:34 AM, KAMEZAWA Hiroyuki wrote:
> On Sun, 18 Sep 2011 21:56:39 -0300
> Glauber Costa<glommer@parallels.com>  wrote:
>
>> This patch lays down the foundation for the kernel memory component
>> of the Memory Controller.
>>
>> As of today, I am only laying down the following files:
>>
>>   * memory.independent_kmem_limit
>>   * memory.kmem.limit_in_bytes (currently ignored)
>>   * memory.kmem.usage_in_bytes (always zero)
>>
>> Signed-off-by: Glauber Costa<glommer@parallels.com>
>> CC: Paul Menage<paul@paulmenage.org>
>> CC: Greg Thelen<gthelen@google.com>
>
> I'm sorry that my slow review is delaying you.
>
>
>> ---
>>   Documentation/cgroups/memory.txt |   30 +++++++++-
>>   init/Kconfig                     |   11 ++++
>>   mm/memcontrol.c                  |  115 ++++++++++++++++++++++++++++++++++++--
>>   3 files changed, 148 insertions(+), 8 deletions(-)
>>
>> diff --git a/Documentation/cgroups/memory.txt b/Documentation/cgroups/memory.txt
>> index 6f3c598..6f1954a 100644
>> --- a/Documentation/cgroups/memory.txt
>> +++ b/Documentation/cgroups/memory.txt
>> @@ -44,8 +44,9 @@ Features:
>>    - oom-killer disable knob and oom-notifier
>>    - Root cgroup has no limit controls.
>>
>> - Kernel memory and Hugepages are not under control yet. We just manage
>> - pages on LRU. To add more controls, we have to take care of performance.
>> + Hugepages are not under control yet. We just manage pages on LRU. To add more
>> + controls, we have to take care of performance. Kernel memory support is work
>> + in progress, and the current version provides basic functionality.
>>
>>   Brief summary of control files.
>>
>> @@ -56,8 +57,11 @@ Brief summary of control files.
>>   				 (See 5.5 for details)
>>    memory.memsw.usage_in_bytes	 # show current res_counter usage for memory+Swap
>>   				 (See 5.5 for details)
>> + memory.kmem.usage_in_bytes	 # show current res_counter usage for kmem only.
>> +				 (See 2.7 for details)
>>    memory.limit_in_bytes		 # set/show limit of memory usage
>>    memory.memsw.limit_in_bytes	 # set/show limit of memory+Swap usage
>> + memory.kmem.limit_in_bytes	 # if allowed, set/show limit of kernel memory
>>    memory.failcnt			 # show the number of memory usage hits limits
>>    memory.memsw.failcnt		 # show the number of memory+Swap hits limits
>>    memory.max_usage_in_bytes	 # show max memory usage recorded
>> @@ -72,6 +76,9 @@ Brief summary of control files.
>>    memory.oom_control		 # set/show oom controls.
>>    memory.numa_stat		 # show the number of memory usage per numa node
>>
>> + memory.independent_kmem_limit	 # select whether or not kernel memory limits are
>> +				   independent of user limits
>> +
>>   1. History
>>
>>   The memory controller has a long history. A request for comments for the memory
>> @@ -255,6 +262,25 @@ When oom event notifier is registered, event will be delivered.
>>     per-zone-per-cgroup LRU (cgroup's private LRU) is just guarded by
>>     zone->lru_lock, it has no lock of its own.
>>
>> +2.7 Kernel Memory Extension (CONFIG_CGROUP_MEM_RES_CTLR_KMEM)
>> +
>> + With the Kernel memory extension, the Memory Controller is able to limit
>> +the amount of kernel memory used by the system. Kernel memory is fundamentally
>> +different from user memory, since it can't be swapped out, which makes it
>> +possible to DoS the system by consuming too much of this precious resource.
>> +Kernel memory limits are not imposed for the root cgroup.
>> +
>> +Memory limits as specified by the standard Memory Controller may or may not
>> +take kernel memory into consideration. This is achieved through the file
>> +memory.independent_kmem_limit. A value different from 0 will allow for kernel
>> +memory to be controlled separately.
>> +
>> +When kernel memory limits are not independent, the limit values set in
>> +memory.kmem files are ignored.
>> +
>> +Currently no soft limit is implemented for kernel memory. It is future work
>> +to trigger slab reclaim when those limits are reached.
>> +
>>   3. User Interface
>>
>>   0. Configuration
>> diff --git a/init/Kconfig b/init/Kconfig
>> index d627783..49e5839 100644
>> --- a/init/Kconfig
>> +++ b/init/Kconfig
>> @@ -689,6 +689,17 @@ config CGROUP_MEM_RES_CTLR_SWAP_ENABLED
>>   	  For those who want to have the feature enabled by default should
>>   	  select this option (if, for some reason, they need to disable it
>>   	  then swapaccount=0 does the trick).
>> +config CGROUP_MEM_RES_CTLR_KMEM
>> +	bool "Memory Resource Controller Kernel Memory accounting"
>> +	depends on CGROUP_MEM_RES_CTLR
>> +	default y
>> +	help
>> +	  The Kernel Memory extension for Memory Resource Controller can limit
>> +	  the amount of memory used by kernel objects in the system. Those are
>> +	  fundamentally different from the entities handled by the standard
>> +	  Memory Controller, which are page-based, and can be swapped. Users of
>> +	  the kmem extension can use it to guarantee that no group of processes
>> +	  will ever exhaust kernel resources alone.
>>
>>   config CGROUP_PERF
>>   	bool "Enable perf_event per-cpu per-container group (cgroup) monitoring"
>> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
>> index ebd1e86..d32e931 100644
>> --- a/mm/memcontrol.c
>> +++ b/mm/memcontrol.c
>> @@ -73,7 +73,11 @@ static int really_do_swap_account __initdata = 0;
>>   #define do_swap_account		(0)
>>   #endif
>>
>> -
>> +#ifdef CONFIG_CGROUP_MEM_RES_CTLR_KMEM
>> +int do_kmem_account __read_mostly = 1;
>> +#else
>> +#define do_kmem_account		0
>> +#endif
>
>
> Hmm, do we really need this boot option ?
>  From my experience to have swap-accounting boot option,
> this scares us ;) I think config is enough.
>
>
>
>
>>   /*
>>    * Statistics for memory cgroup.
>>    */
>> @@ -270,6 +274,10 @@ struct mem_cgroup {
>>   	 */
>>   	struct res_counter memsw;
>>   	/*
>> +	 * the counter to account for kmem usage.
>> +	 */
>> +	struct res_counter kmem;
>> +	/*
>>   	 * Per cgroup active and inactive list, similar to the
>>   	 * per zone LRU lists.
>>   	 */
>> @@ -321,6 +329,11 @@ struct mem_cgroup {
>>   	 */
>>   	unsigned long 	move_charge_at_immigrate;
>>   	/*
>> +	 * Should kernel memory limits be established independently
>> +	 * from user memory ?
>> +	 */
>> +	int		kmem_independent;
>> +	/*
>>   	 * percpu counter.
>>   	 */
>>   	struct mem_cgroup_stat_cpu *stat;
>> @@ -388,9 +401,14 @@ enum charge_type {
>>   };
>>
>>   /* for encoding cft->private value on file */
>> -#define _MEM			(0)
>> -#define _MEMSWAP		(1)
>> -#define _OOM_TYPE		(2)
>> +
>> +enum mem_type {
>> +	_MEM = 0,
>> +	_MEMSWAP,
>> +	_OOM_TYPE,
>> +	_KMEM,
>> +};
>> +
>
> ok, nice clean up.
>
>
>>   #define MEMFILE_PRIVATE(x, val)	(((x) << 16) | (val))
>>   #define MEMFILE_TYPE(val)	(((val) >> 16) & 0xffff)
>>   #define MEMFILE_ATTR(val)	((val) & 0xffff)
>> @@ -3943,10 +3961,15 @@ static inline u64 mem_cgroup_usage(struct mem_cgroup *mem, bool swap)
>>   	u64 val;
>>
>>   	if (!mem_cgroup_is_root(mem)) {
>> +		val = 0;
>> +		if (!mem->kmem_independent)
>> +			val = res_counter_read_u64(&mem->kmem, RES_USAGE);
>
>>   		if (!swap)
>> -			return res_counter_read_u64(&mem->res, RES_USAGE);
>> +			val += res_counter_read_u64(&mem->res, RES_USAGE);
>>   		else
>> -			return res_counter_read_u64(&mem->memsw, RES_USAGE);
>> +			val += res_counter_read_u64(&mem->memsw, RES_USAGE);
>> +
>> +		return val;
>>   	}
>>
>>   	val = mem_cgroup_recursive_stat(mem, MEM_CGROUP_STAT_CACHE);
>> @@ -3979,6 +4002,10 @@ static u64 mem_cgroup_read(struct cgroup *cont, struct cftype *cft)
>>   		else
>>   			val = res_counter_read_u64(&mem->memsw, name);
>>   		break;
>> +	case _KMEM:
>> +		val = res_counter_read_u64(&mem->kmem, name);
>> +		break;
>> +
>>   	default:
>>   		BUG();
>>   		break;
>> @@ -4756,6 +4783,21 @@ static int mem_cgroup_reset_vmscan_stat(struct cgroup *cgrp,
>>   	return 0;
>>   }
>>
>> +#ifdef CONFIG_CGROUP_MEM_RES_CTLR_KMEM
>> +static u64 kmem_limit_independent_read(struct cgroup *cont, struct cftype *cft)
>> +{
>> +	return mem_cgroup_from_cont(cont)->kmem_independent;
>> +}
>> +
>> +static int kmem_limit_independent_write(struct cgroup *cont, struct cftype *cft,
>> +					u64 val)
>> +{
>> +	cgroup_lock();
>> +	mem_cgroup_from_cont(cont)->kmem_independent = !!val;
>> +	cgroup_unlock();
>
> Hm. This code allows parent and child to have different settings.
> Could you add a parent-child check such as:
>
> "If the parent sets use_hierarchy==1, children must have the same
> kmem_independent value as the parent's."
>
> What do you think ? I think a hierarchy must have the same config.
BTW, Kame:

Look again (I forgot this myself when I first replied to you):
those files only get registered in the root cgroup. So this
shouldn't be a problem, because children won't even be able
to see them.

Do you agree with this ?
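
To make that concrete, the root-only registration amounts to a guard of
roughly this shape (kmem_cgroup_files is an assumed array name here;
cgroup_add_files() is the cgroup interface of this era):

static int register_kmem_files(struct cgroup *cont, struct cgroup_subsys *ss)
{
	/*
	 * Non-root groups get no kmem control files, so the knobs
	 * simply do not exist below the root and cannot diverge.
	 */
	if (cont->parent)
		return 0;
	return cgroup_add_files(cont, ss, kmem_cgroup_files,
				ARRAY_SIZE(kmem_cgroup_files));
}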

>
> BTW... I dislike the naming a little ;)
>
> memory->consolidated/shared/?????_kmem_accounting ?
> Or
> memory->kmem_independent_accounting ?
>
> or some better name ?
>
> Thanks,
> -Kame
>


^ permalink raw reply	[flat|nested] 119+ messages in thread

* Re: [PATCH v3 4/7] per-cgroup tcp buffers control
  2011-09-26 10:59     ` KAMEZAWA Hiroyuki
  (?)
@ 2011-09-27  1:53       ` Glauber Costa
  -1 siblings, 0 replies; 119+ messages in thread
From: Glauber Costa @ 2011-09-27  1:53 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki
  Cc: linux-kernel, paul, lizf, ebiederm, davem, gthelen, netdev,
	linux-mm, kirill

On 09/26/2011 07:59 AM, KAMEZAWA Hiroyuki wrote:
> On Sun, 18 Sep 2011 21:56:42 -0300
> Glauber Costa<glommer@parallels.com>  wrote:
>
>> With all the infrastructure in place, this patch implements
>> per-cgroup control for tcp memory pressure handling.
>>
>> Signed-off-by: Glauber Costa<glommer@parallels.com>
>> CC: David S. Miller<davem@davemloft.net>
>> CC: Hiroyuki Kamezawa<kamezawa.hiroyu@jp.fujitsu.com>
>> CC: Eric W. Biederman<ebiederm@xmission.com>
>
> a comment below.
>
>> +int tcp_init_cgroup(struct proto *prot, struct cgroup *cgrp,
>> +		    struct cgroup_subsys *ss)
>> +{
>> +	struct mem_cgroup *cg = mem_cgroup_from_cont(cgrp);
>> +	unsigned long limit;
>> +
>> +	cg->tcp_memory_pressure = 0;
>> +	atomic_long_set(&cg->tcp_memory_allocated, 0);
>> +	percpu_counter_init(&cg->tcp_sockets_allocated, 0);
>> +
>> +	limit = nr_free_buffer_pages() / 8;
>> +	limit = max(limit, 128UL);
>> +
>> +	cg->tcp_prot_mem[0] = sysctl_tcp_mem[0];
>> +	cg->tcp_prot_mem[1] = sysctl_tcp_mem[1];
>> +	cg->tcp_prot_mem[2] = sysctl_tcp_mem[2];
>> +
>
> Then, the parameters don't inherit the parent's values ?
>
> I think sockets_populate should pass 'parent' and
>
>
> I think you should have a function
>
>      mem_cgroup_should_inherit_parent_settings(parent)
>
> (This is because you made this feature a part of memcg,
>   so please provide the expected behavior.)
>
> Thanks,
> -Kame

Kame: Another look into this:

sysctl_tcp_mem is a global value, unless you have different namespaces.
So it is either global anyway, or should come from the namespace, not 
the parent.

Now, the goal here is to set the maximum possible value for those 
fields. That, indeed, should come from the parent.

That's my understanding...
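
A sketch of that understanding in code follows; tcp_prot_mem[] and
sysctl_tcp_mem come from the patches, while the parent clamp is the
behavior proposed here:

static void tcp_cgroup_inherit_limits(struct mem_cgroup *cg,
				      struct mem_cgroup *parent)
{
	int i;

	for (i = 0; i < 3; i++) {
		/* start from the global (or per-netns) sysctl value... */
		long val = sysctl_tcp_mem[i];

		/* ...but never let a child exceed its parent's ceiling */
		if (parent)
			val = min_t(long, val, parent->tcp_prot_mem[i]);
		cg->tcp_prot_mem[i] = val;
	}
}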


^ permalink raw reply	[flat|nested] 119+ messages in thread

* Re: [PATCH v3 1/7] Basic kernel memory functionality for the Memory Controller
  2011-09-24 14:43         ` Glauber Costa
@ 2011-09-27 10:06           ` Balbir Singh
  -1 siblings, 0 replies; 119+ messages in thread
From: Balbir Singh @ 2011-09-27 10:06 UTC (permalink / raw)
  To: Glauber Costa
  Cc: linux-kernel, paul, lizf, kamezawa.hiroyu, ebiederm, davem,
	gthelen, netdev, linux-mm, kirill, Ying Han

>> I know we have a lot of pending xxx_from_cont() and struct cgroup
>> *cont usages; can we move them to the memcg notation to be more
>> consistent with our usage? There is a patch to convert the old usage.
>>
>
> Hello Balbir, I missed this comment. What exactly do you propose in this
> patch, since I have to assume that the patch you talk about is not applied?
> Is it just a change to the parameter name that you propose?
>

Yes, it is a patch posted on linux-mm by raghavendra

Balbir Singh
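
For context, the conversion being referenced is a rename/retype pass of
roughly this shape; mem_cgroup_from_cont() matches the helper used
elsewhere in this series, and the rest is illustrative:

/* Convert from the raw cgroup pointer once, at the subsystem boundary... */
static struct mem_cgroup *mem_cgroup_from_cont(struct cgroup *cont)
{
	return container_of(cgroup_subsys_state(cont, mem_cgroup_subsys_id),
			    struct mem_cgroup, css);
}

/* ...then pass "struct mem_cgroup *memcg" around instead of "cont". */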

^ permalink raw reply	[flat|nested] 119+ messages in thread

* Re: [PATCH v3 2/7] socket: initial cgroup code.
  2011-09-26 10:52               ` KAMEZAWA Hiroyuki
@ 2011-09-27 20:43                 ` Glauber Costa
  -1 siblings, 0 replies; 119+ messages in thread
From: Glauber Costa @ 2011-09-27 20:43 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki
  Cc: Balbir Singh, Greg Thelen, linux-kernel, paul, lizf, ebiederm,
	davem, netdev, linux-mm, kirill

[-- Attachment #1: Type: text/plain, Size: 2758 bytes --]

On 09/26/2011 07:52 AM, KAMEZAWA Hiroyuki wrote:
> On Sat, 24 Sep 2011 11:45:04 -0300
> Glauber Costa<glommer@parallels.com>  wrote:
>
>> On 09/22/2011 12:09 PM, Balbir Singh wrote:
>>> On Thu, Sep 22, 2011 at 11:30 AM, Greg Thelen<gthelen@google.com>   wrote:
>>>> On Wed, Sep 21, 2011 at 11:59 AM, Glauber Costa<glommer@parallels.com>   wrote:
>>>>> Right now I am working under the assumption that tasks are long lived inside
>>>>> the cgroup. Migration potentially introduces some nasty locking problems in
>>>>> the mem_schedule path.
>>>>>
>>>>> Also, unless I am missing something, the memcg already has the policy of
>>>>> not carrying charges around, probably because of this very same complexity.
>>>>>
>>>>> True that at least it won't EBUSY you... But I think this is at least a way
>>>>> to guarantee that the cgroup under our nose won't disappear in the middle of
>>>>> our allocations.
>>>>
>>>> Here's the memcg user page behavior using the same pattern:
>>>>
>>>> 1. user page P is allocate by task T in memcg M1
>>>> 2. T is moved to memcg M2.  The P charge is left behind still charged
>>>> to M1 if memory.move_charge_at_immigrate=0; or the charge is moved to
>>>> M2 if memory.move_charge_at_immigrate=1.
>>>> 3. rmdir M1 will try to reclaim P (if P was left in M1).  If unable to
>>>> reclaim, then P is recharged to parent(M1).
>>>>
>>>
>>> We also have some magic in page_referenced() to remove pages
>>> referenced from different containers. What we do is try not to
>>> penalize a cgroup if another cgroup is referencing this page and the
>>> page under consideration is being reclaimed from the cgroup that
>>> touched it.
>>>
>>> Balbir Singh
>> Do you guys see it as a showstopper for this series to be merged, or can
>> we just TODO it ?
>>
>
> In my experience, 'I can't rmdir cgroup.' is always an important/difficult
> problem. The users cannot know where the accounting is leaking other than
> kmem.usage_in_bytes or memory.usage_in_bytes, and can't fix the issue.
>
> please add EXPERIMENTAL to Kconfig until this is fixed.
>
>> I can push a proposal for it, but it would be done in a separate patch
>> anyway. Also, we may be in better conditions to fix this when the slab
>> part is merged - since it will likely have the same problems...
>>
>
> Yes. Considering sockets can be shared between tasks (cgroups),
> you'll eventually need
>    - the owner task of the socket
>    - an account-moving callback
>
> Or disallow task moving once accounted.
>

So,

I tried to come up with proper task charge moving here, and the locking 
easily gets quite complicated. (But I have the feeling I am overlooking 
something...) So I think I'll really need more time for that.

What do you guys think of the following patch, + EXPERIMENTAL ?


[-- Attachment #2: foo.patch --]
[-- Type: text/plain, Size: 3232 bytes --]

diff --git a/include/net/tcp.h b/include/net/tcp.h
index f784cb7..684c090 100644
--- a/include/net/tcp.h
+++ b/include/net/tcp.h
@@ -257,6 +257,7 @@ struct mem_cgroup;
 struct tcp_memcontrol {
 	/* per-cgroup tcp memory pressure knobs */
 	int tcp_max_memory;
+	atomic_t refcnt;
 	atomic_long_t tcp_memory_allocated;
 	struct percpu_counter tcp_sockets_allocated;
 	/* those two are read-mostly, leave them at the end */
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 6937f20..b594a9a 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -361,34 +361,21 @@ static struct mem_cgroup *parent_mem_cgroup(struct mem_cgroup *mem);
 
 void sock_update_memcg(struct sock *sk)
 {
-	/* right now a socket spends its whole life in the same cgroup */
-	BUG_ON(sk->sk_cgrp);
-
 	rcu_read_lock();
 	sk->sk_cgrp = mem_cgroup_from_task(current);
-
-	/*
-	 * We don't need to protect against anything task-related, because
-	 * we are basically stuck with the sock pointer that won't change,
-	 * even if the task that originated the socket changes cgroups.
-	 *
-	 * What we do have to guarantee, is that the chain leading us to
-	 * the top level won't change under our noses. Incrementing the
-	 * reference count via cgroup_exclude_rmdir guarantees that.
-	 */
-	cgroup_exclude_rmdir(mem_cgroup_css(sk->sk_cgrp));
 	rcu_read_unlock();
 }
 
 void sock_release_memcg(struct sock *sk)
 {
-	cgroup_release_and_wakeup_rmdir(mem_cgroup_css(sk->sk_cgrp));
 }
 
 void memcg_sock_mem_alloc(struct mem_cgroup *mem, struct proto *prot,
 			  int amt, int *parent_failure)
 {
+	atomic_inc(&mem->tcp.refcnt);
 	mem = parent_mem_cgroup(mem);
+
 	for (; mem != NULL; mem = parent_mem_cgroup(mem)) {
 		long alloc;
 		long *prot_mem = prot->prot_mem(mem);
@@ -406,9 +393,12 @@ EXPORT_SYMBOL(memcg_sock_mem_alloc);
 
 void memcg_sock_mem_free(struct mem_cgroup *mem, struct proto *prot, int amt)
 {
-	mem = parent_mem_cgroup(mem);
-	for (; mem != NULL; mem = parent_mem_cgroup(mem))
-		atomic_long_sub(amt, prot->memory_allocated(mem));
+	struct mem_cgroup *parent;
+	parent = parent_mem_cgroup(mem);
+	for (; parent != NULL; parent = parent_mem_cgroup(parent))
+		atomic_long_sub(amt, prot->memory_allocated(parent));
+
+	atomic_dec(&mem->tcp.refcnt);
 }
 EXPORT_SYMBOL(memcg_sock_mem_free);
 
@@ -541,6 +531,7 @@ int tcp_init_cgroup(struct proto *prot, struct cgroup *cgrp,
 
 	cg->tcp.tcp_memory_pressure = 0;
 	atomic_long_set(&cg->tcp.tcp_memory_allocated, 0);
+	atomic_set(&cg->tcp.refcnt, 0);
 	percpu_counter_init(&cg->tcp.tcp_sockets_allocated, 0);
 
 	limit = nr_free_buffer_pages() / 8;
@@ -5787,6 +5778,9 @@ static int mem_cgroup_can_attach(struct cgroup_subsys *ss,
 	int ret = 0;
 	struct mem_cgroup *mem = mem_cgroup_from_cont(cgroup);
 
+	if (atomic_read(&mem->tcp.refcnt))
+		return 1;
+
 	if (mem->move_charge_at_immigrate) {
 		struct mm_struct *mm;
 		struct mem_cgroup *from = mem_cgroup_from_task(p);
@@ -5957,6 +5951,11 @@ static int mem_cgroup_can_attach(struct cgroup_subsys *ss,
 				struct cgroup *cgroup,
 				struct task_struct *p)
 {
+	struct mem_cgroup *mem = mem_cgroup_from_cont(cgroup);
+
+	if (atomic_read(&mem->tcp.refcnt))
+		return 1;
+
 	return 0;
 }
 static void mem_cgroup_cancel_attach(struct cgroup_subsys *ss,
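
One note on the attached sketch: cgroup ->can_attach() callbacks
conventionally return 0 or a negative errno rather than 1, so the
refcnt guard would more idiomatically read (same names as in the patch
above):

static int mem_cgroup_can_attach(struct cgroup_subsys *ss,
				 struct cgroup *cgroup,
				 struct task_struct *p)
{
	struct mem_cgroup *mem = mem_cgroup_from_cont(cgroup);

	/* refuse to move tasks while TCP buffers are still charged here */
	if (atomic_read(&mem->tcp.refcnt))
		return -EBUSY;

	return 0;
}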

^ permalink raw reply related	[flat|nested] 119+ messages in thread

* Re: [PATCH v3 2/7] socket: initial cgroup code.
  2011-09-26 22:47                 ` Glauber Costa
@ 2011-09-28  0:56                   ` KAMEZAWA Hiroyuki
  -1 siblings, 0 replies; 119+ messages in thread
From: KAMEZAWA Hiroyuki @ 2011-09-28  0:56 UTC (permalink / raw)
  To: Glauber Costa
  Cc: Balbir Singh, Greg Thelen, linux-kernel, paul, lizf, ebiederm,
	davem, netdev, linux-mm, kirill

On Mon, 26 Sep 2011 19:47:24 -0300
Glauber Costa <glommer@parallels.com> wrote:

> On 09/26/2011 07:52 AM, KAMEZAWA Hiroyuki wrote:
> > On Sat, 24 Sep 2011 11:45:04 -0300
> > Glauber Costa<glommer@parallels.com>  wrote:
> >
> >> On 09/22/2011 12:09 PM, Balbir Singh wrote:
> >>> On Thu, Sep 22, 2011 at 11:30 AM, Greg Thelen<gthelen@google.com>   wrote:
> >>>> On Wed, Sep 21, 2011 at 11:59 AM, Glauber Costa<glommer@parallels.com>   wrote:
> >>>>> Right now I am working under the assumption that tasks are long lived inside
> >>>>> the cgroup. Migration potentially introduces some nasty locking problems in
> >>>>> the mem_schedule path.
> >>>>>
> >>>>> Also, unless I am missing something, the memcg already has the policy of
> >>>>> not carrying charges around, probably because of this very same complexity.
> >>>>>
> >>>>> True that at least it won't EBUSY you... But I think this is at least a way
> >>>>> to guarantee that the cgroup under our nose won't disappear in the middle of
> >>>>> our allocations.
> >>>>
> >>>> Here's the memcg user page behavior using the same pattern:
> >>>>
> >>>> 1. user page P is allocate by task T in memcg M1
> >>>> 2. T is moved to memcg M2.  The P charge is left behind still charged
> >>>> to M1 if memory.move_charge_at_immigrate=0; or the charge is moved to
> >>>> M2 if memory.move_charge_at_immigrate=1.
> >>>> 3. rmdir M1 will try to reclaim P (if P was left in M1).  If unable to
> >>>> reclaim, then P is recharged to parent(M1).
> >>>>
> >>>
> >>> We also have some magic in page_referenced() to remove pages
> >>> referenced from different containers. What we do is try not to
> >>> penalize a cgroup if another cgroup is referencing this page and the
> >>> page under consideration is being reclaimed from the cgroup that
> >>> touched it.
> >>>
> >>> Balbir Singh
> >> Do you guys see it as a showstopper for this series to be merged, or can
> >> we just TODO it ?
> >>
> >
> > In my experience, 'I can't rmdir cgroup.' is always an important/difficult
> > problem. The users cannot know where the accounting is leaking other than
> > kmem.usage_in_bytes or memory.usage_in_bytes, and can't fix the issue.
> >
> > please add EXPERIMENTAL to Kconfig until this is fixed.
> 
> I am working on something here that may allow it.
> But I think it is independent of the rest, and I can repost the series 
> fixing the problems raised here without it, + EXPERIMENTAL.
> 
> Btw, using EXPERIMENTAL here is a very good idea. I think that we should
> keep EXPERIMENTAL on even if a fix for that exists, for at least a couple
> of months until we see how this thing really evolves.
> 
> What do you think?
> 

Yes, I think so. IIRC, SWAP accounting was EXPERIMENTAL for a year.

> >> I can push a proposal for it, but it would be done in a separate patch
> >> anyway. Also, we may be in better conditions to fix this when the slab
> >> part is merged - since it will likely have the same problems...
> >>
> >
> > Yes. Considering sockets can be shared between tasks (cgroups),
> > you'll eventually need
> >    - the owner task of the socket
> >    - an account-moving callback
> >
> > Or disallow task moving once accounted.
> 
> I personally think disallowing task movement once accounted is 
> reasonable. At least for starters.
> 

Hmm. I'm ok with that... but I'm not very sure how much trouble that will be.
So, please make it debuggable why a task cannot be moved.

Thanks,
-Kame
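
For reference, marking the option experimental is a one-line Kconfig
dependency on top of the option introduced in patch 1/7 (EXPERIMENTAL
was a real Kconfig symbol at the time):

config CGROUP_MEM_RES_CTLR_KMEM
	bool "Memory Resource Controller Kernel Memory accounting (EXPERIMENTAL)"
	depends on CGROUP_MEM_RES_CTLR && EXPERIMENTAL
	default y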








^ permalink raw reply	[flat|nested] 119+ messages in thread

* Re: [PATCH v3 1/7] Basic kernel memory functionality for the Memory Controller
  2011-09-26 23:18       ` Glauber Costa
  (?)
@ 2011-09-28  0:58         ` KAMEZAWA Hiroyuki
  -1 siblings, 0 replies; 119+ messages in thread
From: KAMEZAWA Hiroyuki @ 2011-09-28  0:58 UTC (permalink / raw)
  To: Glauber Costa
  Cc: linux-kernel, paul, lizf, ebiederm, davem, gthelen, netdev,
	linux-mm, kirill

On Mon, 26 Sep 2011 20:18:39 -0300
Glauber Costa <glommer@parallels.com> wrote:

> On 09/26/2011 07:34 AM, KAMEZAWA Hiroyuki wrote:
> > On Sun, 18 Sep 2011 21:56:39 -0300
> > Glauber Costa<glommer@parallels.com>  wrote:
"If parent sets use_hierarchy==1, children must have the same kmem_independent value
> > with parant's one."
> >
> > How do you think ? I think a hierarchy must have the same config.
> BTW, Kame:
> 
> Look again (I forgot this myself when I first replied to you):
> those files only get registered in the root cgroup. So this
> shouldn't be a problem, because children won't even be able
> to see them.
> 
> Do you agree with this ?
> 

agreed.

Thanks,
-Kame


^ permalink raw reply	[flat|nested] 119+ messages in thread

* Re: [PATCH v3 4/7] per-cgroup tcp buffers control
  2011-09-27  1:53       ` Glauber Costa
@ 2011-09-28  1:09         ` KAMEZAWA Hiroyuki
  -1 siblings, 0 replies; 119+ messages in thread
From: KAMEZAWA Hiroyuki @ 2011-09-28  1:09 UTC (permalink / raw)
  To: Glauber Costa
  Cc: linux-kernel, paul, lizf, ebiederm, davem, gthelen, netdev,
	linux-mm, kirill

On Mon, 26 Sep 2011 22:53:05 -0300
Glauber Costa <glommer@parallels.com> wrote:

> On 09/26/2011 07:59 AM, KAMEZAWA Hiroyuki wrote:
> > On Sun, 18 Sep 2011 21:56:42 -0300
> > Glauber Costa<glommer@parallels.com>  wrote:
> >
> >> With all the infrastructure in place, this patch implements
> >> per-cgroup control for tcp memory pressure handling.
> >>
> >> Signed-off-by: Glauber Costa<glommer@parallels.com>
> >> CC: David S. Miller<davem@davemloft.net>
> >> CC: Hiroyuki Kamezawa<kamezawa.hiroyu@jp.fujitsu.com>
> >> CC: Eric W. Biederman<ebiederm@xmission.com>
> >
> > a comment below.
> >
> >> +int tcp_init_cgroup(struct proto *prot, struct cgroup *cgrp,
> >> +		    struct cgroup_subsys *ss)
> >> +{
> >> +	struct mem_cgroup *cg = mem_cgroup_from_cont(cgrp);
> >> +	unsigned long limit;
> >> +
> >> +	cg->tcp_memory_pressure = 0;
> >> +	atomic_long_set(&cg->tcp_memory_allocated, 0);
> >> +	percpu_counter_init(&cg->tcp_sockets_allocated, 0);
> >> +
> >> +	limit = nr_free_buffer_pages() / 8;
> >> +	limit = max(limit, 128UL);
> >> +
> >> +	cg->tcp_prot_mem[0] = sysctl_tcp_mem[0];
> >> +	cg->tcp_prot_mem[1] = sysctl_tcp_mem[1];
> >> +	cg->tcp_prot_mem[2] = sysctl_tcp_mem[2];
> >> +
> >
> > Then, the parameters don't inherit the parent's values ?
> >
> > I think sockets_populate should pass 'parent' and
> >
> >
> > I think you should have a function
> >
> >      mem_cgroup_should_inherit_parent_settings(parent)
> >
> > (This is because you made this feature a part of memcg,
> >   so please provide the expected behavior.)
> >
> > Thanks,
> > -Kame
> 
> Kame: Another look into this:
> 
> sysctl_tcp_mem is a global value, unless you have different namespaces.
> So it is either global anyway, or should come from the namespace, not 
> the parent.
> 
> Now, the goal here is to set the maximum possible value for those 
> fields. That, indeed, should come from the parent.
> 
> That's my understanding...
> 
Hmm, I may be misunderstanding something. If this isn't a value you want to
limit via memcg's kmem_limit, that's ok.
Maybe memcg should just take care of kmem_limit.

Thanks,
-Kame

^ permalink raw reply	[flat|nested] 119+ messages in thread
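
To make the inheritance question concrete, here is a small userspace sketch
of the behavior Glauber describes: a child starts from the (global or
per-netns) sysctl_tcp_mem thresholds, but is clamped so it never exceeds
its parent cgroup's values. The tcp_cg struct and tcp_cg_init() are
hypothetical stand-ins for the patch's fields, not its actual code.

#include <stdio.h>

static unsigned long sysctl_tcp_mem[3] = { 196608, 262144, 393216 };

struct tcp_cg {
	unsigned long tcp_prot_mem[3];	/* pressure, soft, hard */
};

static void tcp_cg_init(struct tcp_cg *cg, const struct tcp_cg *parent)
{
	int i;

	for (i = 0; i < 3; i++) {
		unsigned long v = sysctl_tcp_mem[i];

		/* never let a child see more than its parent allows */
		if (parent && v > parent->tcp_prot_mem[i])
			v = parent->tcp_prot_mem[i];
		cg->tcp_prot_mem[i] = v;
	}
}

int main(void)
{
	struct tcp_cg root, child;

	tcp_cg_init(&root, NULL);
	root.tcp_prot_mem[2] = 100000;	/* admin lowers the hard limit */
	tcp_cg_init(&child, &root);

	/* prints 100000, not the global 393216 */
	printf("child hard limit: %lu\n", child.tcp_prot_mem[2]);
	return 0;
}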

* Re: [PATCH v3 6/7] tcp buffer limitation: per-cgroup limit
  2011-09-24 16:58     ` Andi Kleen
@ 2011-09-28  2:29       ` Balbir Singh
  -1 siblings, 0 replies; 119+ messages in thread
From: Balbir Singh @ 2011-09-28  2:29 UTC (permalink / raw)
  To: Andi Kleen
  Cc: Glauber Costa, linux-kernel, paul, lizf, kamezawa.hiroyu,
	ebiederm, davem, gthelen, netdev, linux-mm, kirill

On Sat, Sep 24, 2011 at 10:28 PM, Andi Kleen <andi@firstfloor.org> wrote:
> Glauber Costa <glommer@parallels.com> writes:
>
>> This patch uses the "tcp_max_mem" field of the kmem_cgroup to
>> effectively control the amount of kernel memory pinned by a cgroup.
>>
>> We have to make sure that none of the memory pressure thresholds
>> specified in the namespace are bigger than the current cgroup's.
>
> I noticed that some other OSes known to bash's ulimit seem to have a
> per-process rlimit for this. Would that make sense too? Not sure how
> difficult your infrastructure would be to extend to that.

A per-process rlimit for tcp usage? Interesting. That reminds me, we
need to revisit rlimit (RSS) at some point.

Balbir Singh

^ permalink raw reply	[flat|nested] 119+ messages in thread

* Re: [PATCH v3 6/7] tcp buffer limitation: per-cgroup limit
  2011-09-28  2:29       ` Balbir Singh
@ 2011-09-28  3:06         ` Andi Kleen
  -1 siblings, 0 replies; 119+ messages in thread
From: Andi Kleen @ 2011-09-28  3:06 UTC (permalink / raw)
  To: Balbir Singh
  Cc: Andi Kleen, Glauber Costa, linux-kernel, paul, lizf,
	kamezawa.hiroyu, ebiederm, davem, gthelen, netdev, linux-mm,
	kirill

On Wed, Sep 28, 2011 at 07:59:31AM +0530, Balbir Singh wrote:
> On Sat, Sep 24, 2011 at 10:28 PM, Andi Kleen <andi@firstfloor.org> wrote:
> > Glauber Costa <glommer@parallels.com> writes:
> >
> >> This patch uses the "tcp_max_mem" field of the kmem_cgroup to
> >> effectively control the amount of kernel memory pinned by a cgroup.
> >>
> >> We have to make sure that none of the memory pressure thresholds
> >> specified in the namespace are bigger than the current cgroup's.
> >
> > I noticed that some other OSes known to bash's ulimit seem to have a
> > per-process rlimit for this. Would that make sense too? Not sure how
> > difficult your infrastructure would be to extend to that.
> 
> A per-process rlimit for tcp usage? Interesting. That reminds me, we
> need to revisit rlimit (RSS) at some point.

I would love to have that for some situations!
-Andi

-- 
ak@linux.intel.com -- Speaking for myself only.

^ permalink raw reply	[flat|nested] 119+ messages in thread
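
For reference, the per-process knob mused about above would take the shape
of an rlimit. RLIMIT_RSS does exist in the Linux API today, but modern
kernels accept it without enforcing it, and a TCP-buffer resource would
need a new, hypothetical constant. A sketch of the call shape only:

#include <stdio.h>
#include <sys/resource.h>

int main(void)
{
	struct rlimit rl;

	if (getrlimit(RLIMIT_RSS, &rl) == 0)
		printf("RLIMIT_RSS cur=%llu max=%llu\n",
		       (unsigned long long)rl.rlim_cur,
		       (unsigned long long)rl.rlim_max);

	/* Lower the soft limit; accepted, but not enforced on Linux. */
	rl.rlim_cur = 64UL * 1024 * 1024;
	if (setrlimit(RLIMIT_RSS, &rl) != 0)
		perror("setrlimit");
	return 0;
}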

* Re: [PATCH v3 1/7] Basic kernel memory functionality for the Memory Controller
  2011-09-28  0:58         ` KAMEZAWA Hiroyuki
  (?)
@ 2011-09-28 12:03           ` Glauber Costa
  -1 siblings, 0 replies; 119+ messages in thread
From: Glauber Costa @ 2011-09-28 12:03 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki
  Cc: linux-kernel, paul, lizf, ebiederm, davem, gthelen, netdev,
	linux-mm, kirill

On 09/27/2011 09:58 PM, KAMEZAWA Hiroyuki wrote:
> On Mon, 26 Sep 2011 20:18:39 -0300
> Glauber Costa <glommer@parallels.com> wrote:
>
>> On 09/26/2011 07:34 AM, KAMEZAWA Hiroyuki wrote:
>>> On Sun, 18 Sep 2011 21:56:39 -0300
>>> Glauber Costa <glommer@parallels.com> wrote:
>>> "If parent sets use_hierarchy==1, children must have the same
>>> kmem_independent value as the parent's."
>>>
>>> What do you think? I think a hierarchy must have the same config.
>> BTW, Kame:
>>
>> Look again (I had forgotten this myself when I first replied to you):
>> only in the root cgroup do those files get registered.
>> So it shouldn't be a problem, because children won't even
>> be able to see them.
>>
>> Do you agree with this ?
>>
>
> agreed.
>

Actually it is the other way around, following previous suggestions...

The root cgroup does *not* get those files registered, since we don't
intend to impose any kernel memory limits on it. The other cgroups do.
Given that, I will go ahead and write some code to respect the parent
cgroup's hierarchy.

^ permalink raw reply	[flat|nested] 119+ messages in thread
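
A minimal sketch of the registration rule described above: the kmem.*
control files are created for every cgroup except the root, which takes no
kernel memory limits. register_kmem_files() and the struct here are
hypothetical stand-ins for the cgroup populate interface of the time.

#include <stdio.h>

struct cgroup {
	struct cgroup *parent;	/* NULL for the root cgroup */
	const char *name;
};

static void register_kmem_files(struct cgroup *cgrp)
{
	if (!cgrp->parent) {
		/* root: no kmem limits, so no files to expose */
		printf("%s: skipping kmem.* files\n", cgrp->name);
		return;
	}
	printf("%s: creating kmem.tcp_maxmem and friends\n", cgrp->name);
}

int main(void)
{
	struct cgroup root = { .parent = NULL, .name = "root" };
	struct cgroup child = { .parent = &root, .name = "container0" };

	register_kmem_files(&root);	/* skipped */
	register_kmem_files(&child);	/* gets the files */
	return 0;
}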


end of thread, other threads:[~2011-09-28 12:04 UTC | newest]

Thread overview: 119+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2011-09-19  0:56 [PATCH v3 0/7] per-cgroup tcp buffer pressure settings Glauber Costa
2011-09-19  0:56 ` Glauber Costa
2011-09-19  0:56 ` [PATCH v3 1/7] Basic kernel memory functionality for the Memory Controller Glauber Costa
2011-09-19  0:56   ` Glauber Costa
2011-09-21  2:23   ` Glauber Costa
2011-09-21  2:23     ` Glauber Costa
2011-09-21  2:23     ` Glauber Costa
2011-09-22  3:17     ` Balbir Singh
2011-09-22  3:17       ` Balbir Singh
2011-09-22  3:19       ` Glauber Costa
2011-09-22  3:19         ` Glauber Costa
2011-09-22  3:19         ` Glauber Costa
2011-09-24 14:43       ` Glauber Costa
2011-09-24 14:43         ` Glauber Costa
2011-09-24 14:43         ` Glauber Costa
2011-09-27 10:06         ` Balbir Singh
2011-09-27 10:06           ` Balbir Singh
2011-09-22  5:58   ` Greg Thelen
2011-09-22  5:58     ` Greg Thelen
2011-09-26 10:34   ` KAMEZAWA Hiroyuki
2011-09-26 10:34     ` KAMEZAWA Hiroyuki
2011-09-26 22:44     ` Glauber Costa
2011-09-26 22:44       ` Glauber Costa
2011-09-26 22:44       ` Glauber Costa
2011-09-26 23:18     ` Glauber Costa
2011-09-26 23:18       ` Glauber Costa
2011-09-26 23:18       ` Glauber Costa
2011-09-28  0:58       ` KAMEZAWA Hiroyuki
2011-09-28  0:58         ` KAMEZAWA Hiroyuki
2011-09-28  0:58         ` KAMEZAWA Hiroyuki
2011-09-28 12:03         ` Glauber Costa
2011-09-28 12:03           ` Glauber Costa
2011-09-28 12:03           ` Glauber Costa
2011-09-19  0:56 ` [PATCH v3 2/7] socket: initial cgroup code Glauber Costa
2011-09-19  0:56   ` Glauber Costa
2011-09-21 18:47   ` Greg Thelen
2011-09-21 18:47     ` Greg Thelen
2011-09-21 18:59     ` Glauber Costa
2011-09-21 18:59       ` Glauber Costa
2011-09-21 18:59       ` Glauber Costa
2011-09-22  6:00       ` Greg Thelen
2011-09-22  6:00         ` Greg Thelen
2011-09-22 15:09         ` Balbir Singh
2011-09-22 15:09           ` Balbir Singh
2011-09-24 13:33           ` Glauber Costa
2011-09-24 13:33             ` Glauber Costa
2011-09-24 13:33             ` Glauber Costa
2011-09-24 13:40           ` Glauber Costa
2011-09-24 13:40             ` Glauber Costa
2011-09-24 13:40             ` Glauber Costa
2011-09-24 14:45           ` Glauber Costa
2011-09-24 14:45             ` Glauber Costa
2011-09-24 14:45             ` Glauber Costa
2011-09-26 10:52             ` KAMEZAWA Hiroyuki
2011-09-26 10:52               ` KAMEZAWA Hiroyuki
2011-09-26 10:52               ` KAMEZAWA Hiroyuki
2011-09-26 22:47               ` Glauber Costa
2011-09-26 22:47                 ` Glauber Costa
2011-09-26 22:47                 ` Glauber Costa
2011-09-28  0:56                 ` KAMEZAWA Hiroyuki
2011-09-28  0:56                   ` KAMEZAWA Hiroyuki
2011-09-27 20:43               ` Glauber Costa
2011-09-27 20:43                 ` Glauber Costa
2011-09-19  0:56 ` [PATCH v3 3/7] foundations of per-cgroup memory pressure controlling Glauber Costa
2011-09-19  0:56   ` Glauber Costa
2011-09-19  0:56 ` [PATCH v3 4/7] per-cgroup tcp buffers control Glauber Costa
2011-09-19  0:56   ` Glauber Costa
2011-09-26 10:59   ` KAMEZAWA Hiroyuki
2011-09-26 10:59     ` KAMEZAWA Hiroyuki
2011-09-26 22:48     ` Glauber Costa
2011-09-26 22:48       ` Glauber Costa
2011-09-26 22:48       ` Glauber Costa
2011-09-27  1:53     ` Glauber Costa
2011-09-27  1:53       ` Glauber Costa
2011-09-27  1:53       ` Glauber Costa
2011-09-28  1:09       ` KAMEZAWA Hiroyuki
2011-09-28  1:09         ` KAMEZAWA Hiroyuki
2011-09-26 14:39   ` Andrew Vagin
2011-09-26 14:39     ` Andrew Vagin
2011-09-26 22:52     ` Glauber Costa
2011-09-26 22:52       ` Glauber Costa
2011-09-26 22:52       ` Glauber Costa
2011-09-19  0:56 ` [PATCH v3 5/7] per-netns ipv4 sysctl_tcp_mem Glauber Costa
2011-09-19  0:56   ` Glauber Costa
2011-09-19  0:56 ` [PATCH v3 6/7] tcp buffer limitation: per-cgroup limit Glauber Costa
2011-09-19  0:56   ` Glauber Costa
2011-09-22  6:01   ` Greg Thelen
2011-09-22  6:01     ` Greg Thelen
2011-09-22  9:58     ` Kirill A. Shutemov
2011-09-22  9:58       ` Kirill A. Shutemov
2011-09-22  9:58       ` Kirill A. Shutemov
2011-09-22 15:44       ` Greg Thelen
2011-09-22 15:44         ` Greg Thelen
2011-09-24 13:30     ` Glauber Costa
2011-09-24 13:30       ` Glauber Costa
2011-09-24 13:30       ` Glauber Costa
2011-09-26 11:02       ` KAMEZAWA Hiroyuki
2011-09-26 11:02         ` KAMEZAWA Hiroyuki
2011-09-26 11:02         ` KAMEZAWA Hiroyuki
2011-09-26 22:49         ` Glauber Costa
2011-09-26 22:49           ` Glauber Costa
2011-09-26 22:49           ` Glauber Costa
2011-09-22 23:08   ` Balbir Singh
2011-09-22 23:08     ` Balbir Singh
2011-09-24 13:35     ` Glauber Costa
2011-09-24 13:35       ` Glauber Costa
2011-09-24 13:35       ` Glauber Costa
2011-09-24 16:58   ` Andi Kleen
2011-09-24 16:58     ` Andi Kleen
2011-09-24 16:58     ` Andi Kleen
2011-09-24 17:27     ` Glauber Costa
2011-09-24 17:27       ` Glauber Costa
2011-09-24 17:27       ` Glauber Costa
2011-09-28  2:29     ` Balbir Singh
2011-09-28  2:29       ` Balbir Singh
2011-09-28  3:06       ` Andi Kleen
2011-09-28  3:06         ` Andi Kleen
2011-09-19  0:56 ` [PATCH v3 7/7] Display current tcp memory allocation in kmem cgroup Glauber Costa
2011-09-19  0:56   ` Glauber Costa
