* [PATCH v2 0/9] per-cgroup tcp buffers limitation
@ 2011-09-07  4:23 ` Glauber Costa
  0 siblings, 0 replies; 59+ messages in thread
From: Glauber Costa @ 2011-09-07  4:23 UTC
  To: linux-kernel; +Cc: linux-mm, containers, netdev, xemul, Glauber Costa

This patch series introduces per-cgroup limits on TCP buffers. It allows
sysadmins to specify a maximum amount of kernel memory that
TCP connections can use at any point in time. TCP is the main interest
in this work, but extending it to other protocols would be easy.

For this to work, I am introducing kmem_cgroup, a cgroup targeted
at tracking and controlling objects in kernel memory. Since such objects
are usually not found at page granularity, and are fundamentally
different from userspace memory (not swappable, can't be overcommitted),
I am proposing that they live in their own cgroup rather than
in the memory controller.

It piggybacks on the memory control mechanism already present in
/proc/sys/net/ipv4/tcp_mem. There is a soft limit, and a hard limit
that suppresses allocation when reached. For each cgroup, however,
the file kmem.tcp_maxmem will be used to cap those values.

The usage I have in mind here is containers. Each container will
define its own values for soft and hard limits, but none of them can
be bigger than the value the box's sysadmin specified from
the outside.
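
To make the capping rule above concrete, here is a minimal userspace
sketch. It is not code from this series: the struct, the function name
and the choice to clamp silently (rather than, say, reject the write) are
all assumptions made for illustration. Only the rule itself, that no
container value may exceed the cap set from outside, comes from the text.

```c
#include <assert.h>

/* Hypothetical model of one container's limits; names are illustrative. */
struct tcp_limits {
	long soft;	/* pressure threshold */
	long hard;	/* allocation is suppressed once this is reached */
};

/*
 * Clamp the values a container asked for to the cap the sysadmin set
 * from outside (the role kmem.tcp_maxmem plays in the cover letter).
 * Assumption: all three numbers are expressed in the same unit.
 */
static struct tcp_limits clamp_tcp_limits(struct tcp_limits req, long maxmem)
{
	struct tcp_limits out = req;

	assert(maxmem >= 0);
	if (out.hard > maxmem)
		out.hard = maxmem;
	if (out.soft > out.hard)
		out.soft = out.hard;
	return out;
}
```

With a cap of 150, a request of soft=100/hard=200 would come back as
soft=100/hard=150, while a request already below the cap is unchanged.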

To test for any performance impacts of this patch, I used netperf's
TCP_RR benchmark on localhost, so we can have both recv and snd in action.

Command line used was ./src/netperf -t TCP_RR -H localhost, and the
results:

Without the patch
=================

Socket Size   Request  Resp.   Elapsed  Trans.
Send   Recv   Size     Size    Time     Rate
bytes  Bytes  bytes    bytes   secs.    per sec

16384  87380  1        1       10.00    26996.35
16384  87380

With the patch
===============

Local /Remote
Socket Size   Request  Resp.   Elapsed  Trans.
Send   Recv   Size     Size    Time     Rate
bytes  Bytes  bytes    bytes   secs.    per sec

16384  87380  1        1       10.00    27291.86
16384  87380

The difference is roughly one percent (about 1.1%, with the patched
kernel slightly ahead).
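
For reference, the relative difference between the two transaction rates
above can be checked with a few lines of C. This is only arithmetic on
the two numbers reported by netperf; the helper name is made up.

```c
#include <assert.h>

/* Relative change from 'before' to 'after', in percent. */
static double percent_change(double before, double after)
{
	assert(before != 0.0);
	return (after - before) / before * 100.0;
}
```

Calling percent_change(26996.35, 27291.86) gives a value just under 1.1.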

Nesting cgroups doesn't seem to be a dominating factor either,
with nestings of up to 10 levels not showing a significant performance
difference.

Glauber Costa (9):
  per-netns ipv4 sysctl_tcp_mem
  Kernel Memory cgroup
  socket: initial cgroup code.
  function wrappers for upcoming socket
  foundations of per-cgroup memory pressure controlling.
  per-cgroup tcp buffers control
  tcp buffer limitation: per-cgroup limit
  Display current tcp memory allocation in kmem cgroup
  Add documentation about kmem_cgroup

 Documentation/cgroups/kmem_cgroups.txt |   27 +++++
 crypto/af_alg.c                        |    7 +-
 include/linux/cgroup_subsys.h          |    4 +
 include/linux/kmem_cgroup.h            |  194 ++++++++++++++++++++++++++++++++
 include/net/netns/ipv4.h               |    1 +
 include/net/sock.h                     |   37 +++++-
 include/net/tcp.h                      |   13 ++-
 include/net/udp.h                      |    3 +-
 include/trace/events/sock.h            |   10 +-
 init/Kconfig                           |   11 ++
 mm/Makefile                            |    1 +
 mm/kmem_cgroup.c                       |   61 ++++++++++
 net/core/sock.c                        |   88 ++++++++++-----
 net/decnet/af_decnet.c                 |   21 +++-
 net/ipv4/proc.c                        |    7 +-
 net/ipv4/sysctl_net_ipv4.c             |   59 +++++++++-
 net/ipv4/tcp.c                         |  181 ++++++++++++++++++++++++++----
 net/ipv4/tcp_input.c                   |   12 +-
 net/ipv4/tcp_ipv4.c                    |   15 ++-
 net/ipv4/tcp_output.c                  |    2 +-
 net/ipv4/tcp_timer.c                   |    2 +-
 net/ipv4/udp.c                         |   20 +++-
 net/ipv6/tcp_ipv6.c                    |   10 +-
 net/ipv6/udp.c                         |    4 +-
 net/sctp/socket.c                      |   35 +++++--
 25 files changed, 710 insertions(+), 115 deletions(-)
 create mode 100644 Documentation/cgroups/kmem_cgroups.txt
 create mode 100644 include/linux/kmem_cgroup.h
 create mode 100644 mm/kmem_cgroup.c

-- 
1.7.6

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .

* [PATCH v2 1/9] per-netns ipv4 sysctl_tcp_mem
  2011-09-07  4:23 ` Glauber Costa
@ 2011-09-07  4:23   ` Glauber Costa
  -1 siblings, 0 replies; 59+ messages in thread
From: Glauber Costa @ 2011-09-07  4:23 UTC
  To: linux-kernel
  Cc: linux-mm, containers, netdev, xemul, Glauber Costa,
	David S. Miller, Hiroyouki Kamezawa, Eric W. Biederman

This patch allows each namespace to independently set up
its levels for tcp memory pressure thresholds. This patch
alone does not buy much: we need to make these values
per group of processes somehow. This is achieved in the
patches that follow in this patchset.

Signed-off-by: Glauber Costa <glommer@parallels.com>
CC: David S. Miller <davem@davemloft.net>
CC: Hiroyouki Kamezawa <kamezawa.hiroyu@jp.fujitsu.com>
CC: Eric W. Biederman <ebiederm@xmission.com>
---
 include/net/netns/ipv4.h   |    1 +
 include/net/tcp.h          |    1 -
 net/ipv4/sysctl_net_ipv4.c |   51 +++++++++++++++++++++++++++++++++++++------
 net/ipv4/tcp.c             |   13 +++-------
 4 files changed, 49 insertions(+), 17 deletions(-)

diff --git a/include/net/netns/ipv4.h b/include/net/netns/ipv4.h
index d786b4f..bbd023a 100644
--- a/include/net/netns/ipv4.h
+++ b/include/net/netns/ipv4.h
@@ -55,6 +55,7 @@ struct netns_ipv4 {
 	int current_rt_cache_rebuild_count;
 
 	unsigned int sysctl_ping_group_range[2];
+	long sysctl_tcp_mem[3];
 
 	atomic_t rt_genid;
 	atomic_t dev_addr_genid;
diff --git a/include/net/tcp.h b/include/net/tcp.h
index 149a415..6bfdd9b 100644
--- a/include/net/tcp.h
+++ b/include/net/tcp.h
@@ -230,7 +230,6 @@ extern int sysctl_tcp_fack;
 extern int sysctl_tcp_reordering;
 extern int sysctl_tcp_ecn;
 extern int sysctl_tcp_dsack;
-extern long sysctl_tcp_mem[3];
 extern int sysctl_tcp_wmem[3];
 extern int sysctl_tcp_rmem[3];
 extern int sysctl_tcp_app_win;
diff --git a/net/ipv4/sysctl_net_ipv4.c b/net/ipv4/sysctl_net_ipv4.c
index 69fd720..0d74b9d 100644
--- a/net/ipv4/sysctl_net_ipv4.c
+++ b/net/ipv4/sysctl_net_ipv4.c
@@ -14,6 +14,7 @@
 #include <linux/init.h>
 #include <linux/slab.h>
 #include <linux/nsproxy.h>
+#include <linux/swap.h>
 #include <net/snmp.h>
 #include <net/icmp.h>
 #include <net/ip.h>
@@ -174,6 +175,36 @@ static int proc_allowed_congestion_control(ctl_table *ctl,
 	return ret;
 }
 
+static int ipv4_tcp_mem(ctl_table *ctl, int write,
+			   void __user *buffer, size_t *lenp,
+			   loff_t *ppos)
+{
+	int ret;
+	unsigned long vec[3];
+	struct net *net = current->nsproxy->net_ns;
+	int i;
+
+	ctl_table tmp = {
+		.data = &vec,
+		.maxlen = sizeof(vec),
+		.mode = ctl->mode,
+	};
+
+	if (!write) {
+		ctl->data = &net->ipv4.sysctl_tcp_mem;
+		return proc_doulongvec_minmax(ctl, write, buffer, lenp, ppos);
+	}
+
+	ret = proc_doulongvec_minmax(&tmp, write, buffer, lenp, ppos);
+	if (ret)
+		return ret;
+
+	for (i = 0; i < 3; i++)
+		net->ipv4.sysctl_tcp_mem[i] = vec[i];
+
+	return 0;
+}
+
 static struct ctl_table ipv4_table[] = {
 	{
 		.procname	= "tcp_timestamps",
@@ -433,13 +464,6 @@ static struct ctl_table ipv4_table[] = {
 		.proc_handler	= proc_dointvec
 	},
 	{
-		.procname	= "tcp_mem",
-		.data		= &sysctl_tcp_mem,
-		.maxlen		= sizeof(sysctl_tcp_mem),
-		.mode		= 0644,
-		.proc_handler	= proc_doulongvec_minmax
-	},
-	{
 		.procname	= "tcp_wmem",
 		.data		= &sysctl_tcp_wmem,
 		.maxlen		= sizeof(sysctl_tcp_wmem),
@@ -721,6 +745,12 @@ static struct ctl_table ipv4_net_table[] = {
 		.mode		= 0644,
 		.proc_handler	= ipv4_ping_group_range,
 	},
+	{
+		.procname	= "tcp_mem",
+		.maxlen		= sizeof(init_net.ipv4.sysctl_tcp_mem),
+		.mode		= 0644,
+		.proc_handler	= ipv4_tcp_mem,
+	},
 	{ }
 };
 
@@ -734,6 +764,7 @@ EXPORT_SYMBOL_GPL(net_ipv4_ctl_path);
 static __net_init int ipv4_sysctl_init_net(struct net *net)
 {
 	struct ctl_table *table;
+	unsigned long limit;
 
 	table = ipv4_net_table;
 	if (!net_eq(net, &init_net)) {
@@ -769,6 +800,12 @@ static __net_init int ipv4_sysctl_init_net(struct net *net)
 
 	net->ipv4.sysctl_rt_cache_rebuild_count = 4;
 
+	limit = nr_free_buffer_pages() / 8;
+	limit = max(limit, 128UL);
+	net->ipv4.sysctl_tcp_mem[0] = limit / 4 * 3;
+	net->ipv4.sysctl_tcp_mem[1] = limit;
+	net->ipv4.sysctl_tcp_mem[2] = net->ipv4.sysctl_tcp_mem[0] * 2;
+
 	net->ipv4.ipv4_hdr = register_net_sysctl_table(net,
 			net_ipv4_ctl_path, table);
 	if (net->ipv4.ipv4_hdr == NULL)
diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
index 46febca..f06df24 100644
--- a/net/ipv4/tcp.c
+++ b/net/ipv4/tcp.c
@@ -266,6 +266,7 @@
 #include <linux/crypto.h>
 #include <linux/time.h>
 #include <linux/slab.h>
+#include <linux/nsproxy.h>
 
 #include <net/icmp.h>
 #include <net/tcp.h>
@@ -282,11 +283,9 @@ int sysctl_tcp_fin_timeout __read_mostly = TCP_FIN_TIMEOUT;
 struct percpu_counter tcp_orphan_count;
 EXPORT_SYMBOL_GPL(tcp_orphan_count);
 
-long sysctl_tcp_mem[3] __read_mostly;
 int sysctl_tcp_wmem[3] __read_mostly;
 int sysctl_tcp_rmem[3] __read_mostly;
 
-EXPORT_SYMBOL(sysctl_tcp_mem);
 EXPORT_SYMBOL(sysctl_tcp_rmem);
 EXPORT_SYMBOL(sysctl_tcp_wmem);
 
@@ -3277,14 +3276,10 @@ void __init tcp_init(void)
 	sysctl_tcp_max_orphans = cnt / 2;
 	sysctl_max_syn_backlog = max(128, cnt / 256);
 
-	limit = nr_free_buffer_pages() / 8;
-	limit = max(limit, 128UL);
-	sysctl_tcp_mem[0] = limit / 4 * 3;
-	sysctl_tcp_mem[1] = limit;
-	sysctl_tcp_mem[2] = sysctl_tcp_mem[0] * 2;
-
 	/* Set per-socket limits to no more than 1/128 the pressure threshold */
-	limit = ((unsigned long)sysctl_tcp_mem[1]) << (PAGE_SHIFT - 7);
+	limit = (unsigned long)init_net.ipv4.sysctl_tcp_mem[1];
+	limit <<= (PAGE_SHIFT - 7);
+
 	max_share = min(4UL*1024*1024, limit);
 
 	sysctl_tcp_wmem[0] = SK_MEM_QUANTUM;
-- 
1.7.6
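
The default thresholds that the patch above moves into
ipv4_sysctl_init_net(), and the per-socket ceiling that tcp_init() derives
from them, can be reproduced in userspace with the sketch below. The
arithmetic mirrors the hunks; the struct, the function name, and the fixed
page shift of 12 (4 KiB pages) are assumptions made for illustration, and
the free-page count is passed in where the kernel calls
nr_free_buffer_pages().

```c
#include <assert.h>

#define SKETCH_PAGE_SHIFT 12	/* assumption: 4 KiB pages */

struct tcp_mem_sketch {
	unsigned long mem[3];		/* the three tcp_mem values, in pages */
	unsigned long max_share;	/* per-socket ceiling, in bytes */
};

static struct tcp_mem_sketch compute_tcp_mem(unsigned long free_buffer_pages)
{
	struct tcp_mem_sketch d;
	unsigned long limit = free_buffer_pages / 8;

	if (limit < 128UL)
		limit = 128UL;
	d.mem[0] = limit / 4 * 3;
	d.mem[1] = limit;
	d.mem[2] = d.mem[0] * 2;
	assert(d.mem[0] <= d.mem[1] && d.mem[1] <= d.mem[2]);

	/* No more than 1/128 of the pressure threshold, capped at 4 MiB. */
	limit = d.mem[1] << (SKETCH_PAGE_SHIFT - 7);
	d.max_share = limit < 4UL * 1024 * 1024 ? limit : 4UL * 1024 * 1024;
	return d;
}
```

With 1048576 free pages this yields thresholds of 98304, 131072 and
196608 pages, and a per-socket ceiling of 4194304 bytes.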

* [PATCH v2 2/9] Kernel Memory cgroup
  2011-09-07  4:23 ` Glauber Costa
@ 2011-09-07  4:23   ` Glauber Costa
  -1 siblings, 0 replies; 59+ messages in thread
From: Glauber Costa @ 2011-09-07  4:23 UTC
  To: linux-kernel
  Cc: linux-mm, containers, netdev, xemul, Glauber Costa,
	David S. Miller, Hiroyouki Kamezawa, Eric W. Biederman

This patch introduces the kernel memory cgroup. Its purpose
is to track and control/limit allocation of kernel objects.
Kernel objects are very different in nature from user memory,
because they can't be swapped out, so can't be overcommitted.

The first incarnation is very simple. The current patch doesn't
add any objects to be tracked, but rather, just the cgroup
structure.

Signed-off-by: Glauber Costa <glommer@parallels.com>
CC: David S. Miller <davem@davemloft.net>
CC: Hiroyouki Kamezawa <kamezawa.hiroyu@jp.fujitsu.com>
CC: Eric W. Biederman <ebiederm@xmission.com>
---
 include/linux/cgroup_subsys.h |    4 +++
 include/linux/kmem_cgroup.h   |   53 +++++++++++++++++++++++++++++++++++++++++
 init/Kconfig                  |   11 ++++++++
 mm/Makefile                   |    1 +
 mm/kmem_cgroup.c              |   53 +++++++++++++++++++++++++++++++++++++++++
 5 files changed, 122 insertions(+), 0 deletions(-)
 create mode 100644 include/linux/kmem_cgroup.h
 create mode 100644 mm/kmem_cgroup.c

diff --git a/include/linux/cgroup_subsys.h b/include/linux/cgroup_subsys.h
index ac663c1..363b8e8 100644
--- a/include/linux/cgroup_subsys.h
+++ b/include/linux/cgroup_subsys.h
@@ -35,6 +35,10 @@ SUBSYS(cpuacct)
 SUBSYS(mem_cgroup)
 #endif
 
+#ifdef CONFIG_CGROUP_KMEM
+SUBSYS(kmem)
+#endif
+
 /* */
 
 #ifdef CONFIG_CGROUP_DEVICE
diff --git a/include/linux/kmem_cgroup.h b/include/linux/kmem_cgroup.h
new file mode 100644
index 0000000..0e4a74b
--- /dev/null
+++ b/include/linux/kmem_cgroup.h
@@ -0,0 +1,53 @@
+/* kmem_cgroup.h - Kernel Memory Controller
+ *
+ * Copyright Parallels Inc., 2011
+ * Author: Glauber Costa <glommer@parallels.com>
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU General Public License for more details.
+ */
+
+#ifndef _LINUX_KMEM_CGROUP_H
+#define _LINUX_KMEM_CGROUP_H
+#include <linux/cgroup.h>
+#include <linux/atomic.h>
+#include <linux/percpu_counter.h>
+
+struct kmem_cgroup {
+	struct cgroup_subsys_state css;
+	struct kmem_cgroup *parent;
+};
+
+
+#ifdef CONFIG_CGROUP_KMEM
+static inline struct kmem_cgroup *kcg_from_cgroup(struct cgroup *cgrp)
+{
+	return container_of(cgroup_subsys_state(cgrp, kmem_subsys_id),
+		struct kmem_cgroup, css);
+}
+
+static inline struct kmem_cgroup *kcg_from_task(struct task_struct *tsk)
+{
+	return container_of(task_subsys_state(tsk, kmem_subsys_id),
+		struct kmem_cgroup, css);
+}
+#else
+static inline struct kmem_cgroup *kcg_from_cgroup(struct cgroup *cgrp)
+{
+	return NULL;
+}
+
+static inline struct kmem_cgroup *kcg_from_task(struct task_struct *tsk)
+{
+	return NULL;
+}
+#endif /* CONFIG_CGROUP_KMEM */
+#endif /* _LINUX_KMEM_CGROUP_H */
+
diff --git a/init/Kconfig b/init/Kconfig
index d627783..5955ac2 100644
--- a/init/Kconfig
+++ b/init/Kconfig
@@ -690,6 +690,17 @@ config CGROUP_MEM_RES_CTLR_SWAP_ENABLED
 	  select this option (if, for some reason, they need to disable it
 	  then swapaccount=0 does the trick).
 
+config CGROUP_KMEM
+	bool "Kernel Memory Resource Controller for Control Groups"
+	depends on CGROUPS
+	help
+	  The Kernel Memory cgroup can limit the amount of memory used by
+	  certain kernel objects in the system. Those are fundamentally
+	  different from the entities handled by the Memory Controller,
+	  which are page-based, and can be swapped. Users of the kmem
+	  cgroup can use it to guarantee that no group of processes will
+	  ever exhaust kernel resources alone.
+
 config CGROUP_PERF
 	bool "Enable perf_event per-cpu per-container group (cgroup) monitoring"
 	depends on PERF_EVENTS && CGROUPS
diff --git a/mm/Makefile b/mm/Makefile
index 836e416..1b1aa24 100644
--- a/mm/Makefile
+++ b/mm/Makefile
@@ -45,6 +45,7 @@ obj-$(CONFIG_MIGRATION) += migrate.o
 obj-$(CONFIG_QUICKLIST) += quicklist.o
 obj-$(CONFIG_TRANSPARENT_HUGEPAGE) += huge_memory.o
 obj-$(CONFIG_CGROUP_MEM_RES_CTLR) += memcontrol.o page_cgroup.o
+obj-$(CONFIG_CGROUP_KMEM) += kmem_cgroup.o
 obj-$(CONFIG_MEMORY_FAILURE) += memory-failure.o
 obj-$(CONFIG_HWPOISON_INJECT) += hwpoison-inject.o
 obj-$(CONFIG_DEBUG_KMEMLEAK) += kmemleak.o
diff --git a/mm/kmem_cgroup.c b/mm/kmem_cgroup.c
new file mode 100644
index 0000000..7950e69
--- /dev/null
+++ b/mm/kmem_cgroup.c
@@ -0,0 +1,53 @@
+/* kmem_cgroup.c - Kernel Memory Controller
+ *
+ * Copyright Parallels Inc, 2011
+ * Author: Glauber Costa <glommer@parallels.com>
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU General Public License for more details.
+ */
+
+#include <linux/cgroup.h>
+#include <linux/slab.h>
+#include <linux/kmem_cgroup.h>
+
+static int kmem_populate(struct cgroup_subsys *ss, struct cgroup *cgrp)
+{
+	return 0;
+}
+
+static void
+kmem_destroy(struct cgroup_subsys *ss, struct cgroup *cgrp)
+{
+	struct kmem_cgroup *cg = kcg_from_cgroup(cgrp);
+	kfree(cg);
+}
+
+static struct cgroup_subsys_state *kmem_create(
+	struct cgroup_subsys *ss, struct cgroup *cgrp)
+{
+	struct kmem_cgroup *sk = kzalloc(sizeof(*sk), GFP_KERNEL);
+
+	if (!sk)
+		return ERR_PTR(-ENOMEM);
+
+	if (cgrp->parent)
+		sk->parent = kcg_from_cgroup(cgrp->parent);
+
+	return &sk->css;
+}
+
+struct cgroup_subsys kmem_subsys = {
+	.name = "kmem",
+	.create = kmem_create,
+	.destroy = kmem_destroy,
+	.populate = kmem_populate,
+	.subsys_id = kmem_subsys_id,
+};
-- 
1.7.6
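
For readers less familiar with the container_of() idiom used by
kcg_from_cgroup() and with the parent pointer set up in kmem_create(), here
is a small userspace model. Everything suffixed _model is a made-up
stand-in, not a kernel type, and the depth helper is purely illustrative;
the series itself does not contain such a function.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Userspace stand-in for the kernel's container_of(). */
#define container_of(ptr, type, member) \
	((type *)((char *)(ptr) - offsetof(type, member)))

struct css_model {		/* stand-in for struct cgroup_subsys_state */
	int placeholder;
};

struct kmem_cgroup_model {
	struct css_model css;
	struct kmem_cgroup_model *parent;
};

/* Mirrors the shape of kmem_create(): zeroed allocation, then link parent. */
static struct kmem_cgroup_model *kcg_model_create(struct kmem_cgroup_model *parent)
{
	struct kmem_cgroup_model *kcg = calloc(1, sizeof(*kcg));

	if (kcg)
		kcg->parent = parent;
	return kcg;
}

/* Recover the enclosing object from a pointer to its embedded css. */
static struct kmem_cgroup_model *kcg_model_from_css(struct css_model *css)
{
	assert(css != NULL);
	return container_of(css, struct kmem_cgroup_model, css);
}

/* Number of ancestors above this group; the root has depth 0. */
static int kcg_model_depth(const struct kmem_cgroup_model *kcg)
{
	int depth = 0;

	for (; kcg && kcg->parent; kcg = kcg->parent)
		depth++;
	return depth;
}
```

Because css is the first member of the struct, the pointer adjustment
done by container_of() is zero here, but the idiom keeps working if
fields are later added in front of it.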

* [PATCH v2 3/9] socket: initial cgroup code.
  2011-09-07  4:23 ` Glauber Costa
@ 2011-09-07  4:23   ` Glauber Costa
  -1 siblings, 0 replies; 59+ messages in thread
From: Glauber Costa @ 2011-09-07  4:23 UTC (permalink / raw)
  To: linux-kernel
  Cc: linux-mm, containers, netdev, xemul, Glauber Costa,
	David S. Miller, Hiroyouki Kamezawa, Eric W. Biederman

We aim to control the amount of kernel memory pinned at any
time by tcp sockets. To lay the foundations for this work,
this patch adds a pointer to the kmem_cgroup to the socket
structure.
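
As a rough illustration of the lifetime rule this sets up, here is a
minimal userspace sketch. Everything in it (group_model, sock_model and
the helper names) is hypothetical and only models the idea; the real code
uses cgroup_exclude_rmdir() and cgroup_release_and_wakeup_rmdir() on the
css. The socket records the group of the creating task once, keeps it
even if the task later moves, and holds a pin that blocks removal of the
group until the socket is freed.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for struct kmem_cgroup. */
struct group_model {
	struct group_model *parent;
	int pins;			/* sockets that still point at this group */
};

/* Hypothetical stand-in for the sk_cgrp field added to struct sock. */
struct sock_model {
	struct group_model *cgrp;
};

/* Called once at socket allocation: remember the creator's group and pin it. */
static void sock_model_attach(struct sock_model *sk, struct group_model *creator)
{
	sk->cgrp = creator;
	sk->cgrp->pins++;
}

/* Called when the socket is freed: drop the pin taken at allocation. */
static void sock_model_release(struct sock_model *sk)
{
	assert(sk->cgrp != NULL && sk->cgrp->pins > 0);
	sk->cgrp->pins--;
	sk->cgrp = NULL;
}

/* A group may only go away once no socket pins it. */
static int group_model_can_remove(const struct group_model *g)
{
	return g->pins == 0;
}
```

The attach and release calls correspond to the hooks placed in sk_alloc()
and __sk_free() by this patch.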

Signed-off-by: Glauber Costa <glommer@parallels.com>
CC: David S. Miller <davem@davemloft.net>
CC: Hiroyouki Kamezawa <kamezawa.hiroyu@jp.fujitsu.com>
CC: Eric W. Biederman <ebiederm@xmission.com>
---
 include/linux/kmem_cgroup.h |   29 +++++++++++++++++++++++++++++
 include/net/sock.h          |    2 ++
 net/core/sock.c             |    5 ++---
 3 files changed, 33 insertions(+), 3 deletions(-)

diff --git a/include/linux/kmem_cgroup.h b/include/linux/kmem_cgroup.h
index 0e4a74b..77076d8 100644
--- a/include/linux/kmem_cgroup.h
+++ b/include/linux/kmem_cgroup.h
@@ -49,5 +49,34 @@ static inline struct kmem_cgroup *kcg_from_task(struct task_struct *tsk)
 	return NULL;
 }
 #endif /* CONFIG_CGROUP_KMEM */
+
+#ifdef CONFIG_INET
+#include <net/sock.h>
+static inline void sock_update_kmem_cgrp(struct sock *sk)
+{
+#ifdef CONFIG_CGROUP_KMEM
+	sk->sk_cgrp = kcg_from_task(current);
+
+	/*
+	 * We don't need to protect against anything task-related, because
+	 * we are basically stuck with the sock pointer that won't change,
+	 * even if the task that originated the socket changes cgroups.
+	 *
+	 * What we do have to guarantee, is that the chain leading us to
+	 * the top level won't change under our noses. Incrementing the
+	 * reference count via cgroup_exclude_rmdir guarantees that.
+	 */
+	cgroup_exclude_rmdir(&sk->sk_cgrp->css);
+#endif
+}
+
+static inline void sock_release_kmem_cgrp(struct sock *sk)
+{
+#ifdef CONFIG_CGROUP_KMEM
+	cgroup_release_and_wakeup_rmdir(&sk->sk_cgrp->css);
+#endif
+}
+
+#endif /* CONFIG_INET */
 #endif /* _LINUX_KMEM_CGROUP_H */
 
diff --git a/include/net/sock.h b/include/net/sock.h
index 8e4062f..709382f 100644
--- a/include/net/sock.h
+++ b/include/net/sock.h
@@ -228,6 +228,7 @@ struct sock_common {
   *	@sk_security: used by security modules
   *	@sk_mark: generic packet mark
   *	@sk_classid: this socket's cgroup classid
+  *	@sk_cgrp: this socket's kernel memory (kmem) cgroup 
   *	@sk_write_pending: a write to stream socket waits to start
   *	@sk_state_change: callback to indicate change in the state of the sock
   *	@sk_data_ready: callback to indicate there is data to be processed
@@ -339,6 +340,7 @@ struct sock {
 #endif
 	__u32			sk_mark;
 	u32			sk_classid;
+	struct kmem_cgroup	*sk_cgrp;
 	void			(*sk_state_change)(struct sock *sk);
 	void			(*sk_data_ready)(struct sock *sk, int bytes);
 	void			(*sk_write_space)(struct sock *sk);
diff --git a/net/core/sock.c b/net/core/sock.c
index 3449df8..7109864 100644
--- a/net/core/sock.c
+++ b/net/core/sock.c
@@ -1139,6 +1139,7 @@ struct sock *sk_alloc(struct net *net, int family, gfp_t priority,
 		atomic_set(&sk->sk_wmem_alloc, 1);
 
 		sock_update_classid(sk);
+		sock_update_kmem_cgrp(sk);
 	}
 
 	return sk;
@@ -1170,6 +1171,7 @@ static void __sk_free(struct sock *sk)
 		put_cred(sk->sk_peer_cred);
 	put_pid(sk->sk_peer_pid);
 	put_net(sock_net(sk));
+	sock_release_kmem_cgrp(sk);
 	sk_prot_free(sk->sk_prot_creator, sk);
 }
 
@@ -2252,9 +2254,6 @@ void sk_common_release(struct sock *sk)
 }
 EXPORT_SYMBOL(sk_common_release);
 
-static DEFINE_RWLOCK(proto_list_lock);
-static LIST_HEAD(proto_list);
-
 #ifdef CONFIG_PROC_FS
 #define PROTO_INUSE_NR	64	/* should be enough for the first time */
 struct prot_inuse {
-- 
1.7.6

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Fight unfair telecom internet charges in Canada: sign http://stopthemeter.ca/
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply related	[flat|nested] 59+ messages in thread

* [PATCH v2 4/9] function wrappers for upcoming socket
  2011-09-07  4:23 ` Glauber Costa
@ 2011-09-07  4:23   ` Glauber Costa
  -1 siblings, 0 replies; 59+ messages in thread
From: Glauber Costa @ 2011-09-07  4:23 UTC (permalink / raw)
  To: linux-kernel
  Cc: linux-mm, containers, netdev, xemul, Glauber Costa,
	David S. Miller, Hiroyouki Kamezawa, Eric W. Biederman

Instead of dealing with global values for memory pressure scenarios,
per-cgroup values will be needed. This patch just writes down the
accessor functions to be used later.
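
The hierarchical billing done by sk_memory_allocated_add() and
sk_memory_allocated_sub() can be modelled in userspace roughly as
follows. This is a sketch under assumptions, not the kernel code: the
kcg_model names are hypothetical, C11 atomics stand in for
atomic_long_t, and each level is compared against its own hard_limit,
whereas the patch compares every level against sk_prot_mem(sk, 2).

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical userspace stand-in for struct kmem_cgroup. */
struct kcg_model {
	struct kcg_model *parent;	/* NULL for the root group */
	atomic_long allocated;		/* pages charged to this level */
	long hard_limit;		/* assumed per-level hard limit */
};

static void kcg_model_init(struct kcg_model *cg, struct kcg_model *parent,
			   long hard_limit)
{
	cg->parent = parent;
	cg->hard_limit = hard_limit;
	atomic_init(&cg->allocated, 0);
}

/*
 * Bill amt pages to cg and to every ancestor. The walk never stops early:
 * all levels are billed, and *parent_failure is set if any ancestor ends
 * up over its limit. Returns the new total of cg itself.
 */
static long kcg_model_charge(struct kcg_model *cg, long amt, bool *parent_failure)
{
	long allocated = atomic_fetch_add(&cg->allocated, amt) + amt;
	struct kcg_model *p;

	for (p = cg->parent; p != NULL; p = p->parent) {
		long level = atomic_fetch_add(&p->allocated, amt) + amt;

		if (level > p->hard_limit)
			*parent_failure = true;
	}
	return allocated;
}

/* Unbill amt pages from cg and from every ancestor. */
static void kcg_model_uncharge(struct kcg_model *cg, long amt)
{
	struct kcg_model *p;

	for (p = cg; p != NULL; p = p->parent)
		atomic_fetch_sub(&p->allocated, amt);
}

/* Bill first, then undo the whole chain if any level went over its limit. */
static bool kcg_model_try_charge(struct kcg_model *cg, long amt)
{
	bool parent_failure = false;
	long allocated = kcg_model_charge(cg, amt, &parent_failure);

	if (parent_failure || allocated > cg->hard_limit) {
		kcg_model_uncharge(cg, amt);
		return false;
	}
	return true;
}
```

Billing all the way to the root and unbilling afterwards keeps the walk
simple, at the cost of a short window in which the counters are too high.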

Signed-off-by: Glauber Costa <glommer@parallels.com>
CC: David S. Miller <davem@davemloft.net>
CC: Hiroyouki Kamezawa <kamezawa.hiroyu@jp.fujitsu.com>
CC: Eric W. Biederman <ebiederm@xmission.com>
---
 include/linux/kmem_cgroup.h |  104 +++++++++++++++++++++++++++++++++++++++++++
 1 files changed, 104 insertions(+), 0 deletions(-)

diff --git a/include/linux/kmem_cgroup.h b/include/linux/kmem_cgroup.h
index 77076d8..d983ba8 100644
--- a/include/linux/kmem_cgroup.h
+++ b/include/linux/kmem_cgroup.h
@@ -52,6 +52,110 @@ static inline struct kmem_cgroup *kcg_from_task(struct task_struct *tsk)
 
 #ifdef CONFIG_INET
 #include <net/sock.h>
+static inline int *sk_memory_pressure(struct sock *sk)
+{
+	int *ret = NULL;
+	if (sk->sk_prot->memory_pressure)
+		ret = sk->sk_prot->memory_pressure(sk->sk_cgrp);
+	return ret;
+}
+
+static inline long sk_prot_mem(struct sock *sk, int index)
+{
+	long *prot = sk->sk_prot->prot_mem(sk->sk_cgrp);
+	return prot[index];
+}
+
+static inline long
+sk_memory_allocated(struct sock *sk)
+{
+	struct proto *prot = sk->sk_prot;
+	struct kmem_cgroup *cg = sk->sk_cgrp;
+
+	return atomic_long_read(prot->memory_allocated(cg));
+}
+
+static inline long
+sk_memory_allocated_add(struct sock *sk, int amt, int *parent_failure)
+{
+	struct proto *prot = sk->sk_prot;
+	struct kmem_cgroup *cg = sk->sk_cgrp;
+	long allocated = atomic_long_add_return(amt, prot->memory_allocated(cg));
+
+#ifdef CONFIG_CGROUP_KMEM
+	for (cg = cg->parent; cg != NULL; cg = cg->parent) {
+		long alloc;
+		/*
+		 * Large nestings are not the common case, and stopping in the
+		 * middle would be complicated enough, that we bill it all the
+		 * way through the root, and if needed, unbill everything later
+		 */
+		alloc = atomic_long_add_return(amt, prot->memory_allocated(cg));
+		*parent_failure |= (alloc > sk_prot_mem(sk, 2));
+	}
+#endif
+	return allocated;
+}
+
+static inline void
+sk_memory_allocated_sub(struct sock *sk, int amt)
+{
+	struct proto *prot = sk->sk_prot;
+	struct kmem_cgroup *cg = sk->sk_cgrp;
+
+	atomic_long_sub(amt, prot->memory_allocated(cg));
+
+#ifdef CONFIG_CGROUP_KMEM
+	for (cg = sk->sk_cgrp->parent; cg != NULL; cg = cg->parent)
+		atomic_long_sub(amt, prot->memory_allocated(cg));
+#endif
+}
+
+static inline void sk_sockets_allocated_dec(struct sock *sk)
+{
+	struct proto *prot = sk->sk_prot;
+	struct kmem_cgroup *cg = sk->sk_cgrp;
+
+	percpu_counter_dec(prot->sockets_allocated(cg));
+#ifdef CONFIG_CGROUP_KMEM
+	for (cg = sk->sk_cgrp->parent; cg; cg = cg->parent)
+		percpu_counter_dec(prot->sockets_allocated(cg));
+#endif
+}
+
+static inline void sk_sockets_allocated_inc(struct sock *sk)
+{
+	struct proto *prot = sk->sk_prot;
+	struct kmem_cgroup *cg = sk->sk_cgrp;
+
+	percpu_counter_inc(prot->sockets_allocated(cg));
+#ifdef CONFIG_CGROUP_KMEM
+	for (cg = sk->sk_cgrp->parent; cg; cg = cg->parent)
+		percpu_counter_inc(prot->sockets_allocated(cg));
+#endif
+}
+
+static inline int
+sk_sockets_allocated_read_positive(struct sock *sk)
+{
+	struct proto *prot = sk->sk_prot;
+	struct kmem_cgroup *cg = sk->sk_cgrp;
+
+	return percpu_counter_sum_positive(prot->sockets_allocated(cg));
+}
+
+static inline int
+kcg_sockets_allocated_sum_positive(struct proto *prot, struct kmem_cgroup *cg)
+{
+	return percpu_counter_sum_positive(prot->sockets_allocated(cg));
+}
+
+static inline long
+kcg_memory_allocated(struct proto *prot, struct kmem_cgroup *cg)
+{
+	return atomic_long_read(prot->memory_allocated(cg));
+}
+
 static inline void sock_update_kmem_cgrp(struct sock *sk)
 {
 #ifdef CONFIG_CGROUP_KMEM
-- 
1.7.6

^ permalink raw reply related	[flat|nested] 59+ messages in thread

* [PATCH v2 5/9] foundations of per-cgroup memory pressure controlling.
  2011-09-07  4:23 ` Glauber Costa
@ 2011-09-07  4:23   ` Glauber Costa
  -1 siblings, 0 replies; 59+ messages in thread
From: Glauber Costa @ 2011-09-07  4:23 UTC (permalink / raw)
  To: linux-kernel
  Cc: linux-mm, containers, netdev, xemul, Glauber Costa,
	David S. Miller, Hiroyouki Kamezawa, Eric W. Biederman

This patch converts the struct proto fields memory_pressure,
memory_allocated, sockets_allocated, and sysctl_mem (now prot_mem)
to function pointers that take a struct kmem_cgroup parameter.

enter_memory_pressure is kept the same, since all its callers
have a socket context, and the kmem_cgroup can be derived from
the socket itself.

To keep things working, the patch converts all users of those fields
to use accessor functions.
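
The shape of the conversion can be sketched in userspace as below. The
names (cg_model, proto_model and the helpers) are hypothetical and only
illustrate the pattern: a field that used to be a plain pointer to a
global counter becomes a function that receives the cgroup and returns
the pointer, so a protocol is free to ignore the argument and keep
returning its global.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for struct kmem_cgroup with one per-group counter. */
struct cg_model {
	long tcp_allocated;
};

/* Hypothetical stand-in for struct proto. */
struct proto_model {
	const char *name;
	/* before the conversion this would have been: long *memory_allocated; */
	long *(*memory_allocated)(struct cg_model *cg);
};

static long global_allocated;

/* Protocol without per-cgroup accounting: ignore cg, return the global. */
static long *memory_allocated_global(struct cg_model *cg)
{
	(void)cg;
	return &global_allocated;
}

/* Assumed per-cgroup variant: the counter lives inside the group. */
static long *memory_allocated_per_cg(struct cg_model *cg)
{
	return cg ? &cg->tcp_allocated : &global_allocated;
}

/* Accessor used by callers instead of dereferencing the field directly. */
static long proto_model_allocated(const struct proto_model *prot,
				  struct cg_model *cg)
{
	return prot->memory_allocated ? *prot->memory_allocated(cg) : -1;
}
```

In this patch every converted protocol still takes the first route and
returns its global variable.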

In my benchmarks I didn't see a significant performance difference
with this patch applied compared to a baseline (around 1% difference,
within the error margin).

Signed-off-by: Glauber Costa <glommer@parallels.com>
CC: David S. Miller <davem@davemloft.net>
CC: Hiroyouki Kamezawa <kamezawa.hiroyu@jp.fujitsu.com>
CC: Eric W. Biederman <ebiederm@xmission.com>
---
 crypto/af_alg.c             |    7 ++++-
 include/net/sock.h          |   27 ++++++++++++++---
 include/net/tcp.h           |   12 +++++--
 include/net/udp.h           |    3 +-
 include/trace/events/sock.h |   10 +++---
 net/core/sock.c             |   65 +++++++++++++++++++++++++------------------
 net/decnet/af_decnet.c      |   21 ++++++++++++--
 net/ipv4/proc.c             |    7 ++--
 net/ipv4/tcp.c              |   30 +++++++++++++++++--
 net/ipv4/tcp_input.c        |   12 ++++----
 net/ipv4/tcp_ipv4.c         |   15 ++++++----
 net/ipv4/tcp_output.c       |    2 +-
 net/ipv4/tcp_timer.c        |    2 +-
 net/ipv4/udp.c              |   20 ++++++++++---
 net/ipv6/tcp_ipv6.c         |   10 +++---
 net/ipv6/udp.c              |    4 +-
 net/sctp/socket.c           |   35 ++++++++++++++++++-----
 17 files changed, 195 insertions(+), 87 deletions(-)

diff --git a/crypto/af_alg.c b/crypto/af_alg.c
index ac33d5f..df168d8 100644
--- a/crypto/af_alg.c
+++ b/crypto/af_alg.c
@@ -29,10 +29,15 @@ struct alg_type_list {
 
 static atomic_long_t alg_memory_allocated;
 
+static atomic_long_t *memory_allocated_alg(struct kmem_cgroup *sg)
+{
+	return &alg_memory_allocated;
+}
+
 static struct proto alg_proto = {
 	.name			= "ALG",
 	.owner			= THIS_MODULE,
-	.memory_allocated	= &alg_memory_allocated,
+	.memory_allocated	= memory_allocated_alg,
 	.obj_size		= sizeof(struct alg_sock),
 };
 
diff --git a/include/net/sock.h b/include/net/sock.h
index 709382f..ab65640 100644
--- a/include/net/sock.h
+++ b/include/net/sock.h
@@ -54,6 +54,7 @@
 #include <linux/security.h>
 #include <linux/slab.h>
 #include <linux/uaccess.h>
+#include <linux/cgroup.h>
 
 #include <linux/filter.h>
 #include <linux/rculist_nulls.h>
@@ -168,6 +169,8 @@ struct sock_common {
 	/* public: */
 };
 
+struct kmem_cgroup;
+
 /**
   *	struct sock - network layer representation of sockets
   *	@__sk_common: shared layout with inet_timewait_sock
@@ -786,18 +789,32 @@ struct proto {
 	unsigned int		inuse_idx;
 #endif
 
+	/*
+	 * per-cgroup memory tracking:
+	 *
+	 * The following functions track memory consumption of network buffers
+	 * by cgroup (kmem_cgroup) for the current protocol. As with the rest
+	 * of the fields in this structure, not all protocols are required
+	 * to implement them. Protocols that don't want to do per-cgroup
+	 * memory pressure management, can just assume the root cgroup is used.
+	 * 
+	 */
 	/* Memory pressure */
 	void			(*enter_memory_pressure)(struct sock *sk);
-	atomic_long_t		*memory_allocated;	/* Current allocated memory. */
-	struct percpu_counter	*sockets_allocated;	/* Current number of sockets. */
+	/* Pointer to the current memory allocation of this cgroup. */
+	atomic_long_t		*(*memory_allocated)(struct kmem_cgroup *sg);
+	/* Pointer to the current number of sockets in this cgroup. */
+	struct percpu_counter	*(*sockets_allocated)(struct kmem_cgroup *sg);
 	/*
-	 * Pressure flag: try to collapse.
+	 * Per cgroup pointer to the pressure flag: try to collapse.
 	 * Technical note: it is used by multiple contexts non atomically.
 	 * All the __sk_mem_schedule() is of this nature: accounting
 	 * is strict, actions are advisory and have some latency.
 	 */
-	int			*memory_pressure;
-	long			*sysctl_mem;
+	int			*(*memory_pressure)(struct kmem_cgroup *sg);
+	/* Pointer to the per-cgroup version of the sysctl_mem field */
+	long			*(*prot_mem)(struct kmem_cgroup *sg);
+
 	int			*sysctl_wmem;
 	int			*sysctl_rmem;
 	int			max_header;
diff --git a/include/net/tcp.h b/include/net/tcp.h
index 6bfdd9b..06b6865 100644
--- a/include/net/tcp.h
+++ b/include/net/tcp.h
@@ -45,6 +45,7 @@
 #include <net/dst.h>
 
 #include <linux/seq_file.h>
+#include <linux/kmem_cgroup.h>
 
 extern struct inet_hashinfo tcp_hashinfo;
 
@@ -252,9 +253,12 @@ extern int sysctl_tcp_cookie_size;
 extern int sysctl_tcp_thin_linear_timeouts;
 extern int sysctl_tcp_thin_dupack;
 
-extern atomic_long_t tcp_memory_allocated;
-extern struct percpu_counter tcp_sockets_allocated;
-extern int tcp_memory_pressure;
+struct kmem_cgroup;
+extern long *tcp_sysctl_mem(struct kmem_cgroup *sg);
+struct percpu_counter *sockets_allocated_tcp(struct kmem_cgroup *sg);
+int *memory_pressure_tcp(struct kmem_cgroup *sg);
+int tcp_init_cgroup(struct cgroup *cgrp, struct cgroup_subsys *ss);
+atomic_long_t *memory_allocated_tcp(struct kmem_cgroup *sg);
 
 /*
  * The next routines deal with comparing 32 bit unsigned ints
@@ -285,7 +289,7 @@ static inline bool tcp_too_many_orphans(struct sock *sk, int shift)
 	}
 
 	if (sk->sk_wmem_queued > SOCK_MIN_SNDBUF &&
-	    atomic_long_read(&tcp_memory_allocated) > sysctl_tcp_mem[2])
+	    sk_memory_allocated(sk) > sk_prot_mem(sk, 2))
 		return true;
 	return false;
 }
diff --git a/include/net/udp.h b/include/net/udp.h
index 67ea6fc..0e27388 100644
--- a/include/net/udp.h
+++ b/include/net/udp.h
@@ -105,7 +105,8 @@ static inline struct udp_hslot *udp_hashslot2(struct udp_table *table,
 
 extern struct proto udp_prot;
 
-extern atomic_long_t udp_memory_allocated;
+atomic_long_t *memory_allocated_udp(struct kmem_cgroup *sg);
+long *udp_sysctl_mem(struct kmem_cgroup *sg);
 
 /* sysctl variables for udp */
 extern long sysctl_udp_mem[3];
diff --git a/include/trace/events/sock.h b/include/trace/events/sock.h
index 779abb9..12a6083 100644
--- a/include/trace/events/sock.h
+++ b/include/trace/events/sock.h
@@ -37,7 +37,7 @@ TRACE_EVENT(sock_exceed_buf_limit,
 
 	TP_STRUCT__entry(
 		__array(char, name, 32)
-		__field(long *, sysctl_mem)
+		__field(long *, prot_mem)
 		__field(long, allocated)
 		__field(int, sysctl_rmem)
 		__field(int, rmem_alloc)
@@ -45,7 +45,7 @@ TRACE_EVENT(sock_exceed_buf_limit,
 
 	TP_fast_assign(
 		strncpy(__entry->name, prot->name, 32);
-		__entry->sysctl_mem = prot->sysctl_mem;
+		__entry->prot_mem = sk->sk_prot->prot_mem(sk->sk_cgrp);
 		__entry->allocated = allocated;
 		__entry->sysctl_rmem = prot->sysctl_rmem[0];
 		__entry->rmem_alloc = atomic_read(&sk->sk_rmem_alloc);
@@ -54,9 +54,9 @@ TRACE_EVENT(sock_exceed_buf_limit,
 	TP_printk("proto:%s sysctl_mem=%ld,%ld,%ld allocated=%ld "
 		"sysctl_rmem=%d rmem_alloc=%d",
 		__entry->name,
-		__entry->sysctl_mem[0],
-		__entry->sysctl_mem[1],
-		__entry->sysctl_mem[2],
+		__entry->prot_mem[0],
+		__entry->prot_mem[1],
+		__entry->prot_mem[2],
 		__entry->allocated,
 		__entry->sysctl_rmem,
 		__entry->rmem_alloc)
diff --git a/net/core/sock.c b/net/core/sock.c
index 7109864..ead9c02 100644
--- a/net/core/sock.c
+++ b/net/core/sock.c
@@ -1290,7 +1290,7 @@ struct sock *sk_clone(const struct sock *sk, const gfp_t priority)
 		newsk->sk_wq = NULL;
 
 		if (newsk->sk_prot->sockets_allocated)
-			percpu_counter_inc(newsk->sk_prot->sockets_allocated);
+			sk_sockets_allocated_inc(newsk);
 
 		if (sock_flag(newsk, SOCK_TIMESTAMP) ||
 		    sock_flag(newsk, SOCK_TIMESTAMPING_RX_SOFTWARE))
@@ -1681,30 +1681,33 @@ int __sk_mem_schedule(struct sock *sk, int size, int kind)
 	struct proto *prot = sk->sk_prot;
 	int amt = sk_mem_pages(size);
 	long allocated;
+	int *memory_pressure;
+	int parent_failure = 0;
 
 	sk->sk_forward_alloc += amt * SK_MEM_QUANTUM;
-	allocated = atomic_long_add_return(amt, prot->memory_allocated);
 
-	/* Under limit. */
-	if (allocated <= prot->sysctl_mem[0]) {
-		if (prot->memory_pressure && *prot->memory_pressure)
-			*prot->memory_pressure = 0;
-		return 1;
-	}
+	memory_pressure = sk_memory_pressure(sk);
+	allocated = sk_memory_allocated_add(sk, amt, &parent_failure);
+
+	/* Over hard limit (we, or our parents) */
+	if (parent_failure || (allocated > sk_prot_mem(sk, 2)))
+		goto suppress_allocation;
 
-	/* Under pressure. */
-	if (allocated > prot->sysctl_mem[1])
+ 	/* Under limit. */
+	if (allocated <= sk_prot_mem(sk, 0))
+		if (memory_pressure && *memory_pressure)
+			*memory_pressure = 0;
+
+ 	/* Under pressure. */
+	if (allocated > sk_prot_mem(sk, 1))
 		if (prot->enter_memory_pressure)
 			prot->enter_memory_pressure(sk);
 
-	/* Over hard limit. */
-	if (allocated > prot->sysctl_mem[2])
-		goto suppress_allocation;
-
 	/* guarantee minimum buffer size under pressure */
 	if (kind == SK_MEM_RECV) {
 		if (atomic_read(&sk->sk_rmem_alloc) < prot->sysctl_rmem[0])
 			return 1;
+
 	} else { /* SK_MEM_SEND */
 		if (sk->sk_type == SOCK_STREAM) {
 			if (sk->sk_wmem_queued < prot->sysctl_wmem[0])
@@ -1714,13 +1717,13 @@ int __sk_mem_schedule(struct sock *sk, int size, int kind)
 				return 1;
 	}
 
-	if (prot->memory_pressure) {
+	if (memory_pressure) {
 		int alloc;
 
-		if (!*prot->memory_pressure)
+		if (!*memory_pressure)
 			return 1;
-		alloc = percpu_counter_read_positive(prot->sockets_allocated);
-		if (prot->sysctl_mem[2] > alloc *
+		alloc = sk_sockets_allocated_read_positive(sk);
+		if (sk_prot_mem(sk, 2) > alloc *
 		    sk_mem_pages(sk->sk_wmem_queued +
 				 atomic_read(&sk->sk_rmem_alloc) +
 				 sk->sk_forward_alloc))
@@ -1743,7 +1746,9 @@ suppress_allocation:
 
 	/* Alas. Undo changes. */
 	sk->sk_forward_alloc -= amt * SK_MEM_QUANTUM;
-	atomic_long_sub(amt, prot->memory_allocated);
+
+	sk_memory_allocated_sub(sk, amt);
+
 	return 0;
 }
 EXPORT_SYMBOL(__sk_mem_schedule);
@@ -1754,15 +1759,14 @@ EXPORT_SYMBOL(__sk_mem_schedule);
  */
 void __sk_mem_reclaim(struct sock *sk)
 {
-	struct proto *prot = sk->sk_prot;
+	int *memory_pressure = sk_memory_pressure(sk);
 
-	atomic_long_sub(sk->sk_forward_alloc >> SK_MEM_QUANTUM_SHIFT,
-		   prot->memory_allocated);
+	sk_memory_allocated_sub(sk, sk->sk_forward_alloc >> SK_MEM_QUANTUM_SHIFT);
 	sk->sk_forward_alloc &= SK_MEM_QUANTUM - 1;
 
-	if (prot->memory_pressure && *prot->memory_pressure &&
-	    (atomic_long_read(prot->memory_allocated) < prot->sysctl_mem[0]))
-		*prot->memory_pressure = 0;
+	if (memory_pressure && *memory_pressure &&
+	    (sk_memory_allocated(sk) < sk_prot_mem(sk, 0)))
+		*memory_pressure = 0;
 }
 EXPORT_SYMBOL(__sk_mem_reclaim);
 
@@ -2478,13 +2482,20 @@ static char proto_method_implemented(const void *method)
 
 static void proto_seq_printf(struct seq_file *seq, struct proto *proto)
 {
+	struct kmem_cgroup *cg = kcg_from_task(current);
+	int *memory_pressure = NULL;
+
+	if (proto->memory_pressure)
+		memory_pressure = proto->memory_pressure(cg);
+
 	seq_printf(seq, "%-9s %4u %6d  %6ld   %-3s %6u   %-3s  %-10s "
 			"%2c %2c %2c %2c %2c %2c %2c %2c %2c %2c %2c %2c %2c %2c %2c %2c %2c %2c %2c\n",
 		   proto->name,
 		   proto->obj_size,
 		   sock_prot_inuse_get(seq_file_net(seq), proto),
-		   proto->memory_allocated != NULL ? atomic_long_read(proto->memory_allocated) : -1L,
-		   proto->memory_pressure != NULL ? *proto->memory_pressure ? "yes" : "no" : "NI",
+		   proto->memory_allocated != NULL ?
+			kcg_memory_allocated(proto, cg) : -1L,
+		   memory_pressure != NULL ? *memory_pressure ? "yes" : "no" : "NI",
 		   proto->max_header,
 		   proto->slab == NULL ? "no" : "yes",
 		   module_name(proto->owner),
diff --git a/net/decnet/af_decnet.c b/net/decnet/af_decnet.c
index 19acd00..463b299 100644
--- a/net/decnet/af_decnet.c
+++ b/net/decnet/af_decnet.c
@@ -458,13 +458,28 @@ static void dn_enter_memory_pressure(struct sock *sk)
 	}
 }
 
+static atomic_long_t *memory_allocated_dn(struct kmem_cgroup *sg)
+{
+	return &decnet_memory_allocated;
+}
+
+static int *memory_pressure_dn(struct kmem_cgroup *sg)
+{
+	return &dn_memory_pressure;
+}
+
+static long *dn_sysctl_mem(struct kmem_cgroup *sg)
+{
+	return sysctl_decnet_mem;
+}
+
 static struct proto dn_proto = {
 	.name			= "NSP",
 	.owner			= THIS_MODULE,
 	.enter_memory_pressure	= dn_enter_memory_pressure,
-	.memory_pressure	= &dn_memory_pressure,
-	.memory_allocated	= &decnet_memory_allocated,
-	.sysctl_mem		= sysctl_decnet_mem,
+	.memory_pressure	= memory_pressure_dn,
+	.memory_allocated	= memory_allocated_dn,
+	.prot_mem		= dn_sysctl_mem,
 	.sysctl_wmem		= sysctl_decnet_wmem,
 	.sysctl_rmem		= sysctl_decnet_rmem,
 	.max_header		= DN_MAX_NSP_DATA_HEADER + 64,
diff --git a/net/ipv4/proc.c b/net/ipv4/proc.c
index b14ec7d..ebe938f 100644
--- a/net/ipv4/proc.c
+++ b/net/ipv4/proc.c
@@ -52,20 +52,21 @@ static int sockstat_seq_show(struct seq_file *seq, void *v)
 {
 	struct net *net = seq->private;
 	int orphans, sockets;
+	struct kmem_cgroup *cg = kcg_from_task(current);
 
 	local_bh_disable();
 	orphans = percpu_counter_sum_positive(&tcp_orphan_count);
-	sockets = percpu_counter_sum_positive(&tcp_sockets_allocated);
+	sockets = kcg_sockets_allocated_sum_positive(&tcp_prot, cg);
 	local_bh_enable();
 
 	socket_seq_show(seq);
 	seq_printf(seq, "TCP: inuse %d orphan %d tw %d alloc %d mem %ld\n",
 		   sock_prot_inuse_get(net, &tcp_prot), orphans,
 		   tcp_death_row.tw_count, sockets,
-		   atomic_long_read(&tcp_memory_allocated));
+		   kcg_memory_allocated(&tcp_prot, cg));
 	seq_printf(seq, "UDP: inuse %d mem %ld\n",
 		   sock_prot_inuse_get(net, &udp_prot),
-		   atomic_long_read(&udp_memory_allocated));
+		   kcg_memory_allocated(&udp_prot, cg));
 	seq_printf(seq, "UDPLITE: inuse %d\n",
 		   sock_prot_inuse_get(net, &udplite_prot));
 	seq_printf(seq, "RAW: inuse %d\n",
diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
index f06df24..76f03ed 100644
--- a/net/ipv4/tcp.c
+++ b/net/ipv4/tcp.c
@@ -290,13 +290,11 @@ EXPORT_SYMBOL(sysctl_tcp_rmem);
 EXPORT_SYMBOL(sysctl_tcp_wmem);
 
 atomic_long_t tcp_memory_allocated;	/* Current allocated memory. */
-EXPORT_SYMBOL(tcp_memory_allocated);
 
 /*
  * Current number of TCP sockets.
  */
 struct percpu_counter tcp_sockets_allocated;
-EXPORT_SYMBOL(tcp_sockets_allocated);
 
 /*
  * TCP splice context
@@ -314,16 +312,40 @@ struct tcp_splice_state {
  * is strict, actions are advisory and have some latency.
  */
 int tcp_memory_pressure __read_mostly;
-EXPORT_SYMBOL(tcp_memory_pressure);
 
-void tcp_enter_memory_pressure(struct sock *sk)
+int *memory_pressure_tcp(struct kmem_cgroup *sg)
+{
+	return &tcp_memory_pressure;
+}
+
+struct percpu_counter *sockets_allocated_tcp(struct kmem_cgroup *sg)
+{
+	return &tcp_sockets_allocated;
+}
+
+void tcp_enter_memory_pressure(struct sock *sk)
 {
 	if (!tcp_memory_pressure) {
 		NET_INC_STATS(sock_net(sk), LINUX_MIB_TCPMEMORYPRESSURES);
 		tcp_memory_pressure = 1;
 	}
 }
+
+long *tcp_sysctl_mem(struct kmem_cgroup *sg)
+{
+	return init_net.ipv4.sysctl_tcp_mem;
+}
+
+atomic_long_t *memory_allocated_tcp(struct kmem_cgroup *sg)
+{
+	return &tcp_memory_allocated;
+}
+
+EXPORT_SYMBOL(memory_pressure_tcp);
+EXPORT_SYMBOL(sockets_allocated_tcp);
 EXPORT_SYMBOL(tcp_enter_memory_pressure);
+EXPORT_SYMBOL(tcp_sysctl_mem);
+EXPORT_SYMBOL(memory_allocated_tcp);
 
 /* Convert seconds to retransmits based on initial and max timeout */
 static u8 secs_to_retrans(int seconds, int timeout, int rto_max)
diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
index ea0d218..3f17423 100644
--- a/net/ipv4/tcp_input.c
+++ b/net/ipv4/tcp_input.c
@@ -316,7 +316,7 @@ static void tcp_grow_window(struct sock *sk, struct sk_buff *skb)
 	/* Check #1 */
 	if (tp->rcv_ssthresh < tp->window_clamp &&
 	    (int)tp->rcv_ssthresh < tcp_space(sk) &&
-	    !tcp_memory_pressure) {
+	    !sk_memory_pressure(sk)) {
 		int incr;
 
 		/* Check #2. Increase window, if skb with such overhead
@@ -398,8 +398,8 @@ static void tcp_clamp_window(struct sock *sk)
 
 	if (sk->sk_rcvbuf < sysctl_tcp_rmem[2] &&
 	    !(sk->sk_userlocks & SOCK_RCVBUF_LOCK) &&
-	    !tcp_memory_pressure &&
-	    atomic_long_read(&tcp_memory_allocated) < sysctl_tcp_mem[0]) {
+	    !sk_memory_pressure(sk) &&
+	    sk_memory_allocated(sk) < sk_prot_mem(sk, 0)) {
 		sk->sk_rcvbuf = min(atomic_read(&sk->sk_rmem_alloc),
 				    sysctl_tcp_rmem[2]);
 	}
@@ -4806,7 +4806,7 @@ static int tcp_prune_queue(struct sock *sk)
 
 	if (atomic_read(&sk->sk_rmem_alloc) >= sk->sk_rcvbuf)
 		tcp_clamp_window(sk);
-	else if (tcp_memory_pressure)
+	else if (sk_memory_pressure(sk))
 		tp->rcv_ssthresh = min(tp->rcv_ssthresh, 4U * tp->advmss);
 
 	tcp_collapse_ofo_queue(sk);
@@ -4872,11 +4872,11 @@ static int tcp_should_expand_sndbuf(struct sock *sk)
 		return 0;
 
 	/* If we are under global TCP memory pressure, do not expand.  */
-	if (tcp_memory_pressure)
+	if (sk_memory_pressure(sk))
 		return 0;
 
 	/* If we are under soft global TCP memory pressure, do not expand.  */
-	if (atomic_long_read(&tcp_memory_allocated) >= sysctl_tcp_mem[0])
+	if (sk_memory_allocated(sk) >= sk_prot_mem(sk, 0))
 		return 0;
 
 	/* If we filled the congestion window, do not expand.  */
diff --git a/net/ipv4/tcp_ipv4.c b/net/ipv4/tcp_ipv4.c
index 1c12b8e..69a02fa 100644
--- a/net/ipv4/tcp_ipv4.c
+++ b/net/ipv4/tcp_ipv4.c
@@ -1901,7 +1901,7 @@ static int tcp_v4_init_sock(struct sock *sk)
 	sk->sk_rcvbuf = sysctl_tcp_rmem[1];
 
 	local_bh_disable();
-	percpu_counter_inc(&tcp_sockets_allocated);
+	sk_sockets_allocated_inc(sk);
 	local_bh_enable();
 
 	return 0;
@@ -1957,7 +1957,7 @@ void tcp_v4_destroy_sock(struct sock *sk)
 		tp->cookie_values = NULL;
 	}
 
-	percpu_counter_dec(&tcp_sockets_allocated);
+	sk_sockets_allocated_dec(sk);
 }
 EXPORT_SYMBOL(tcp_v4_destroy_sock);
 
@@ -2598,11 +2598,14 @@ struct proto tcp_prot = {
 	.unhash			= inet_unhash,
 	.get_port		= inet_csk_get_port,
 	.enter_memory_pressure	= tcp_enter_memory_pressure,
-	.sockets_allocated	= &tcp_sockets_allocated,
+	.memory_pressure	= memory_pressure_tcp,
+	.sockets_allocated	= sockets_allocated_tcp,
 	.orphan_count		= &tcp_orphan_count,
-	.memory_allocated	= &tcp_memory_allocated,
-	.memory_pressure	= &tcp_memory_pressure,
-	.sysctl_mem		= sysctl_tcp_mem,
+	.memory_allocated	= memory_allocated_tcp,
+#ifdef CONFIG_CGROUP_KMEM
+	.init_cgroup		= tcp_init_cgroup,
+#endif
+	.prot_mem		= tcp_sysctl_mem,
 	.sysctl_wmem		= sysctl_tcp_wmem,
 	.sysctl_rmem		= sysctl_tcp_rmem,
 	.max_header		= MAX_TCP_HEADER,
diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c
index 882e0b0..06aeb31 100644
--- a/net/ipv4/tcp_output.c
+++ b/net/ipv4/tcp_output.c
@@ -1912,7 +1912,7 @@ u32 __tcp_select_window(struct sock *sk)
 	if (free_space < (full_space >> 1)) {
 		icsk->icsk_ack.quick = 0;
 
-		if (tcp_memory_pressure)
+		if (sk_memory_pressure(sk))
 			tp->rcv_ssthresh = min(tp->rcv_ssthresh,
 					       4U * tp->advmss);
 
diff --git a/net/ipv4/tcp_timer.c b/net/ipv4/tcp_timer.c
index ecd44b0..2c67617 100644
--- a/net/ipv4/tcp_timer.c
+++ b/net/ipv4/tcp_timer.c
@@ -261,7 +261,7 @@ static void tcp_delack_timer(unsigned long data)
 	}
 
 out:
-	if (tcp_memory_pressure)
+	if (sk_memory_pressure(sk))
 		sk_mem_reclaim(sk);
 out_unlock:
 	bh_unlock_sock(sk);
diff --git a/net/ipv4/udp.c b/net/ipv4/udp.c
index 1b5a193..6c08c65 100644
--- a/net/ipv4/udp.c
+++ b/net/ipv4/udp.c
@@ -120,9 +120,6 @@ EXPORT_SYMBOL(sysctl_udp_rmem_min);
 int sysctl_udp_wmem_min __read_mostly;
 EXPORT_SYMBOL(sysctl_udp_wmem_min);
 
-atomic_long_t udp_memory_allocated;
-EXPORT_SYMBOL(udp_memory_allocated);
-
 #define MAX_UDP_PORTS 65536
 #define PORTS_PER_CHAIN (MAX_UDP_PORTS / UDP_HTABLE_SIZE_MIN)
 
@@ -1918,6 +1915,19 @@ unsigned int udp_poll(struct file *file, struct socket *sock, poll_table *wait)
 }
 EXPORT_SYMBOL(udp_poll);
 
+static atomic_long_t udp_memory_allocated;
+atomic_long_t *memory_allocated_udp(struct kmem_cgroup *sg)
+{
+	return &udp_memory_allocated;
+}
+EXPORT_SYMBOL(memory_allocated_udp);
+
+long *udp_sysctl_mem(struct kmem_cgroup *sg)
+{
+	return sysctl_udp_mem;
+}
+EXPORT_SYMBOL(udp_sysctl_mem);
+
 struct proto udp_prot = {
 	.name		   = "UDP",
 	.owner		   = THIS_MODULE,
@@ -1936,8 +1946,8 @@ struct proto udp_prot = {
 	.unhash		   = udp_lib_unhash,
 	.rehash		   = udp_v4_rehash,
 	.get_port	   = udp_v4_get_port,
-	.memory_allocated  = &udp_memory_allocated,
-	.sysctl_mem	   = sysctl_udp_mem,
+	.memory_allocated  = memory_allocated_udp,
+	.prot_mem	   = udp_sysctl_mem,
 	.sysctl_wmem	   = &sysctl_udp_wmem_min,
 	.sysctl_rmem	   = &sysctl_udp_rmem_min,
 	.obj_size	   = sizeof(struct udp_sock),
diff --git a/net/ipv6/tcp_ipv6.c b/net/ipv6/tcp_ipv6.c
index d1fb63f..807797a 100644
--- a/net/ipv6/tcp_ipv6.c
+++ b/net/ipv6/tcp_ipv6.c
@@ -2012,7 +2012,7 @@ static int tcp_v6_init_sock(struct sock *sk)
 	sk->sk_rcvbuf = sysctl_tcp_rmem[1];
 
 	local_bh_disable();
-	percpu_counter_inc(&tcp_sockets_allocated);
+	sk_sockets_allocated_inc(sk);
 	local_bh_enable();
 
 	return 0;
@@ -2221,11 +2221,11 @@ struct proto tcpv6_prot = {
 	.unhash			= inet_unhash,
 	.get_port		= inet_csk_get_port,
 	.enter_memory_pressure	= tcp_enter_memory_pressure,
-	.sockets_allocated	= &tcp_sockets_allocated,
-	.memory_allocated	= &tcp_memory_allocated,
-	.memory_pressure	= &tcp_memory_pressure,
+	.sockets_allocated	= sockets_allocated_tcp,
+	.memory_allocated	= memory_allocated_tcp,
+	.memory_pressure	= memory_pressure_tcp,
 	.orphan_count		= &tcp_orphan_count,
-	.sysctl_mem		= sysctl_tcp_mem,
+	.prot_mem		= tcp_sysctl_mem,
 	.sysctl_wmem		= sysctl_tcp_wmem,
 	.sysctl_rmem		= sysctl_tcp_rmem,
 	.max_header		= MAX_TCP_HEADER,
diff --git a/net/ipv6/udp.c b/net/ipv6/udp.c
index 29213b5..ef4b5b3 100644
--- a/net/ipv6/udp.c
+++ b/net/ipv6/udp.c
@@ -1465,8 +1465,8 @@ struct proto udpv6_prot = {
 	.unhash		   = udp_lib_unhash,
 	.rehash		   = udp_v6_rehash,
 	.get_port	   = udp_v6_get_port,
-	.memory_allocated  = &udp_memory_allocated,
-	.sysctl_mem	   = sysctl_udp_mem,
+	.memory_allocated  = memory_allocated_udp,
+	.prot_mem	   = udp_sysctl_mem,
 	.sysctl_wmem	   = &sysctl_udp_wmem_min,
 	.sysctl_rmem	   = &sysctl_udp_rmem_min,
 	.obj_size	   = sizeof(struct udp6_sock),
diff --git a/net/sctp/socket.c b/net/sctp/socket.c
index 836aa63..1b0300d 100644
--- a/net/sctp/socket.c
+++ b/net/sctp/socket.c
@@ -119,11 +119,30 @@ static int sctp_memory_pressure;
 static atomic_long_t sctp_memory_allocated;
 struct percpu_counter sctp_sockets_allocated;
 
+static long *sctp_sysctl_mem(struct kmem_cgroup *sg)
+{
+	return sysctl_sctp_mem;
+}
+
 static void sctp_enter_memory_pressure(struct sock *sk)
 {
 	sctp_memory_pressure = 1;
 }
 
+static int *memory_pressure_sctp(struct kmem_cgroup *sg)
+{
+	return &sctp_memory_pressure;
+}
+
+static atomic_long_t *memory_allocated_sctp(struct kmem_cgroup *sg)
+{
+	return &sctp_memory_allocated;
+}
+
+static struct percpu_counter *sockets_allocated_sctp(struct kmem_cgroup *sg)
+{
+	return &sctp_sockets_allocated;
+}
 
 /* Get the sndbuf space available at the time on the association.  */
 static inline int sctp_wspace(struct sctp_association *asoc)
@@ -6831,13 +6850,13 @@ struct proto sctp_prot = {
 	.unhash      =	sctp_unhash,
 	.get_port    =	sctp_get_port,
 	.obj_size    =  sizeof(struct sctp_sock),
-	.sysctl_mem  =  sysctl_sctp_mem,
+	.prot_mem    =  sctp_sysctl_mem,
 	.sysctl_rmem =  sysctl_sctp_rmem,
 	.sysctl_wmem =  sysctl_sctp_wmem,
-	.memory_pressure = &sctp_memory_pressure,
+	.memory_pressure = memory_pressure_sctp,
 	.enter_memory_pressure = sctp_enter_memory_pressure,
-	.memory_allocated = &sctp_memory_allocated,
-	.sockets_allocated = &sctp_sockets_allocated,
+	.memory_allocated = memory_allocated_sctp,
+	.sockets_allocated = sockets_allocated_sctp,
 };
 
 #if defined(CONFIG_IPV6) || defined(CONFIG_IPV6_MODULE)
@@ -6863,12 +6882,12 @@ struct proto sctpv6_prot = {
 	.unhash		= sctp_unhash,
 	.get_port	= sctp_get_port,
 	.obj_size	= sizeof(struct sctp6_sock),
-	.sysctl_mem	= sysctl_sctp_mem,
+	.prot_mem	= sctp_sysctl_mem,
 	.sysctl_rmem	= sysctl_sctp_rmem,
 	.sysctl_wmem	= sysctl_sctp_wmem,
-	.memory_pressure = &sctp_memory_pressure,
+	.memory_pressure = memory_pressure_sctp,
 	.enter_memory_pressure = sctp_enter_memory_pressure,
-	.memory_allocated = &sctp_memory_allocated,
-	.sockets_allocated = &sctp_sockets_allocated,
+	.memory_allocated = memory_allocated_sctp,
+	.sockets_allocated = sockets_allocated_sctp,
 };
 #endif /* defined(CONFIG_IPV6) || defined(CONFIG_IPV6_MODULE) */
-- 
1.7.6



* [PATCH v2 5/9] foundations of per-cgroup memory pressure controlling.
@ 2011-09-07  4:23   ` Glauber Costa
  0 siblings, 0 replies; 59+ messages in thread
From: Glauber Costa @ 2011-09-07  4:23 UTC (permalink / raw)
  To: linux-kernel
  Cc: linux-mm, containers, netdev, xemul, Glauber Costa,
	David S. Miller, Hiroyouki Kamezawa, Eric W. Biederman

This patch converts the struct proto fields memory_pressure,
memory_allocated, sockets_allocated, and sysctl_mem (now prot_mem)
to function pointers that receive a struct kmem_cgroup parameter.
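
As a rough illustration of that conversion (a minimal userspace sketch,
not the kernel code): a plain long stands in for atomic_long_t, and
struct proto_sketch, memory_allocated_global and memory_allocated_percg
are hypothetical names that do not appear in the patch.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for the kernel's kmem_cgroup. */
struct kmem_cgroup {
	long tcp_allocated;	/* per-cgroup counter (assumed field) */
};

struct proto_sketch {
	/* Before: long *memory_allocated;  (one counter per protocol) */
	/* After: a getter that selects the counter for a given cgroup. */
	long *(*memory_allocated)(struct kmem_cgroup *cg);
};

static long global_allocated;

/* Protocol without per-cgroup accounting: ignore cg, return the global. */
static long *memory_allocated_global(struct kmem_cgroup *cg)
{
	(void)cg;
	return &global_allocated;
}

/* Assumed per-cgroup variant: a NULL cgroup falls back to the global. */
static long *memory_allocated_percg(struct kmem_cgroup *cg)
{
	return cg ? &cg->tcp_allocated : &global_allocated;
}
```

A protocol that does not care about cgroups keeps its old behaviour by
returning its global counter for every cgroup, which is what the
memory_allocated_*() helpers in this patch do.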

enter_memory_pressure is kept the same, since all its callers
have a socket context, and the kmem_cgroup can be derived from
the socket itself.

To keep things working, the patch converts all users of those fields
to use accessor functions.
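
The accessor idea can be sketched as follows. This is an assumption-laden
userspace model, not the definitions used by the series: the sketch_*
names and the struct layouts are hypothetical, and only the sk_prot and
sk_cgrp member names are borrowed from the diff.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-ins for the kernel structures. */
struct kmem_cgroup {
	long allocated;		/* pages currently charged */
	long mem[3];		/* low, pressure, hard limit */
};

struct proto_sketch {
	long *(*memory_allocated)(struct kmem_cgroup *cg);
	long *(*prot_mem)(struct kmem_cgroup *cg);
};

struct sock_sketch {
	struct proto_sketch *sk_prot;
	struct kmem_cgroup *sk_cgrp;
};

static long *percg_allocated(struct kmem_cgroup *cg) { return &cg->allocated; }
static long *percg_mem(struct kmem_cgroup *cg) { return cg->mem; }

/* Accessors: callers pass a socket and never touch the proto fields. */
static long sketch_memory_allocated(const struct sock_sketch *sk)
{
	return *sk->sk_prot->memory_allocated(sk->sk_cgrp);
}

static long sketch_prot_mem(const struct sock_sketch *sk, int index)
{
	return sk->sk_prot->prot_mem(sk->sk_cgrp)[index];
}

/* A converted call site: was "allocated > prot->sysctl_mem[2]". */
static int sketch_over_hard_limit(const struct sock_sketch *sk)
{
	return sketch_memory_allocated(sk) > sketch_prot_mem(sk, 2);
}
```

With this shape, call sites read the same whether or not the protocol
does per-cgroup accounting; only the getters behind the pointers differ.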

In my benchmarks I didn't see a significant performance difference
with this patch applied compared to a baseline (around 1% difference,
which is within the error margin).

Signed-off-by: Glauber Costa <glommer@parallels.com>
CC: David S. Miller <davem@davemloft.net>
CC: Hiroyouki Kamezawa <kamezawa.hiroyu@jp.fujitsu.com>
CC: Eric W. Biederman <ebiederm@xmission.com>
---
 crypto/af_alg.c             |    7 ++++-
 include/net/sock.h          |   27 ++++++++++++++---
 include/net/tcp.h           |   12 +++++--
 include/net/udp.h           |    3 +-
 include/trace/events/sock.h |   10 +++---
 net/core/sock.c             |   65 +++++++++++++++++++++++++------------------
 net/decnet/af_decnet.c      |   21 ++++++++++++--
 net/ipv4/proc.c             |    7 ++--
 net/ipv4/tcp.c              |   30 +++++++++++++++++--
 net/ipv4/tcp_input.c        |   12 ++++----
 net/ipv4/tcp_ipv4.c         |   15 ++++++----
 net/ipv4/tcp_output.c       |    2 +-
 net/ipv4/tcp_timer.c        |    2 +-
 net/ipv4/udp.c              |   20 ++++++++++---
 net/ipv6/tcp_ipv6.c         |   10 +++---
 net/ipv6/udp.c              |    4 +-
 net/sctp/socket.c           |   35 ++++++++++++++++++-----
 17 files changed, 195 insertions(+), 87 deletions(-)

diff --git a/crypto/af_alg.c b/crypto/af_alg.c
index ac33d5f..df168d8 100644
--- a/crypto/af_alg.c
+++ b/crypto/af_alg.c
@@ -29,10 +29,15 @@ struct alg_type_list {
 
 static atomic_long_t alg_memory_allocated;
 
+static atomic_long_t *memory_allocated_alg(struct kmem_cgroup *sg)
+{
+	return &alg_memory_allocated;
+}
+
 static struct proto alg_proto = {
 	.name			= "ALG",
 	.owner			= THIS_MODULE,
-	.memory_allocated	= &alg_memory_allocated,
+	.memory_allocated	= memory_allocated_alg,
 	.obj_size		= sizeof(struct alg_sock),
 };
 
diff --git a/include/net/sock.h b/include/net/sock.h
index 709382f..ab65640 100644
--- a/include/net/sock.h
+++ b/include/net/sock.h
@@ -54,6 +54,7 @@
 #include <linux/security.h>
 #include <linux/slab.h>
 #include <linux/uaccess.h>
+#include <linux/cgroup.h>
 
 #include <linux/filter.h>
 #include <linux/rculist_nulls.h>
@@ -168,6 +169,8 @@ struct sock_common {
 	/* public: */
 };
 
+struct kmem_cgroup;
+
 /**
   *	struct sock - network layer representation of sockets
   *	@__sk_common: shared layout with inet_timewait_sock
@@ -786,18 +789,32 @@ struct proto {
 	unsigned int		inuse_idx;
 #endif
 
+	/*
+	 * per-cgroup memory tracking:
+	 *
+	 * The following functions track memory consumption of network buffers
+	 * by cgroup (kmem_cgroup) for the current protocol. As with the rest
+	 * of the fields in this structure, not all protocols are required
+	 * to implement them. Protocols that don't want to do per-cgroup
+	 * memory pressure management can just assume the root cgroup is used.
+	 *
+	 */
 	/* Memory pressure */
 	void			(*enter_memory_pressure)(struct sock *sk);
-	atomic_long_t		*memory_allocated;	/* Current allocated memory. */
-	struct percpu_counter	*sockets_allocated;	/* Current number of sockets. */
+	/* Pointer to the current memory allocation of this cgroup. */
+	atomic_long_t		*(*memory_allocated)(struct kmem_cgroup *sg);
+	/* Pointer to the current number of sockets in this cgroup. */
+	struct percpu_counter	*(*sockets_allocated)(struct kmem_cgroup *sg);
 	/*
-	 * Pressure flag: try to collapse.
+	 * Per cgroup pointer to the pressure flag: try to collapse.
 	 * Technical note: it is used by multiple contexts non atomically.
 	 * All the __sk_mem_schedule() is of this nature: accounting
 	 * is strict, actions are advisory and have some latency.
 	 */
-	int			*memory_pressure;
-	long			*sysctl_mem;
+	int			*(*memory_pressure)(struct kmem_cgroup *sg);
+	/* Pointer to the per-cgroup version of the sysctl_mem field */
+	long			*(*prot_mem)(struct kmem_cgroup *sg);
+
 	int			*sysctl_wmem;
 	int			*sysctl_rmem;
 	int			max_header;
diff --git a/include/net/tcp.h b/include/net/tcp.h
index 6bfdd9b..06b6865 100644
--- a/include/net/tcp.h
+++ b/include/net/tcp.h
@@ -45,6 +45,7 @@
 #include <net/dst.h>
 
 #include <linux/seq_file.h>
+#include <linux/kmem_cgroup.h>
 
 extern struct inet_hashinfo tcp_hashinfo;
 
@@ -252,9 +253,12 @@ extern int sysctl_tcp_cookie_size;
 extern int sysctl_tcp_thin_linear_timeouts;
 extern int sysctl_tcp_thin_dupack;
 
-extern atomic_long_t tcp_memory_allocated;
-extern struct percpu_counter tcp_sockets_allocated;
-extern int tcp_memory_pressure;
+struct kmem_cgroup;
+extern long *tcp_sysctl_mem(struct kmem_cgroup *sg);
+struct percpu_counter *sockets_allocated_tcp(struct kmem_cgroup *sg);
+int *memory_pressure_tcp(struct kmem_cgroup *sg);
+int tcp_init_cgroup(struct cgroup *cgrp, struct cgroup_subsys *ss);
+atomic_long_t *memory_allocated_tcp(struct kmem_cgroup *sg);
 
 /*
  * The next routines deal with comparing 32 bit unsigned ints
@@ -285,7 +289,7 @@ static inline bool tcp_too_many_orphans(struct sock *sk, int shift)
 	}
 
 	if (sk->sk_wmem_queued > SOCK_MIN_SNDBUF &&
-	    atomic_long_read(&tcp_memory_allocated) > sysctl_tcp_mem[2])
+	    sk_memory_allocated(sk) > sk_prot_mem(sk, 2))
 		return true;
 	return false;
 }
diff --git a/include/net/udp.h b/include/net/udp.h
index 67ea6fc..0e27388 100644
--- a/include/net/udp.h
+++ b/include/net/udp.h
@@ -105,7 +105,8 @@ static inline struct udp_hslot *udp_hashslot2(struct udp_table *table,
 
 extern struct proto udp_prot;
 
-extern atomic_long_t udp_memory_allocated;
+atomic_long_t *memory_allocated_udp(struct kmem_cgroup *sg);
+long *udp_sysctl_mem(struct kmem_cgroup *sg);
 
 /* sysctl variables for udp */
 extern long sysctl_udp_mem[3];
diff --git a/include/trace/events/sock.h b/include/trace/events/sock.h
index 779abb9..12a6083 100644
--- a/include/trace/events/sock.h
+++ b/include/trace/events/sock.h
@@ -37,7 +37,7 @@ TRACE_EVENT(sock_exceed_buf_limit,
 
 	TP_STRUCT__entry(
 		__array(char, name, 32)
-		__field(long *, sysctl_mem)
+		__field(long *, prot_mem)
 		__field(long, allocated)
 		__field(int, sysctl_rmem)
 		__field(int, rmem_alloc)
@@ -45,7 +45,7 @@ TRACE_EVENT(sock_exceed_buf_limit,
 
 	TP_fast_assign(
 		strncpy(__entry->name, prot->name, 32);
-		__entry->sysctl_mem = prot->sysctl_mem;
+		__entry->prot_mem = sk->sk_prot->prot_mem(sk->sk_cgrp);
 		__entry->allocated = allocated;
 		__entry->sysctl_rmem = prot->sysctl_rmem[0];
 		__entry->rmem_alloc = atomic_read(&sk->sk_rmem_alloc);
@@ -54,9 +54,9 @@ TRACE_EVENT(sock_exceed_buf_limit,
 	TP_printk("proto:%s sysctl_mem=%ld,%ld,%ld allocated=%ld "
 		"sysctl_rmem=%d rmem_alloc=%d",
 		__entry->name,
-		__entry->sysctl_mem[0],
-		__entry->sysctl_mem[1],
-		__entry->sysctl_mem[2],
+		__entry->prot_mem[0],
+		__entry->prot_mem[1],
+		__entry->prot_mem[2],
 		__entry->allocated,
 		__entry->sysctl_rmem,
 		__entry->rmem_alloc)
diff --git a/net/core/sock.c b/net/core/sock.c
index 7109864..ead9c02 100644
--- a/net/core/sock.c
+++ b/net/core/sock.c
@@ -1290,7 +1290,7 @@ struct sock *sk_clone(const struct sock *sk, const gfp_t priority)
 		newsk->sk_wq = NULL;
 
 		if (newsk->sk_prot->sockets_allocated)
-			percpu_counter_inc(newsk->sk_prot->sockets_allocated);
+			sk_sockets_allocated_inc(newsk);
 
 		if (sock_flag(newsk, SOCK_TIMESTAMP) ||
 		    sock_flag(newsk, SOCK_TIMESTAMPING_RX_SOFTWARE))
@@ -1681,30 +1681,33 @@ int __sk_mem_schedule(struct sock *sk, int size, int kind)
 	struct proto *prot = sk->sk_prot;
 	int amt = sk_mem_pages(size);
 	long allocated;
+	int *memory_pressure;
+	int parent_failure = 0;
 
 	sk->sk_forward_alloc += amt * SK_MEM_QUANTUM;
-	allocated = atomic_long_add_return(amt, prot->memory_allocated);
 
-	/* Under limit. */
-	if (allocated <= prot->sysctl_mem[0]) {
-		if (prot->memory_pressure && *prot->memory_pressure)
-			*prot->memory_pressure = 0;
-		return 1;
-	}
+	memory_pressure = sk_memory_pressure(sk);
+	allocated = sk_memory_allocated_add(sk, amt, &parent_failure);
+
+	/* Over hard limit (we, or our parents) */
+	if (parent_failure || (allocated > sk_prot_mem(sk, 2)))
+		goto suppress_allocation;
 
-	/* Under pressure. */
-	if (allocated > prot->sysctl_mem[1])
+	/* Under limit. */
+	if (allocated <= sk_prot_mem(sk, 0))
+		if (memory_pressure && *memory_pressure)
+			*memory_pressure = 0;
+
+	/* Under pressure. */
+	if (allocated > sk_prot_mem(sk, 1))
 		if (prot->enter_memory_pressure)
 			prot->enter_memory_pressure(sk);
 
-	/* Over hard limit. */
-	if (allocated > prot->sysctl_mem[2])
-		goto suppress_allocation;
-
 	/* guarantee minimum buffer size under pressure */
 	if (kind == SK_MEM_RECV) {
 		if (atomic_read(&sk->sk_rmem_alloc) < prot->sysctl_rmem[0])
 			return 1;
+
 	} else { /* SK_MEM_SEND */
 		if (sk->sk_type == SOCK_STREAM) {
 			if (sk->sk_wmem_queued < prot->sysctl_wmem[0])
@@ -1714,13 +1717,13 @@ int __sk_mem_schedule(struct sock *sk, int size, int kind)
 				return 1;
 	}
 
-	if (prot->memory_pressure) {
+	if (memory_pressure) {
 		int alloc;
 
-		if (!*prot->memory_pressure)
+		if (!*memory_pressure)
 			return 1;
-		alloc = percpu_counter_read_positive(prot->sockets_allocated);
-		if (prot->sysctl_mem[2] > alloc *
+		alloc = sk_sockets_allocated_read_positive(sk);
+		if (sk_prot_mem(sk, 2) > alloc *
 		    sk_mem_pages(sk->sk_wmem_queued +
 				 atomic_read(&sk->sk_rmem_alloc) +
 				 sk->sk_forward_alloc))
@@ -1743,7 +1746,9 @@ suppress_allocation:
 
 	/* Alas. Undo changes. */
 	sk->sk_forward_alloc -= amt * SK_MEM_QUANTUM;
-	atomic_long_sub(amt, prot->memory_allocated);
+
+	sk_memory_allocated_sub(sk, amt);
+
 	return 0;
 }
 EXPORT_SYMBOL(__sk_mem_schedule);
@@ -1754,15 +1759,14 @@ EXPORT_SYMBOL(__sk_mem_schedule);
  */
 void __sk_mem_reclaim(struct sock *sk)
 {
-	struct proto *prot = sk->sk_prot;
+	int *memory_pressure = sk_memory_pressure(sk);
 
-	atomic_long_sub(sk->sk_forward_alloc >> SK_MEM_QUANTUM_SHIFT,
-		   prot->memory_allocated);
+	sk_memory_allocated_sub(sk, sk->sk_forward_alloc >> SK_MEM_QUANTUM_SHIFT);
 	sk->sk_forward_alloc &= SK_MEM_QUANTUM - 1;
 
-	if (prot->memory_pressure && *prot->memory_pressure &&
-	    (atomic_long_read(prot->memory_allocated) < prot->sysctl_mem[0]))
-		*prot->memory_pressure = 0;
+	if (memory_pressure && *memory_pressure &&
+	    (sk_memory_allocated(sk) < sk_prot_mem(sk, 0)))
+		*memory_pressure = 0;
 }
 EXPORT_SYMBOL(__sk_mem_reclaim);
 
@@ -2478,13 +2482,20 @@ static char proto_method_implemented(const void *method)
 
 static void proto_seq_printf(struct seq_file *seq, struct proto *proto)
 {
+	struct kmem_cgroup *cg = kcg_from_task(current);
+	int *memory_pressure = NULL;
+
+	if (proto->memory_pressure)
+		memory_pressure = proto->memory_pressure(cg);
+
 	seq_printf(seq, "%-9s %4u %6d  %6ld   %-3s %6u   %-3s  %-10s "
 			"%2c %2c %2c %2c %2c %2c %2c %2c %2c %2c %2c %2c %2c %2c %2c %2c %2c %2c %2c\n",
 		   proto->name,
 		   proto->obj_size,
 		   sock_prot_inuse_get(seq_file_net(seq), proto),
-		   proto->memory_allocated != NULL ? atomic_long_read(proto->memory_allocated) : -1L,
-		   proto->memory_pressure != NULL ? *proto->memory_pressure ? "yes" : "no" : "NI",
+		   proto->memory_allocated != NULL ?
+			kcg_memory_allocated(proto, cg) : -1L,
+		   memory_pressure != NULL ? *memory_pressure ? "yes" : "no" : "NI",
 		   proto->max_header,
 		   proto->slab == NULL ? "no" : "yes",
 		   module_name(proto->owner),
diff --git a/net/decnet/af_decnet.c b/net/decnet/af_decnet.c
index 19acd00..463b299 100644
--- a/net/decnet/af_decnet.c
+++ b/net/decnet/af_decnet.c
@@ -458,13 +458,28 @@ static void dn_enter_memory_pressure(struct sock *sk)
 	}
 }
 
+static atomic_long_t *memory_allocated_dn(struct kmem_cgroup *sg)
+{
+	return &decnet_memory_allocated;
+}
+
+static int *memory_pressure_dn(struct kmem_cgroup *sg)
+{
+	return &dn_memory_pressure;
+}
+
+static long *dn_sysctl_mem(struct kmem_cgroup *sg)
+{
+	return sysctl_decnet_mem;
+}
+
 static struct proto dn_proto = {
 	.name			= "NSP",
 	.owner			= THIS_MODULE,
 	.enter_memory_pressure	= dn_enter_memory_pressure,
-	.memory_pressure	= &dn_memory_pressure,
-	.memory_allocated	= &decnet_memory_allocated,
-	.sysctl_mem		= sysctl_decnet_mem,
+	.memory_pressure	= memory_pressure_dn,
+	.memory_allocated	= memory_allocated_dn,
+	.prot_mem		= dn_sysctl_mem,
 	.sysctl_wmem		= sysctl_decnet_wmem,
 	.sysctl_rmem		= sysctl_decnet_rmem,
 	.max_header		= DN_MAX_NSP_DATA_HEADER + 64,
diff --git a/net/ipv4/proc.c b/net/ipv4/proc.c
index b14ec7d..ebe938f 100644
--- a/net/ipv4/proc.c
+++ b/net/ipv4/proc.c
@@ -52,20 +52,21 @@ static int sockstat_seq_show(struct seq_file *seq, void *v)
 {
 	struct net *net = seq->private;
 	int orphans, sockets;
+	struct kmem_cgroup *cg = kcg_from_task(current);
 
 	local_bh_disable();
 	orphans = percpu_counter_sum_positive(&tcp_orphan_count);
-	sockets = percpu_counter_sum_positive(&tcp_sockets_allocated);
+	sockets = kcg_sockets_allocated_sum_positive(&tcp_prot, cg);
 	local_bh_enable();
 
 	socket_seq_show(seq);
 	seq_printf(seq, "TCP: inuse %d orphan %d tw %d alloc %d mem %ld\n",
 		   sock_prot_inuse_get(net, &tcp_prot), orphans,
 		   tcp_death_row.tw_count, sockets,
-		   atomic_long_read(&tcp_memory_allocated));
+		   kcg_memory_allocated(&tcp_prot, cg));
 	seq_printf(seq, "UDP: inuse %d mem %ld\n",
 		   sock_prot_inuse_get(net, &udp_prot),
-		   atomic_long_read(&udp_memory_allocated));
+		   kcg_memory_allocated(&udp_prot, cg));
 	seq_printf(seq, "UDPLITE: inuse %d\n",
 		   sock_prot_inuse_get(net, &udplite_prot));
 	seq_printf(seq, "RAW: inuse %d\n",
diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
index f06df24..76f03ed 100644
--- a/net/ipv4/tcp.c
+++ b/net/ipv4/tcp.c
@@ -290,13 +290,11 @@ EXPORT_SYMBOL(sysctl_tcp_rmem);
 EXPORT_SYMBOL(sysctl_tcp_wmem);
 
 atomic_long_t tcp_memory_allocated;	/* Current allocated memory. */
-EXPORT_SYMBOL(tcp_memory_allocated);
 
 /*
  * Current number of TCP sockets.
  */
 struct percpu_counter tcp_sockets_allocated;
-EXPORT_SYMBOL(tcp_sockets_allocated);
 
 /*
  * TCP splice context
@@ -314,16 +312,40 @@ struct tcp_splice_state {
  * is strict, actions are advisory and have some latency.
  */
 int tcp_memory_pressure __read_mostly;
-EXPORT_SYMBOL(tcp_memory_pressure);
 
-void tcp_enter_memory_pressure(struct sock *sk)
+int *memory_pressure_tcp(struct kmem_cgroup *sg)
+{
+	return &tcp_memory_pressure;
+}
+
+struct percpu_counter *sockets_allocated_tcp(struct kmem_cgroup *sg)
+{
+	return &tcp_sockets_allocated;
+}
+
+void tcp_enter_memory_pressure(struct sock *sk)
 {
 	if (!tcp_memory_pressure) {
 		NET_INC_STATS(sock_net(sk), LINUX_MIB_TCPMEMORYPRESSURES);
 		tcp_memory_pressure = 1;
 	}
 }
+
+long *tcp_sysctl_mem(struct kmem_cgroup *sg)
+{
+	return init_net.ipv4.sysctl_tcp_mem;
+}
+
+atomic_long_t *memory_allocated_tcp(struct kmem_cgroup *sg)
+{
+	return &tcp_memory_allocated;
+}
+
+EXPORT_SYMBOL(memory_pressure_tcp);
+EXPORT_SYMBOL(sockets_allocated_tcp);
 EXPORT_SYMBOL(tcp_enter_memory_pressure);
+EXPORT_SYMBOL(tcp_sysctl_mem);
+EXPORT_SYMBOL(memory_allocated_tcp);
 
 /* Convert seconds to retransmits based on initial and max timeout */
 static u8 secs_to_retrans(int seconds, int timeout, int rto_max)
diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
index ea0d218..3f17423 100644
--- a/net/ipv4/tcp_input.c
+++ b/net/ipv4/tcp_input.c
@@ -316,7 +316,7 @@ static void tcp_grow_window(struct sock *sk, struct sk_buff *skb)
 	/* Check #1 */
 	if (tp->rcv_ssthresh < tp->window_clamp &&
 	    (int)tp->rcv_ssthresh < tcp_space(sk) &&
-	    !tcp_memory_pressure) {
+	    !sk_memory_pressure(sk)) {
 		int incr;
 
 		/* Check #2. Increase window, if skb with such overhead
@@ -398,8 +398,8 @@ static void tcp_clamp_window(struct sock *sk)
 
 	if (sk->sk_rcvbuf < sysctl_tcp_rmem[2] &&
 	    !(sk->sk_userlocks & SOCK_RCVBUF_LOCK) &&
-	    !tcp_memory_pressure &&
-	    atomic_long_read(&tcp_memory_allocated) < sysctl_tcp_mem[0]) {
+	    !sk_memory_pressure(sk) &&
+	    sk_memory_allocated(sk) < sk_prot_mem(sk, 0)) {
 		sk->sk_rcvbuf = min(atomic_read(&sk->sk_rmem_alloc),
 				    sysctl_tcp_rmem[2]);
 	}
@@ -4806,7 +4806,7 @@ static int tcp_prune_queue(struct sock *sk)
 
 	if (atomic_read(&sk->sk_rmem_alloc) >= sk->sk_rcvbuf)
 		tcp_clamp_window(sk);
-	else if (tcp_memory_pressure)
+	else if (sk_memory_pressure(sk))
 		tp->rcv_ssthresh = min(tp->rcv_ssthresh, 4U * tp->advmss);
 
 	tcp_collapse_ofo_queue(sk);
@@ -4872,11 +4872,11 @@ static int tcp_should_expand_sndbuf(struct sock *sk)
 		return 0;
 
 	/* If we are under global TCP memory pressure, do not expand.  */
-	if (tcp_memory_pressure)
+	if (sk_memory_pressure(sk))
 		return 0;
 
 	/* If we are under soft global TCP memory pressure, do not expand.  */
-	if (atomic_long_read(&tcp_memory_allocated) >= sysctl_tcp_mem[0])
+	if (sk_memory_allocated(sk) >= sk_prot_mem(sk, 0))
 		return 0;
 
 	/* If we filled the congestion window, do not expand.  */
diff --git a/net/ipv4/tcp_ipv4.c b/net/ipv4/tcp_ipv4.c
index 1c12b8e..69a02fa 100644
--- a/net/ipv4/tcp_ipv4.c
+++ b/net/ipv4/tcp_ipv4.c
@@ -1901,7 +1901,7 @@ static int tcp_v4_init_sock(struct sock *sk)
 	sk->sk_rcvbuf = sysctl_tcp_rmem[1];
 
 	local_bh_disable();
-	percpu_counter_inc(&tcp_sockets_allocated);
+	sk_sockets_allocated_inc(sk);
 	local_bh_enable();
 
 	return 0;
@@ -1957,7 +1957,7 @@ void tcp_v4_destroy_sock(struct sock *sk)
 		tp->cookie_values = NULL;
 	}
 
-	percpu_counter_dec(&tcp_sockets_allocated);
+	sk_sockets_allocated_dec(sk);
 }
 EXPORT_SYMBOL(tcp_v4_destroy_sock);
 
@@ -2598,11 +2598,14 @@ struct proto tcp_prot = {
 	.unhash			= inet_unhash,
 	.get_port		= inet_csk_get_port,
 	.enter_memory_pressure	= tcp_enter_memory_pressure,
-	.sockets_allocated	= &tcp_sockets_allocated,
+	.memory_pressure	= memory_pressure_tcp,
+	.sockets_allocated	= sockets_allocated_tcp,
 	.orphan_count		= &tcp_orphan_count,
-	.memory_allocated	= &tcp_memory_allocated,
-	.memory_pressure	= &tcp_memory_pressure,
-	.sysctl_mem		= sysctl_tcp_mem,
+	.memory_allocated	= memory_allocated_tcp,
+#ifdef CONFIG_CGROUP_KMEM
+	.init_cgroup		= tcp_init_cgroup,
+#endif
+	.prot_mem		= tcp_sysctl_mem,
 	.sysctl_wmem		= sysctl_tcp_wmem,
 	.sysctl_rmem		= sysctl_tcp_rmem,
 	.max_header		= MAX_TCP_HEADER,
diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c
index 882e0b0..06aeb31 100644
--- a/net/ipv4/tcp_output.c
+++ b/net/ipv4/tcp_output.c
@@ -1912,7 +1912,7 @@ u32 __tcp_select_window(struct sock *sk)
 	if (free_space < (full_space >> 1)) {
 		icsk->icsk_ack.quick = 0;
 
-		if (tcp_memory_pressure)
+		if (sk_memory_pressure(sk))
 			tp->rcv_ssthresh = min(tp->rcv_ssthresh,
 					       4U * tp->advmss);
 
diff --git a/net/ipv4/tcp_timer.c b/net/ipv4/tcp_timer.c
index ecd44b0..2c67617 100644
--- a/net/ipv4/tcp_timer.c
+++ b/net/ipv4/tcp_timer.c
@@ -261,7 +261,7 @@ static void tcp_delack_timer(unsigned long data)
 	}
 
 out:
-	if (tcp_memory_pressure)
+	if (sk_memory_pressure(sk))
 		sk_mem_reclaim(sk);
 out_unlock:
 	bh_unlock_sock(sk);
diff --git a/net/ipv4/udp.c b/net/ipv4/udp.c
index 1b5a193..6c08c65 100644
--- a/net/ipv4/udp.c
+++ b/net/ipv4/udp.c
@@ -120,9 +120,6 @@ EXPORT_SYMBOL(sysctl_udp_rmem_min);
 int sysctl_udp_wmem_min __read_mostly;
 EXPORT_SYMBOL(sysctl_udp_wmem_min);
 
-atomic_long_t udp_memory_allocated;
-EXPORT_SYMBOL(udp_memory_allocated);
-
 #define MAX_UDP_PORTS 65536
 #define PORTS_PER_CHAIN (MAX_UDP_PORTS / UDP_HTABLE_SIZE_MIN)
 
@@ -1918,6 +1915,19 @@ unsigned int udp_poll(struct file *file, struct socket *sock, poll_table *wait)
 }
 EXPORT_SYMBOL(udp_poll);
 
+static atomic_long_t udp_memory_allocated;
+atomic_long_t *memory_allocated_udp(struct kmem_cgroup *sg)
+{
+	return &udp_memory_allocated;
+}
+EXPORT_SYMBOL(memory_allocated_udp);
+
+long *udp_sysctl_mem(struct kmem_cgroup *sg)
+{
+	return sysctl_udp_mem;
+}
+EXPORT_SYMBOL(udp_sysctl_mem);
+
 struct proto udp_prot = {
 	.name		   = "UDP",
 	.owner		   = THIS_MODULE,
@@ -1936,8 +1946,8 @@ struct proto udp_prot = {
 	.unhash		   = udp_lib_unhash,
 	.rehash		   = udp_v4_rehash,
 	.get_port	   = udp_v4_get_port,
-	.memory_allocated  = &udp_memory_allocated,
-	.sysctl_mem	   = sysctl_udp_mem,
+	.memory_allocated  = memory_allocated_udp,
+	.prot_mem	   = udp_sysctl_mem,
 	.sysctl_wmem	   = &sysctl_udp_wmem_min,
 	.sysctl_rmem	   = &sysctl_udp_rmem_min,
 	.obj_size	   = sizeof(struct udp_sock),
diff --git a/net/ipv6/tcp_ipv6.c b/net/ipv6/tcp_ipv6.c
index d1fb63f..807797a 100644
--- a/net/ipv6/tcp_ipv6.c
+++ b/net/ipv6/tcp_ipv6.c
@@ -2012,7 +2012,7 @@ static int tcp_v6_init_sock(struct sock *sk)
 	sk->sk_rcvbuf = sysctl_tcp_rmem[1];
 
 	local_bh_disable();
-	percpu_counter_inc(&tcp_sockets_allocated);
+	sk_sockets_allocated_inc(sk);
 	local_bh_enable();
 
 	return 0;
@@ -2221,11 +2221,11 @@ struct proto tcpv6_prot = {
 	.unhash			= inet_unhash,
 	.get_port		= inet_csk_get_port,
 	.enter_memory_pressure	= tcp_enter_memory_pressure,
-	.sockets_allocated	= &tcp_sockets_allocated,
-	.memory_allocated	= &tcp_memory_allocated,
-	.memory_pressure	= &tcp_memory_pressure,
+	.sockets_allocated	= sockets_allocated_tcp,
+	.memory_allocated	= memory_allocated_tcp,
+	.memory_pressure	= memory_pressure_tcp,
 	.orphan_count		= &tcp_orphan_count,
-	.sysctl_mem		= sysctl_tcp_mem,
+	.prot_mem		= tcp_sysctl_mem,
 	.sysctl_wmem		= sysctl_tcp_wmem,
 	.sysctl_rmem		= sysctl_tcp_rmem,
 	.max_header		= MAX_TCP_HEADER,
diff --git a/net/ipv6/udp.c b/net/ipv6/udp.c
index 29213b5..ef4b5b3 100644
--- a/net/ipv6/udp.c
+++ b/net/ipv6/udp.c
@@ -1465,8 +1465,8 @@ struct proto udpv6_prot = {
 	.unhash		   = udp_lib_unhash,
 	.rehash		   = udp_v6_rehash,
 	.get_port	   = udp_v6_get_port,
-	.memory_allocated  = &udp_memory_allocated,
-	.sysctl_mem	   = sysctl_udp_mem,
+	.memory_allocated  = memory_allocated_udp,
+	.prot_mem	   = udp_sysctl_mem,
 	.sysctl_wmem	   = &sysctl_udp_wmem_min,
 	.sysctl_rmem	   = &sysctl_udp_rmem_min,
 	.obj_size	   = sizeof(struct udp6_sock),
diff --git a/net/sctp/socket.c b/net/sctp/socket.c
index 836aa63..1b0300d 100644
--- a/net/sctp/socket.c
+++ b/net/sctp/socket.c
@@ -119,11 +119,30 @@ static int sctp_memory_pressure;
 static atomic_long_t sctp_memory_allocated;
 struct percpu_counter sctp_sockets_allocated;
 
+static long *sctp_sysctl_mem(struct kmem_cgroup *sg)
+{
+	return sysctl_sctp_mem;
+}
+
 static void sctp_enter_memory_pressure(struct sock *sk)
 {
 	sctp_memory_pressure = 1;
 }
 
+static int *memory_pressure_sctp(struct kmem_cgroup *sg)
+{
+	return &sctp_memory_pressure;
+}
+
+static atomic_long_t *memory_allocated_sctp(struct kmem_cgroup *sg)
+{
+	return &sctp_memory_allocated;
+}
+
+static struct percpu_counter *sockets_allocated_sctp(struct kmem_cgroup *sg)
+{
+	return &sctp_sockets_allocated;
+}
 
 /* Get the sndbuf space available at the time on the association.  */
 static inline int sctp_wspace(struct sctp_association *asoc)
@@ -6831,13 +6850,13 @@ struct proto sctp_prot = {
 	.unhash      =	sctp_unhash,
 	.get_port    =	sctp_get_port,
 	.obj_size    =  sizeof(struct sctp_sock),
-	.sysctl_mem  =  sysctl_sctp_mem,
+	.prot_mem    =  sctp_sysctl_mem,
 	.sysctl_rmem =  sysctl_sctp_rmem,
 	.sysctl_wmem =  sysctl_sctp_wmem,
-	.memory_pressure = &sctp_memory_pressure,
+	.memory_pressure = memory_pressure_sctp,
 	.enter_memory_pressure = sctp_enter_memory_pressure,
-	.memory_allocated = &sctp_memory_allocated,
-	.sockets_allocated = &sctp_sockets_allocated,
+	.memory_allocated = memory_allocated_sctp,
+	.sockets_allocated = sockets_allocated_sctp,
 };
 
 #if defined(CONFIG_IPV6) || defined(CONFIG_IPV6_MODULE)
@@ -6863,12 +6882,12 @@ struct proto sctpv6_prot = {
 	.unhash		= sctp_unhash,
 	.get_port	= sctp_get_port,
 	.obj_size	= sizeof(struct sctp6_sock),
-	.sysctl_mem	= sysctl_sctp_mem,
+	.prot_mem	= sctp_sysctl_mem,
 	.sysctl_rmem	= sysctl_sctp_rmem,
 	.sysctl_wmem	= sysctl_sctp_wmem,
-	.memory_pressure = &sctp_memory_pressure,
+	.memory_pressure = memory_pressure_sctp,
 	.enter_memory_pressure = sctp_enter_memory_pressure,
-	.memory_allocated = &sctp_memory_allocated,
-	.sockets_allocated = &sctp_sockets_allocated,
+	.memory_allocated = memory_allocated_sctp,
+	.sockets_allocated = sockets_allocated_sctp,
 };
 #endif /* defined(CONFIG_IPV6) || defined(CONFIG_IPV6_MODULE) */
-- 
1.7.6


^ permalink raw reply related	[flat|nested] 59+ messages in thread

* [PATCH v2 6/9] per-cgroup tcp buffers control
  2011-09-07  4:23 ` Glauber Costa
@ 2011-09-07  4:23   ` Glauber Costa
  -1 siblings, 0 replies; 59+ messages in thread
From: Glauber Costa @ 2011-09-07  4:23 UTC (permalink / raw)
  To: linux-kernel
  Cc: linux-mm, containers, netdev, xemul, Glauber Costa,
	David S. Miller, Hiroyouki Kamezawa, Eric W. Biederman

With all the infrastructure in place, this patch implements
per-cgroup control for tcp memory pressure handling.

Signed-off-by: Glauber Costa <glommer@parallels.com>
CC: David S. Miller <davem@davemloft.net>
CC: Hiroyouki Kamezawa <kamezawa.hiroyu@jp.fujitsu.com>
CC: Eric W. Biederman <ebiederm@xmission.com>
---
 include/linux/kmem_cgroup.h |    7 ++++
 include/net/sock.h          |   10 ++++++-
 mm/kmem_cgroup.c            |   10 ++++++-
 net/core/sock.c             |   18 +++++++++++
 net/ipv4/tcp.c              |   67 +++++++++++++++++++++++++++++++++++++-----
 5 files changed, 102 insertions(+), 10 deletions(-)
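
Reader's note: the change below replaces pointers to global TCP counters
with accessor callbacks that take a kmem_cgroup, so each group gets its own
pressure flag, allocation counter and thresholds. A minimal userspace sketch
of that indirection follows. The names group_state, proto_ops and
charge_pages are hypothetical and not part of the patch, and the accounting
step is a deliberate simplification, not the kernel's __sk_mem_schedule().

```c
/* Hypothetical stand-in for the TCP fields added to struct kmem_cgroup. */
struct group_state {
	int  memory_pressure;   /* set once the group crosses its pressure threshold */
	long memory_allocated;  /* pages currently charged to this group */
	long prot_mem[3];       /* min, pressure and max thresholds, in pages */
};

/* Hypothetical stand-in for struct proto: accessors replace global pointers. */
struct proto_ops {
	int  *(*memory_pressure)(struct group_state *g);
	long *(*memory_allocated)(struct group_state *g);
	long *(*prot_mem)(struct group_state *g);
};

static int *tcp_pressure(struct group_state *g)   { return &g->memory_pressure; }
static long *tcp_allocated(struct group_state *g) { return &g->memory_allocated; }
static long *tcp_limits(struct group_state *g)    { return g->prot_mem; }

static const struct proto_ops tcp_ops = {
	.memory_pressure  = tcp_pressure,
	.memory_allocated = tcp_allocated,
	.prot_mem         = tcp_limits,
};

/*
 * Simplified accounting step (assumption, for illustration only):
 * refuse the charge above the hard limit, otherwise account it and flag
 * pressure once usage exceeds the pressure threshold.
 * Returns 0 on success, -1 if the charge was refused.
 */
static int charge_pages(const struct proto_ops *ops, struct group_state *g,
			long pages)
{
	long *allocated = ops->memory_allocated(g);
	const long *limits = ops->prot_mem(g);

	if (*allocated + pages > limits[2])
		return -1;

	*allocated += pages;
	if (*allocated > limits[1])
		*ops->memory_pressure(g) = 1;
	return 0;
}
```

Because every lookup goes through the group argument, charging one group
never touches the counters or the pressure flag of another.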

diff --git a/include/linux/kmem_cgroup.h b/include/linux/kmem_cgroup.h
index d983ba8..89ad0a1 100644
--- a/include/linux/kmem_cgroup.h
+++ b/include/linux/kmem_cgroup.h
@@ -23,6 +23,13 @@
 struct kmem_cgroup {
 	struct cgroup_subsys_state css;
 	struct kmem_cgroup *parent;
+
+#ifdef CONFIG_INET
+	int tcp_memory_pressure;
+	atomic_long_t tcp_memory_allocated;
+	struct percpu_counter tcp_sockets_allocated;
+	long tcp_prot_mem[3];
+#endif
 };
 
 
diff --git a/include/net/sock.h b/include/net/sock.h
index ab65640..91424e3 100644
--- a/include/net/sock.h
+++ b/include/net/sock.h
@@ -64,6 +64,7 @@
 #include <net/dst.h>
 #include <net/checksum.h>
 
+int sockets_populate(struct cgroup_subsys *ss, struct cgroup *cgrp);
 /*
  * This structure really needs to be cleaned up.
  * Most of it is for TCP, and not used by any of
@@ -814,7 +815,14 @@ struct proto {
 	int			*(*memory_pressure)(struct kmem_cgroup *sg);
 	/* Pointer to the per-cgroup version of the the sysctl_mem field */
 	long			*(*prot_mem)(struct kmem_cgroup *sg);
-
+	/*
+	 * cgroup-specific initialization function. Called once for each
+	 * protocol that implements it, from the cgroup's populate function.
+	 * This function has to set up any files the protocol wants to
+	 * appear in the kmem cgroup filesystem.
+	 */
+	int			(*init_cgroup)(struct cgroup *cgrp,
+					       struct cgroup_subsys *ss);
 	int			*sysctl_wmem;
 	int			*sysctl_rmem;
 	int			max_header;
diff --git a/mm/kmem_cgroup.c b/mm/kmem_cgroup.c
index 7950e69..5e53d66 100644
--- a/mm/kmem_cgroup.c
+++ b/mm/kmem_cgroup.c
@@ -17,16 +17,24 @@
 #include <linux/cgroup.h>
 #include <linux/slab.h>
 #include <linux/kmem_cgroup.h>
+#include <net/sock.h>
 
 static int kmem_populate(struct cgroup_subsys *ss, struct cgroup *cgrp)
 {
-	return 0;
+	int ret = 0;
+#ifdef CONFIG_NET
+	ret = sockets_populate(ss, cgrp);
+#endif
+	return ret;
 }
 
 static void
 kmem_destroy(struct cgroup_subsys *ss, struct cgroup *cgrp)
 {
 	struct kmem_cgroup *cg = kcg_from_cgroup(cgrp);
+#ifdef CONFIG_INET
+	percpu_counter_destroy(&cg->tcp_sockets_allocated);
+#endif
 	kfree(cg);
 }
 
diff --git a/net/core/sock.c b/net/core/sock.c
index ead9c02..9d833cf 100644
--- a/net/core/sock.c
+++ b/net/core/sock.c
@@ -134,6 +134,24 @@
 #include <net/tcp.h>
 #endif
 
+static DEFINE_RWLOCK(proto_list_lock);
+static LIST_HEAD(proto_list);
+
+int sockets_populate(struct cgroup_subsys *ss, struct cgroup *cgrp)
+{
+	struct proto *proto;
+	int ret = 0;
+
+	read_lock(&proto_list_lock);
+	list_for_each_entry(proto, &proto_list, node) {
+		if (proto->init_cgroup)
+			ret |= proto->init_cgroup(cgrp, ss);
+	}
+	read_unlock(&proto_list_lock);
+
+	return ret;
+}
+
 /*
  * Each address family might have different locking rules, so we have
  * one slock key per address family:
diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
index 76f03ed..0725dc4 100644
--- a/net/ipv4/tcp.c
+++ b/net/ipv4/tcp.c
@@ -289,13 +289,6 @@ int sysctl_tcp_rmem[3] __read_mostly;
 EXPORT_SYMBOL(sysctl_tcp_rmem);
 EXPORT_SYMBOL(sysctl_tcp_wmem);
 
-atomic_long_t tcp_memory_allocated;	/* Current allocated memory. */
-
-/*
- * Current number of TCP sockets.
- */
-struct percpu_counter tcp_sockets_allocated;
-
 /*
  * TCP splice context
  */
@@ -305,13 +298,68 @@ struct tcp_splice_state {
 	unsigned int flags;
 };
 
+#ifdef CONFIG_CGROUP_KMEM
 /*
  * Pressure flag: try to collapse.
  * Technical note: it is used by multiple contexts non atomically.
  * All the __sk_mem_schedule() is of this nature: accounting
  * is strict, actions are advisory and have some latency.
  */
-int tcp_memory_pressure __read_mostly;
+void tcp_enter_memory_pressure(struct sock *sk)
+{
+	struct kmem_cgroup *sg = sk->sk_cgrp;
+	if (!sg->tcp_memory_pressure) {
+		NET_INC_STATS(sock_net(sk), LINUX_MIB_TCPMEMORYPRESSURES);
+		sg->tcp_memory_pressure = 1;
+	}
+}
+
+long *tcp_sysctl_mem(struct kmem_cgroup *sg)
+{
+	return sg->tcp_prot_mem;
+}
+
+atomic_long_t *memory_allocated_tcp(struct kmem_cgroup *sg)
+{
+	return &(sg->tcp_memory_allocated);
+}
+
+int tcp_init_cgroup(struct cgroup *cgrp, struct cgroup_subsys *ss)
+{
+	struct kmem_cgroup *sg = kcg_from_cgroup(cgrp);
+	unsigned long limit;
+	struct net *net = current->nsproxy->net_ns;
+
+	sg->tcp_memory_pressure = 0;
+	atomic_long_set(&sg->tcp_memory_allocated, 0);
+	percpu_counter_init(&sg->tcp_sockets_allocated, 0);
+
+	limit = nr_free_buffer_pages() / 8;
+	limit = max(limit, 128UL);
+
+	sg->tcp_prot_mem[0] = net->ipv4.sysctl_tcp_mem[0];
+	sg->tcp_prot_mem[1] = net->ipv4.sysctl_tcp_mem[1];
+	sg->tcp_prot_mem[2] = net->ipv4.sysctl_tcp_mem[2];
+
+	return 0;
+}
+EXPORT_SYMBOL(tcp_init_cgroup);
+
+int *memory_pressure_tcp(struct kmem_cgroup *sg)
+{
+	return &sg->tcp_memory_pressure;
+}
+
+struct percpu_counter *sockets_allocated_tcp(struct kmem_cgroup *sg)
+{
+	return &sg->tcp_sockets_allocated;
+}
+#else
+
+/* Current number of TCP sockets. */
+struct percpu_counter tcp_sockets_allocated;
+atomic_long_t tcp_memory_allocated;	/* Current allocated memory. */
+int tcp_memory_pressure;
 
 int *memory_pressure_tcp(struct kmem_cgroup *sg)
 {
@@ -340,6 +388,7 @@ atomic_long_t *memory_allocated_tcp(struct kmem_cgroup *sg)
 {
 	return &tcp_memory_allocated;
 }
+#endif /* CONFIG_CGROUP_KMEM */
 
 EXPORT_SYMBOL(memory_pressure_tcp);
 EXPORT_SYMBOL(sockets_allocated_tcp);
@@ -3247,7 +3296,9 @@ void __init tcp_init(void)
 
 	BUILD_BUG_ON(sizeof(struct tcp_skb_cb) > sizeof(skb->cb));
 
+#ifndef CONFIG_CGROUP_KMEM
 	percpu_counter_init(&tcp_sockets_allocated, 0);
+#endif
 	percpu_counter_init(&tcp_orphan_count, 0);
 	tcp_hashinfo.bind_bucket_cachep =
 		kmem_cache_create("tcp_bind_bucket",
-- 
1.7.6

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Fight unfair telecom internet charges in Canada: sign http://stopthemeter.ca/
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply related	[flat|nested] 59+ messages in thread

* [PATCH v2 7/9] tcp buffer limitation: per-cgroup limit
  2011-09-07  4:23 ` Glauber Costa
@ 2011-09-07  4:23   ` Glauber Costa
  -1 siblings, 0 replies; 59+ messages in thread
From: Glauber Costa @ 2011-09-07  4:23 UTC (permalink / raw)
  To: linux-kernel
  Cc: linux-mm, containers, netdev, xemul, Glauber Costa,
	David S. Miller, Hiroyouki Kamezawa, Eric W. Biederman

This patch uses the "tcp_max_memory" field of the kmem_cgroup to
control the amount of kernel memory pinned by a cgroup.

We have to make sure that none of the memory pressure thresholds
specified in the namespace is bigger than the current cgroup's limit.

Signed-off-by: Glauber Costa <glommer@parallels.com>
CC: David S. Miller <davem@davemloft.net>
CC: Hiroyouki Kamezawa <kamezawa.hiroyu@jp.fujitsu.com>
CC: Eric W. Biederman <ebiederm@xmission.com>
---
 include/linux/kmem_cgroup.h |    1 +
 net/ipv4/sysctl_net_ipv4.c  |    8 ++++++
 net/ipv4/tcp.c              |   56 ++++++++++++++++++++++++++++++++++++++++++-
 3 files changed, 64 insertions(+), 1 deletions(-)
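
Reader's note: two rules are enforced below. A write to kmem.tcp_maxmem is
clamped to the parent's limit and then caps each effective threshold, and a
write to the tcp_mem sysctl is rejected if any value exceeds the group's
limit. The following is a userspace sketch of both rules, assuming plain
long values throughout (the patch itself mixes int, long and u64). The
names group_limits, set_max_memory and set_sysctl_mem are hypothetical.

```c
/* Hypothetical stand-in for the limit-related fields of struct kmem_cgroup. */
struct group_limits {
	struct group_limits *parent;  /* null for the root group */
	long max_memory;              /* cap for this group */
	long prot_mem[3];             /* effective thresholds for this group */
};

static long min_long(long a, long b)
{
	return a < b ? a : b;
}

/*
 * Follows the logic of tcp_write_maxmem(): clamp the requested value to
 * the parent's cap, store it, then cap each threshold at the new value.
 */
static void set_max_memory(struct group_limits *g, long val,
			   const long sysctl_mem[3])
{
	int i;

	if (g->parent && val > g->parent->max_memory)
		val = g->parent->max_memory;

	g->max_memory = val;
	for (i = 0; i < 3; i++)
		g->prot_mem[i] = min_long(val, sysctl_mem[i]);
}

/*
 * Follows the check added to ipv4_tcp_mem(): reject the whole write if
 * any value exceeds the group's cap, otherwise update both copies.
 * Returns 0 on success, -1 on rejection (the patch returns -EINVAL).
 */
static int set_sysctl_mem(struct group_limits *g, long sysctl_mem[3],
			  const long vec[3])
{
	int i;

	for (i = 0; i < 3; i++)
		if (vec[i] > g->max_memory)
			return -1;

	for (i = 0; i < 3; i++) {
		sysctl_mem[i] = vec[i];
		g->prot_mem[i] = vec[i];
	}
	return 0;
}
```

Checking only the direct parent is sufficient as long as every group's limit
was itself clamped when it was written.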

diff --git a/include/linux/kmem_cgroup.h b/include/linux/kmem_cgroup.h
index 89ad0a1..57a432b 100644
--- a/include/linux/kmem_cgroup.h
+++ b/include/linux/kmem_cgroup.h
@@ -26,6 +26,7 @@ struct kmem_cgroup {
 
 #ifdef CONFIG_INET
 	int tcp_memory_pressure;
+	int tcp_max_memory;
 	atomic_long_t tcp_memory_allocated;
 	struct percpu_counter tcp_sockets_allocated;
 	long tcp_prot_mem[3];
diff --git a/net/ipv4/sysctl_net_ipv4.c b/net/ipv4/sysctl_net_ipv4.c
index 0d74b9d..5e89480 100644
--- a/net/ipv4/sysctl_net_ipv4.c
+++ b/net/ipv4/sysctl_net_ipv4.c
@@ -14,6 +14,7 @@
 #include <linux/init.h>
 #include <linux/slab.h>
 #include <linux/nsproxy.h>
+#include <linux/kmem_cgroup.h>
 #include <linux/swap.h>
 #include <net/snmp.h>
 #include <net/icmp.h>
@@ -181,6 +182,7 @@ static int ipv4_tcp_mem(ctl_table *ctl, int write,
 {
 	int ret;
 	unsigned long vec[3];
+	struct kmem_cgroup *kmem = kcg_from_task(current);
 	struct net *net = current->nsproxy->net_ns;
 	int i;
 
@@ -200,7 +202,13 @@ static int ipv4_tcp_mem(ctl_table *ctl, int write,
 		return ret;
 
 	for (i = 0; i < 3; i++)
+		if (vec[i] > kmem->tcp_max_memory)
+			return -EINVAL;
+
+	for (i = 0; i < 3; i++) {
 		net->ipv4.sysctl_tcp_mem[i] = vec[i];
+		kmem->tcp_prot_mem[i] = net->ipv4.sysctl_tcp_mem[i];
+	}
 
 	return 0;
 }
diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
index 0725dc4..e1918fa 100644
--- a/net/ipv4/tcp.c
+++ b/net/ipv4/tcp.c
@@ -324,6 +324,55 @@ atomic_long_t *memory_allocated_tcp(struct kmem_cgroup *sg)
 	return &(sg->tcp_memory_allocated);
 }
 
+static int tcp_write_maxmem(struct cgroup *cgrp, struct cftype *cft, u64 val)
+{
+	struct kmem_cgroup *sg = kcg_from_cgroup(cgrp);
+	struct net *net = current->nsproxy->net_ns;
+	int i;
+
+	if (!cgroup_lock_live_group(cgrp))
+		return -ENODEV;
+
+	/*
+	 * We can't allow more memory than our parent. Since this
+	 * is checked on every write, by induction there is no need
+	 * to test any ancestor other than our direct parent.
+	 */
+	if (sg->parent && (val > sg->parent->tcp_max_memory))
+		val = sg->parent->tcp_max_memory;
+
+	sg->tcp_max_memory = val;
+
+	for (i = 0; i < 3; i++)
+		sg->tcp_prot_mem[i]  = min_t(long, val,
+					     net->ipv4.sysctl_tcp_mem[i]);
+
+	cgroup_unlock();
+
+	return 0;
+}
+
+static u64 tcp_read_maxmem(struct cgroup *cgrp, struct cftype *cft)
+{
+	struct kmem_cgroup *sg = kcg_from_cgroup(cgrp);
+	u64 ret;
+
+	if (!cgroup_lock_live_group(cgrp))
+		return -ENODEV;
+	ret = sg->tcp_max_memory;
+
+	cgroup_unlock();
+	return ret;
+}
+
+static struct cftype tcp_files[] = {
+	{
+		.name = "tcp_maxmem",
+		.write_u64 = tcp_write_maxmem,
+		.read_u64 = tcp_read_maxmem,
+	},
+};
+
 int tcp_init_cgroup(struct cgroup *cgrp, struct cgroup_subsys *ss)
 {
 	struct kmem_cgroup *sg = kcg_from_cgroup(cgrp);
@@ -337,11 +386,16 @@ int tcp_init_cgroup(struct cgroup *cgrp, struct cgroup_subsys *ss)
 	limit = nr_free_buffer_pages() / 8;
 	limit = max(limit, 128UL);
 
+	if (sg->parent)
+		sg->tcp_max_memory = sg->parent->tcp_max_memory;
+	else
+		sg->tcp_max_memory = limit * 2;
+
 	sg->tcp_prot_mem[0] = net->ipv4.sysctl_tcp_mem[0];
 	sg->tcp_prot_mem[1] = net->ipv4.sysctl_tcp_mem[1];
 	sg->tcp_prot_mem[2] = net->ipv4.sysctl_tcp_mem[2];
 
-	return 0;
+	return cgroup_add_files(cgrp, ss, tcp_files, ARRAY_SIZE(tcp_files));
 }
 EXPORT_SYMBOL(tcp_init_cgroup);
 
-- 
1.7.6


^ permalink raw reply related	[flat|nested] 59+ messages in thread

* [PATCH v2 8/9] Display current tcp memory allocation in kmem cgroup
  2011-09-07  4:23 ` Glauber Costa
@ 2011-09-07  4:23   ` Glauber Costa
  -1 siblings, 0 replies; 59+ messages in thread
From: Glauber Costa @ 2011-09-07  4:23 UTC (permalink / raw)
  To: linux-kernel
  Cc: linux-mm, containers, netdev, xemul, Glauber Costa,
	David S. Miller, Hiroyouki Kamezawa, Eric W. Biederman

This patch introduces the kmem.tcp_current_memory file, living in the
kmem_cgroup filesystem. It is a simple read-only file that displays the
amount of kernel memory currently consumed by TCP sockets in the cgroup.

Signed-off-by: Glauber Costa <glommer@parallels.com>
CC: David S. Miller <davem@davemloft.net>
CC: Hiroyouki Kamezawa <kamezawa.hiroyu@jp.fujitsu.com>
CC: Eric W. Biederman <ebiederm@xmission.com>
---
 net/ipv4/tcp.c |   17 +++++++++++++++++
 1 files changed, 17 insertions(+), 0 deletions(-)
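
Reader's note: since the new file exposes a single u64, a monitoring tool
can read it like any other single-value cgroup control file. A minimal
userspace sketch follows; the helper name read_cgroup_u64 is hypothetical,
and the path passed to it depends on where the kmem cgroup is mounted.

```c
#include <stdio.h>

/*
 * Read a single unsigned decimal value from a cgroup control file.
 * Returns 0 on success, -1 if the file cannot be opened or parsed.
 */
static int read_cgroup_u64(const char *path, unsigned long long *out)
{
	FILE *f = fopen(path, "r");
	int ok;

	if (!f)
		return -1;

	ok = (fscanf(f, "%llu", out) == 1);
	fclose(f);
	return ok ? 0 : -1;
}
```

A caller would pass something like
"/cgroup/kmem/container1/kmem.tcp_current_memory", where both the mount
point and the group name are assumptions made for illustration.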

diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
index e1918fa..ff5c0e0 100644
--- a/net/ipv4/tcp.c
+++ b/net/ipv4/tcp.c
@@ -365,12 +365,29 @@ static u64 tcp_read_maxmem(struct cgroup *cgrp, struct cftype *cft)
 	return ret;
 }
 
+static u64 tcp_read_curmem(struct cgroup *cgrp, struct cftype *cft)
+{
+	struct kmem_cgroup *sg = kcg_from_cgroup(cgrp);
+	u64 ret;
+
+	if (!cgroup_lock_live_group(cgrp))
+		return -ENODEV;
+	ret = atomic_long_read(&sg->tcp_memory_allocated);
+
+	cgroup_unlock();
+	return ret;
+}
+
 static struct cftype tcp_files[] = {
 	{
 		.name = "tcp_maxmem",
 		.write_u64 = tcp_write_maxmem,
 		.read_u64 = tcp_read_maxmem,
 	},
+	{
+		.name = "tcp_current_memory",
+		.read_u64 = tcp_read_curmem,
+	},
 };
 
 int tcp_init_cgroup(struct cgroup *cgrp, struct cgroup_subsys *ss)
-- 
1.7.6


^ permalink raw reply related	[flat|nested] 59+ messages in thread

* [PATCH v2 9/9] Add documentation about kmem_cgroup
  2011-09-07  4:23 ` Glauber Costa
@ 2011-09-07  4:23   ` Glauber Costa
  -1 siblings, 0 replies; 59+ messages in thread
From: Glauber Costa @ 2011-09-07  4:23 UTC (permalink / raw)
  To: linux-kernel
  Cc: linux-mm, containers, netdev, xemul, Glauber Costa,
	David S. Miller, Hiroyouki Kamezawa, Eric W. Biederman,
	Randy Dunlap

Signed-off-by: Glauber Costa <glommer@parallels.com>
CC: David S. Miller <davem@davemloft.net>
CC: Hiroyouki Kamezawa <kamezawa.hiroyu@jp.fujitsu.com>
CC: Eric W. Biederman <ebiederm@xmission.com>
CC: Randy Dunlap <rdunlap@xenotime.net>
---
 Documentation/cgroups/kmem_cgroups.txt |   27 +++++++++++++++++++++++++++
 1 files changed, 27 insertions(+), 0 deletions(-)
 create mode 100644 Documentation/cgroups/kmem_cgroups.txt
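
Reader's note: the documented kmem.tcp_maxmem file takes a single number,
so setting a limit from a management tool amounts to one write. A minimal
userspace sketch follows; the helper name write_cgroup_u64 is hypothetical.
The text below gives the unit as bytes, while patch 7/9 compares the value
directly against the tcp_mem thresholds, which the kernel expresses in
pages, so the unit is worth double-checking before relying on it.

```c
#include <stdio.h>

/*
 * Write an unsigned decimal value to a cgroup control file.
 * Returns 0 on success, -1 if the file cannot be opened or written.
 */
static int write_cgroup_u64(const char *path, unsigned long long value)
{
	FILE *f = fopen(path, "w");
	int ok;

	if (!f)
		return -1;

	ok = (fprintf(f, "%llu\n", value) > 0);
	/* fclose() flushes the buffer, so a rejected write surfaces here. */
	if (fclose(f) != 0)
		ok = 0;
	return ok ? 0 : -1;
}
```

A caller would pass something like
"/cgroup/kmem/container1/kmem.tcp_maxmem", where both the mount point and
the group name are assumptions made for illustration.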

diff --git a/Documentation/cgroups/kmem_cgroups.txt b/Documentation/cgroups/kmem_cgroups.txt
new file mode 100644
index 0000000..930e069
--- /dev/null
+++ b/Documentation/cgroups/kmem_cgroups.txt
@@ -0,0 +1,27 @@
+Kernel Memory Cgroup
+====================
+
+This document briefly describes the kernel memory cgroup, or "kmem cgroup".
+Unlike user memory, kernel memory cannot be swapped. This effectively means
+that rogue processes can start operations that pin kernel objects permanently
+into memory, exhausting resources of all other processes in the system.
+
+The main goal of kmem_cgroup is to control the amount of memory a group of
+processes can pin at any given point in time. Other uses of this infrastructure
+are expected to come up with time. Right now, the only resources effectively
+limited are TCP send and receive buffers.
+
+TCP network buffers
+===================
+
+TCP network buffers, both on the send and receive sides, can be controlled
+by the kmem cgroup. Once a socket is created, it is attached to the cgroup of
+the controller process, where it stays until the end of its lifetime.
+
+Files
+=====
+	kmem.tcp_maxmem: controls the maximum amount in bytes that can be used by
+	tcp sockets inside the cgroup.
+
+	kmem.tcp_current_memory: current amount in bytes used by all sockets in
+	this cgroup.
-- 
1.7.6

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Fight unfair telecom internet charges in Canada: sign http://stopthemeter.ca/
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply related	[flat|nested] 59+ messages in thread

* Re: [PATCH v2 2/9] Kernel Memory cgroup
  2011-09-07  4:23   ` Glauber Costa
@ 2011-09-07  5:24     ` Paul Menage
  -1 siblings, 0 replies; 59+ messages in thread
From: Paul Menage @ 2011-09-07  5:24 UTC (permalink / raw)
  To: Glauber Costa
  Cc: linux-kernel, linux-mm, containers, netdev, xemul,
	David S. Miller, Hiroyouki Kamezawa, Eric W. Biederman

On Tue, Sep 6, 2011 at 9:23 PM, Glauber Costa <glommer@parallels.com> wrote:
> +
> +struct kmem_cgroup {
> +       struct cgroup_subsys_state css;
> +       struct kmem_cgroup *parent;
> +};

There's a parent pointer in css.cgroup, so you shouldn't need a
separate one here.
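
A minimal userspace sketch of the point above, assuming heavily simplified
stand-ins for the kernel structures: only the css -> cgroup -> parent chain
is modelled. The kmem_css field and the helpers kcg_from_css() and
kcg_parent() are hypothetical names used for illustration, not part of the
patch or of the cgroup API.

```c
#include <assert.h>
#include <stddef.h>

/* Simplified stand-ins for the kernel types (assumption: only the fields
 * needed to show the pointer chain are modelled). */
struct cgroup;

struct cgroup_subsys_state {
	struct cgroup *cgroup;			/* cgroup this state belongs to */
};

struct cgroup {
	struct cgroup *parent;			/* NULL for the root cgroup */
	struct cgroup_subsys_state *kmem_css;	/* hypothetical: single subsystem slot */
};

struct kmem_cgroup {
	struct cgroup_subsys_state css;		/* no separate parent pointer */
};

/* Recover the enclosing kmem_cgroup from its embedded css. */
static inline struct kmem_cgroup *kcg_from_css(struct cgroup_subsys_state *css)
{
	assert(css != NULL);
	return (struct kmem_cgroup *)((char *)css -
				      offsetof(struct kmem_cgroup, css));
}

/* Hypothetical helper: reach the parent through css.cgroup->parent. */
static inline struct kmem_cgroup *kcg_parent(struct kmem_cgroup *kcg)
{
	struct cgroup *pcgrp = kcg->css.cgroup->parent;

	if (!pcgrp)
		return NULL;			/* already at the top level */
	return kcg_from_css(pcgrp->kmem_css);
}
```

With this layout the hierarchy lives in one place (struct cgroup), so the
subsystem state cannot disagree with it.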

Most cgroup subsystems define this structure (and the below accessor
functions) in their .c file rather than exposing it to the world? Does
this subsystem particularly need it exposed?

> +
> +static struct cgroup_subsys_state *kmem_create(
> +       struct cgroup_subsys *ss, struct cgroup *cgrp)
> +{
> +       struct kmem_cgroup *sk = kzalloc(sizeof(*sk), GFP_KERNEL);

kcg or just cg would be a better name?

Paul

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [PATCH v2 3/9] socket: initial cgroup code.
  2011-09-07  4:23   ` Glauber Costa
@ 2011-09-07  5:26     ` Paul Menage
  -1 siblings, 0 replies; 59+ messages in thread
From: Paul Menage @ 2011-09-07  5:26 UTC (permalink / raw)
  To: Glauber Costa
  Cc: linux-kernel, linux-mm, containers, netdev, xemul,
	David S. Miller, Hiroyouki Kamezawa, Eric W. Biederman

On Tue, Sep 6, 2011 at 9:23 PM, Glauber Costa <glommer@parallels.com> wrote:
> We aim to control the amount of kernel memory pinned at any
> time by tcp sockets. To lay the foundations for this work,
> this patch adds a pointer to the kmem_cgroup to the socket
> structure.
>
> Signed-off-by: Glauber Costa <glommer@parallels.com>
> CC: David S. Miller <davem@davemloft.net>
> CC: Hiroyouki Kamezawa <kamezawa.hiroyu@jp.fujitsu.com>
> CC: Eric W. Biederman <ebiederm@xmission.com>
> ---
>  include/linux/kmem_cgroup.h |   29 +++++++++++++++++++++++++++++
>  include/net/sock.h          |    2 ++
>  net/core/sock.c             |    5 ++---
>  3 files changed, 33 insertions(+), 3 deletions(-)
>
> diff --git a/include/linux/kmem_cgroup.h b/include/linux/kmem_cgroup.h
> index 0e4a74b..77076d8 100644
> --- a/include/linux/kmem_cgroup.h
> +++ b/include/linux/kmem_cgroup.h
> @@ -49,5 +49,34 @@ static inline struct kmem_cgroup *kcg_from_task(struct task_struct *tsk)
>        return NULL;
>  }
>  #endif /* CONFIG_CGROUP_KMEM */
> +
> +#ifdef CONFIG_INET
> +#include <net/sock.h>
> +static inline void sock_update_kmem_cgrp(struct sock *sk)
> +{
> +#ifdef CONFIG_CGROUP_KMEM
> +       sk->sk_cgrp = kcg_from_task(current);

BUG_ON(sk->sk_cgrp) ? Or else release the old cgroup if necessary.

> @@ -339,6 +340,7 @@ struct sock {
>  #endif
>        __u32                   sk_mark;
>        u32                     sk_classid;
> +       struct kmem_cgroup      *sk_cgrp;

Should this be protected by a #ifdef?

Paul

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [PATCH v2 2/9] Kernel Memory cgroup
  2011-09-07  5:24     ` Paul Menage
  (?)
@ 2011-09-07  5:55       ` Glauber Costa
  -1 siblings, 0 replies; 59+ messages in thread
From: Glauber Costa @ 2011-09-07  5:55 UTC (permalink / raw)
  To: Paul Menage
  Cc: linux-kernel, linux-mm, containers, netdev, xemul,
	David S. Miller, Hiroyouki Kamezawa, Eric W. Biederman

On 09/07/2011 02:24 AM, Paul Menage wrote:
> On Tue, Sep 6, 2011 at 9:23 PM, Glauber Costa<glommer@parallels.com>  wrote:
>> +
>> +struct kmem_cgroup {
>> +       struct cgroup_subsys_state css;
>> +       struct kmem_cgroup *parent;
>> +};
>
> There's a parent pointer in css.cgroup, so you shouldn't need a
> separate one here.

Ok, I missed that. Thanks

> Most cgroup subsystems define this structure (and the below accessor
> functions) in their .c file rather than exposing it to the world? Does
> this subsystem particularly need it exposed?

Originally I was using it in sock.c and friends. Between the last
submission and this one, most of those uses were replaced. The
accessors, however, are in kmem_cgroup.h, because I want most of
them to be inline.

>> +
>> +static struct cgroup_subsys_state *kmem_create(
>> +       struct cgroup_subsys *ss, struct cgroup *cgrp)
>> +{
>> +       struct kmem_cgroup *sk = kzalloc(sizeof(*sk), GFP_KERNEL);
>
> kcg or just cg would be a better name?

I'll go with kcg.

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [PATCH v2 3/9] socket: initial cgroup code.
  2011-09-07  5:26     ` Paul Menage
  (?)
@ 2011-09-07  5:59       ` Glauber Costa
  -1 siblings, 0 replies; 59+ messages in thread
From: Glauber Costa @ 2011-09-07  5:59 UTC (permalink / raw)
  To: Paul Menage
  Cc: linux-kernel, linux-mm, containers, netdev, xemul,
	David S. Miller, Hiroyouki Kamezawa, Eric W. Biederman

On 09/07/2011 02:26 AM, Paul Menage wrote:
> On Tue, Sep 6, 2011 at 9:23 PM, Glauber Costa<glommer@parallels.com>  wrote:
>> We aim to control the amount of kernel memory pinned at any
>> time by tcp sockets. To lay the foundations for this work,
>> this patch adds a pointer to the kmem_cgroup to the socket
>> structure.
>>
>> Signed-off-by: Glauber Costa<glommer@parallels.com>
>> CC: David S. Miller<davem@davemloft.net>
>> CC: Hiroyouki Kamezawa<kamezawa.hiroyu@jp.fujitsu.com>
>> CC: Eric W. Biederman<ebiederm@xmission.com>
>> ---
>>   include/linux/kmem_cgroup.h |   29 +++++++++++++++++++++++++++++
>>   include/net/sock.h          |    2 ++
>>   net/core/sock.c             |    5 ++---
>>   3 files changed, 33 insertions(+), 3 deletions(-)
>>
>> diff --git a/include/linux/kmem_cgroup.h b/include/linux/kmem_cgroup.h
>> index 0e4a74b..77076d8 100644
>> --- a/include/linux/kmem_cgroup.h
>> +++ b/include/linux/kmem_cgroup.h
>> @@ -49,5 +49,34 @@ static inline struct kmem_cgroup *kcg_from_task(struct task_struct *tsk)
>>         return NULL;
>>   }
>>   #endif /* CONFIG_CGROUP_KMEM */
>> +
>> +#ifdef CONFIG_INET
>> +#include<net/sock.h>
>> +static inline void sock_update_kmem_cgrp(struct sock *sk)
>> +{
>> +#ifdef CONFIG_CGROUP_KMEM
>> +       sk->sk_cgrp = kcg_from_task(current);
>
> BUG_ON(sk->sk_cgrp) ? Or else release the old cgroup if necessary.

Since at least in this current incarnation, I am not doing migrations,
I definitely don't expect to have a pointer already present here.
BUG_ON() it is.

>> @@ -339,6 +340,7 @@ struct sock {
>>   #endif
>>         __u32                   sk_mark;
>>         u32                     sk_classid;
>> +       struct kmem_cgroup      *sk_cgrp;
>
> Should this be protected by a #ifdef?
I don't particularly like it. I think that ifdef'ing fields
in structures, while allowing for size optimization, takes away
size and alignment predictability. But... can do.
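
To make the trade-off concrete, here is a small sketch, assuming a
hypothetical miniature of struct sock with only the neighbouring fields.
The two structs stand for the same definition built with and without
CONFIG_CGROUP_KMEM; the names sock_with_cgrp and sock_without_cgrp are
made up for the comparison.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

struct kmem_cgroup;	/* opaque here; only a pointer to it is stored */

/* Layout when the field is always present (or the option is enabled). */
struct sock_with_cgrp {
	uint32_t		sk_mark;
	uint32_t		sk_classid;
	struct kmem_cgroup	*sk_cgrp;
	void			(*sk_state_change)(void *sk);
};

/* Layout when the field is compiled out by an #ifdef. */
struct sock_without_cgrp {
	uint32_t		sk_mark;
	uint32_t		sk_classid;
	void			(*sk_state_change)(void *sk);
};

/* Every member after the conditional field moves with the config option. */
static_assert(offsetof(struct sock_with_cgrp, sk_state_change) >
	      offsetof(struct sock_without_cgrp, sk_state_change),
	      "fields after sk_cgrp shift when it is compiled in");
```

The #ifdef saves one pointer per socket when the option is off, at the cost
of every later offset depending on the configuration.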

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [PATCH v2 6/9] per-cgroup tcp buffers control
  2011-09-07  4:23   ` Glauber Costa
  (?)
@ 2011-09-07  7:32   ` Li Zefan
  2011-09-07 13:02       ` Glauber Costa
  -1 siblings, 1 reply; 59+ messages in thread
From: Li Zefan @ 2011-09-07  7:32 UTC (permalink / raw)
  To: Glauber Costa
  Cc: linux-kernel, xemul, netdev, linux-mm, Eric W. Biederman,
	containers, David S. Miller

> +#ifdef CONFIG_INET
> +#include <net/sock.h>
> +static inline void sock_update_kmem_cgrp(struct sock *sk)
> +{
> +#ifdef CONFIG_CGROUP_KMEM
> +	sk->sk_cgrp = kcg_from_task(current);
> +
> +	/*
> +	 * We don't need to protect against anything task-related, because
> +	 * we are basically stuck with the sock pointer that won't change,
> +	 * even if the task that originated the socket changes cgroups.
> +	 *
> +	 * What we do have to guarantee, is that the chain leading us to
> +	 * the top level won't change under our noses. Incrementing the
> +	 * reference count via cgroup_exclude_rmdir guarantees that.
> +	 */
> +	cgroup_exclude_rmdir(&sk->sk_cgrp->css);
> +#endif

must be protected by rcu_read_lock.

> +}
> +
> +static inline void sock_release_kmem_cgrp(struct sock *sk)
> +{
> +#ifdef CONFIG_CGROUP_KMEM
> +	cgroup_release_and_wakeup_rmdir(&sk->sk_cgrp->css);
> +#endif
> +}

Ugly. Just use the way you define kcg_from_task().

> +
> +#endif /* CONFIG_INET */
>  #endif /* _LINUX_KMEM_CGROUP_H */
>  
> diff --git a/include/net/sock.h b/include/net/sock.h
> index 8e4062f..709382f 100644
> --- a/include/net/sock.h
> +++ b/include/net/sock.h
> @@ -228,6 +228,7 @@ struct sock_common {
>    *	@sk_security: used by security modules
>    *	@sk_mark: generic packet mark
>    *	@sk_classid: this socket's cgroup classid
> +  *	@sk_cgrp: this socket's kernel memory (kmem) cgroup 
>    *	@sk_write_pending: a write to stream socket waits to start
>    *	@sk_state_change: callback to indicate change in the state of the sock
>    *	@sk_data_ready: callback to indicate there is data to be processed
> @@ -339,6 +340,7 @@ struct sock {
>  #endif
>  	__u32			sk_mark;
>  	u32			sk_classid;
> +	struct kmem_cgroup	*sk_cgrp;
>  	void			(*sk_state_change)(struct sock *sk);
>  	void			(*sk_data_ready)(struct sock *sk, int bytes);
>  	void			(*sk_write_space)(struct sock *sk);
> diff --git a/net/core/sock.c b/net/core/sock.c
> index 3449df8..7109864 100644
> --- a/net/core/sock.c
> +++ b/net/core/sock.c
> @@ -1139,6 +1139,7 @@ struct sock *sk_alloc(struct net *net, int family, gfp_t priority,
>  		atomic_set(&sk->sk_wmem_alloc, 1);
>  
>  		sock_update_classid(sk);
> +		sock_update_kmem_cgrp(sk);
>  	}
>  
>  	return sk;
> @@ -1170,6 +1171,7 @@ static void __sk_free(struct sock *sk)
>  		put_cred(sk->sk_peer_cred);
>  	put_pid(sk->sk_peer_pid);
>  	put_net(sock_net(sk));
> +	sock_release_kmem_cgrp(sk);
>  	sk_prot_free(sk->sk_prot_creator, sk);
>  }
>  
> @@ -2252,9 +2254,6 @@ void sk_common_release(struct sock *sk)
>  }
>  EXPORT_SYMBOL(sk_common_release);
>  
> -static DEFINE_RWLOCK(proto_list_lock);
> -static LIST_HEAD(proto_list);
> -

compile error.

you should do compile test after each single patch.

>  #ifdef CONFIG_PROC_FS
>  #define PROTO_INUSE_NR	64	/* should be enough for the first time */
>  struct prot_inuse {

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [PATCH v2 6/9] per-cgroup tcp buffers control
  2011-09-07  7:32   ` Li Zefan
  2011-09-07 13:02       ` Glauber Costa
@ 2011-09-07 13:02       ` Glauber Costa
  0 siblings, 0 replies; 59+ messages in thread
From: Glauber Costa @ 2011-09-07 13:02 UTC (permalink / raw)
  To: Li Zefan
  Cc: linux-kernel, xemul, netdev, linux-mm, Eric W. Biederman,
	containers, David S. Miller

On 09/07/2011 04:32 AM, Li Zefan wrote:
>> +#ifdef CONFIG_INET
>> +#include<net/sock.h>
>> +static inline void sock_update_kmem_cgrp(struct sock *sk)
>> +{
>> +#ifdef CONFIG_CGROUP_KMEM
>> +	sk->sk_cgrp = kcg_from_task(current);
>> +
>> +	/*
>> +	 * We don't need to protect against anything task-related, because
>> +	 * we are basically stuck with the sock pointer that won't change,
>> +	 * even if the task that originated the socket changes cgroups.
>> +	 *
>> +	 * What we do have to guarantee, is that the chain leading us to
>> +	 * the top level won't change under our noses. Incrementing the
>> +	 * reference count via cgroup_exclude_rmdir guarantees that.
>> +	 */
>> +	cgroup_exclude_rmdir(&sk->sk_cgrp->css);
>> +#endif
>
> must be protected by rcu_read_lock.

Ok.

>> +}
>> +
>> +static inline void sock_release_kmem_cgrp(struct sock *sk)
>> +{
>> +#ifdef CONFIG_CGROUP_KMEM
>> +	cgroup_release_and_wakeup_rmdir(&sk->sk_cgrp->css);
>> +#endif
>> +}
>
> Ugly. Just use the way you define kcg_from_task().
Disagree.
This releases the pointer from the socket, not the task.
Actually, one of the assumptions I am making here is that the cgroup
of the socket won't change, even if the task does change cgroups. Getting
the pointer from the task breaks this. Without this assumption, the code
would be much more complicated, since we'd have to unbill the accounted
memory every time we migrate tasks, and bill it again to the new cgroup.
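
The assumption can be illustrated with a small userspace model. This is a
sketch only: the struct layouts and the helpers model_sock_bind(),
model_sock_charge() and model_sock_uncharge() are hypothetical and merely
borrow field names from the thread. The point is that charges follow the
pointer stored in the socket, so a task migration needs no rebilling.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical model; not the kernel structures. */
struct kmem_cgroup {
	long tcp_memory_allocated;	/* amount currently charged to the group */
};

struct task {
	struct kmem_cgroup *kcg;	/* changes when the task migrates */
};

struct sock {
	struct kmem_cgroup *sk_cgrp;	/* fixed for the socket's lifetime */
};

/* Bind the socket to the creating task's cgroup, exactly once. */
static inline void model_sock_bind(struct sock *sk, const struct task *creator)
{
	assert(sk->sk_cgrp == NULL);	/* stands in for BUG_ON(sk->sk_cgrp) */
	sk->sk_cgrp = creator->kcg;
}

/* Charges go through the socket's pointer, never through the task. */
static inline void model_sock_charge(struct sock *sk, long amt)
{
	sk->sk_cgrp->tcp_memory_allocated += amt;
}

static inline void model_sock_uncharge(struct sock *sk, long amt)
{
	sk->sk_cgrp->tcp_memory_allocated -= amt;
}
```

If the release path looked up the cgroup from the task instead, a socket
created in one group and freed after a migration would uncharge the wrong
group.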


>
>> +
>> +#endif /* CONFIG_INET */
>>   #endif /* _LINUX_KMEM_CGROUP_H */
>>
>> diff --git a/include/net/sock.h b/include/net/sock.h
>> index 8e4062f..709382f 100644
>> --- a/include/net/sock.h
>> +++ b/include/net/sock.h
>> @@ -228,6 +228,7 @@ struct sock_common {
>>     *	@sk_security: used by security modules
>>     *	@sk_mark: generic packet mark
>>     *	@sk_classid: this socket's cgroup classid
>> +  *	@sk_cgrp: this socket's kernel memory (kmem) cgroup
>>     *	@sk_write_pending: a write to stream socket waits to start
>>     *	@sk_state_change: callback to indicate change in the state of the sock
>>     *	@sk_data_ready: callback to indicate there is data to be processed
>> @@ -339,6 +340,7 @@ struct sock {
>>   #endif
>>   	__u32			sk_mark;
>>   	u32			sk_classid;
>> +	struct kmem_cgroup	*sk_cgrp;
>>   	void			(*sk_state_change)(struct sock *sk);
>>   	void			(*sk_data_ready)(struct sock *sk, int bytes);
>>   	void			(*sk_write_space)(struct sock *sk);
>> diff --git a/net/core/sock.c b/net/core/sock.c
>> index 3449df8..7109864 100644
>> --- a/net/core/sock.c
>> +++ b/net/core/sock.c
>> @@ -1139,6 +1139,7 @@ struct sock *sk_alloc(struct net *net, int family, gfp_t priority,
>>   		atomic_set(&sk->sk_wmem_alloc, 1);
>>
>>   		sock_update_classid(sk);
>> +		sock_update_kmem_cgrp(sk);
>>   	}
>>
>>   	return sk;
>> @@ -1170,6 +1171,7 @@ static void __sk_free(struct sock *sk)
>>   		put_cred(sk->sk_peer_cred);
>>   	put_pid(sk->sk_peer_pid);
>>   	put_net(sock_net(sk));
>> +	sock_release_kmem_cgrp(sk);
>>   	sk_prot_free(sk->sk_prot_creator, sk);
>>   }
>>
>> @@ -2252,9 +2254,6 @@ void sk_common_release(struct sock *sk)
>>   }
>>   EXPORT_SYMBOL(sk_common_release);
>>
>> -static DEFINE_RWLOCK(proto_list_lock);
>> -static LIST_HEAD(proto_list);
>> -
>
> compile error.
>
> you should do compile test after each single patch.
Oops, thanks for spotting.

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [PATCH v2 3/9] socket: initial cgroup code.
  2011-09-07  4:23   ` Glauber Costa
@ 2011-09-07 22:17     ` Kirill A. Shutemov
  -1 siblings, 0 replies; 59+ messages in thread
From: Kirill A. Shutemov @ 2011-09-07 22:17 UTC (permalink / raw)
  To: Glauber Costa
  Cc: linux-kernel, xemul, netdev, linux-mm, Eric W. Biederman,
	containers, David S. Miller

On Wed, Sep 07, 2011 at 01:23:13AM -0300, Glauber Costa wrote:
> We aim to control the amount of kernel memory pinned at any
> time by tcp sockets. To lay the foundations for this work,
> this patch adds a pointer to the kmem_cgroup to the socket
> structure.
> 
> Signed-off-by: Glauber Costa <glommer@parallels.com>
> CC: David S. Miller <davem@davemloft.net>
> CC: Hiroyouki Kamezawa <kamezawa.hiroyu@jp.fujitsu.com>
> CC: Eric W. Biederman <ebiederm@xmission.com>
> ---
>  include/linux/kmem_cgroup.h |   29 +++++++++++++++++++++++++++++
>  include/net/sock.h          |    2 ++
>  net/core/sock.c             |    5 ++---
>  3 files changed, 33 insertions(+), 3 deletions(-)
> 
> diff --git a/include/linux/kmem_cgroup.h b/include/linux/kmem_cgroup.h
> index 0e4a74b..77076d8 100644
> --- a/include/linux/kmem_cgroup.h
> +++ b/include/linux/kmem_cgroup.h
> @@ -49,5 +49,34 @@ static inline struct kmem_cgroup *kcg_from_task(struct task_struct *tsk)
>  	return NULL;
>  }
>  #endif /* CONFIG_CGROUP_KMEM */
> +
> +#ifdef CONFIG_INET

Will it break something if you define the helpers even if CONFIG_INET
is not defined?
It will be much cleaner. You can reuse ifdef CONFIG_CGROUP_KMEM in this
case.

> +#include <net/sock.h>
> +static inline void sock_update_kmem_cgrp(struct sock *sk)
> +{
> +#ifdef CONFIG_CGROUP_KMEM
> +	sk->sk_cgrp = kcg_from_task(current);
> +
> +	/*
> +	 * We don't need to protect against anything task-related, because
> +	 * we are basically stuck with the sock pointer that won't change,
> +	 * even if the task that originated the socket changes cgroups.
> +	 *
> +	 * What we do have to guarantee, is that the chain leading us to
> +	 * the top level won't change under our noses. Incrementing the
> +	 * reference count via cgroup_exclude_rmdir guarantees that.
> +	 */
> +	cgroup_exclude_rmdir(&sk->sk_cgrp->css);
> +#endif
> +}
> +
> +static inline void sock_release_kmem_cgrp(struct sock *sk)
> +{
> +#ifdef CONFIG_CGROUP_KMEM
> +	cgroup_release_and_wakeup_rmdir(&sk->sk_cgrp->css);
> +#endif
> +}
> +
> +#endif /* CONFIG_INET */
>  #endif /* _LINUX_KMEM_CGROUP_H */

> @@ -2252,9 +2254,6 @@ void sk_common_release(struct sock *sk)
>  }
>  EXPORT_SYMBOL(sk_common_release);
>  
> -static DEFINE_RWLOCK(proto_list_lock);
> -static LIST_HEAD(proto_list);
> -

Wrong patch?

-- 
 Kirill A. Shutemov

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [PATCH v2 3/9] socket: initial cgroup code.
  2011-09-07 22:17     ` Kirill A. Shutemov
  (?)
@ 2011-09-08  4:54       ` Glauber Costa
  -1 siblings, 0 replies; 59+ messages in thread
From: Glauber Costa @ 2011-09-08  4:54 UTC (permalink / raw)
  To: Kirill A. Shutemov
  Cc: linux-kernel, xemul, netdev, linux-mm, Eric W. Biederman,
	containers, David S. Miller

On 09/07/2011 07:17 PM, Kirill A. Shutemov wrote:
> On Wed, Sep 07, 2011 at 01:23:13AM -0300, Glauber Costa wrote:
>> We aim to control the amount of kernel memory pinned at any
>> time by tcp sockets. To lay the foundations for this work,
>> this patch adds a pointer to the kmem_cgroup to the socket
>> structure.
>>
>> Signed-off-by: Glauber Costa <glommer@parallels.com>
>> CC: David S. Miller <davem@davemloft.net>
>> CC: Hiroyouki Kamezawa <kamezawa.hiroyu@jp.fujitsu.com>
>> CC: Eric W. Biederman <ebiederm@xmission.com>
>> ---
>>   include/linux/kmem_cgroup.h |   29 +++++++++++++++++++++++++++++
>>   include/net/sock.h          |    2 ++
>>   net/core/sock.c             |    5 ++---
>>   3 files changed, 33 insertions(+), 3 deletions(-)
>>
>> diff --git a/include/linux/kmem_cgroup.h b/include/linux/kmem_cgroup.h
>> index 0e4a74b..77076d8 100644
>> --- a/include/linux/kmem_cgroup.h
>> +++ b/include/linux/kmem_cgroup.h
>> @@ -49,5 +49,34 @@ static inline struct kmem_cgroup *kcg_from_task(struct task_struct *tsk)
>>   	return NULL;
>>   }
>>   #endif /* CONFIG_CGROUP_KMEM */
>> +
>> +#ifdef CONFIG_INET
>
> Will it break something if you define the helpers even if CONFIG_INET
> is not defined?
> It will be much cleaner. You can reuse ifdef CONFIG_CGROUP_KMEM in this
> case.

The helpers inside CONFIG_INET are needed by the network code
regardless of whether kmem cgroup is defined or not, not the other way around.

So I could remove CONFIG_INET, but I can't possibly move it inside
CONFIG_CGROUP_KMEM. So this buys us nothing.

>> +#include <net/sock.h>
>> +static inline void sock_update_kmem_cgrp(struct sock *sk)
>> +{
>> +#ifdef CONFIG_CGROUP_KMEM
>> +	sk->sk_cgrp = kcg_from_task(current);
>> +
>> +	/*
>> +	 * We don't need to protect against anything task-related, because
>> +	 * we are basically stuck with the sock pointer that won't change,
>> +	 * even if the task that originated the socket changes cgroups.
>> +	 *
>> +	 * What we do have to guarantee, is that the chain leading us to
>> +	 * the top level won't change under our noses. Incrementing the
>> +	 * reference count via cgroup_exclude_rmdir guarantees that.
>> +	 */
>> +	cgroup_exclude_rmdir(&sk->sk_cgrp->css);
>> +#endif
>> +}
>> +
>> +static inline void sock_release_kmem_cgrp(struct sock *sk)
>> +{
>> +#ifdef CONFIG_CGROUP_KMEM
>> +	cgroup_release_and_wakeup_rmdir(&sk->sk_cgrp->css);
>> +#endif
>> +}
>> +
>> +#endif /* CONFIG_INET */
>>   #endif /* _LINUX_KMEM_CGROUP_H */
>
>> @@ -2252,9 +2254,6 @@ void sk_common_release(struct sock *sk)
>>   }
>>   EXPORT_SYMBOL(sk_common_release);
>>
>> -static DEFINE_RWLOCK(proto_list_lock);
>> -static LIST_HEAD(proto_list);
>> -
>
> Wrong patch?
Yes, it is. Thanks for noticing.


^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [PATCH v2 3/9] socket: initial cgroup code.
  2011-09-08  4:54       ` Glauber Costa
@ 2011-09-08  5:35         ` Kirill A. Shutemov
  -1 siblings, 0 replies; 59+ messages in thread
From: Kirill A. Shutemov @ 2011-09-08  5:35 UTC (permalink / raw)
  To: Glauber Costa
  Cc: linux-kernel, xemul, netdev, linux-mm, Eric W. Biederman,
	containers, David S. Miller

On Thu, Sep 08, 2011 at 01:54:03AM -0300, Glauber Costa wrote:
> On 09/07/2011 07:17 PM, Kirill A. Shutemov wrote:
> > On Wed, Sep 07, 2011 at 01:23:13AM -0300, Glauber Costa wrote:
> >> We aim to control the amount of kernel memory pinned at any
> >> time by tcp sockets. To lay the foundations for this work,
> >> this patch adds a pointer to the kmem_cgroup to the socket
> >> structure.
> >>
> >> Signed-off-by: Glauber Costa <glommer@parallels.com>
> >> CC: David S. Miller <davem@davemloft.net>
> >> CC: Hiroyouki Kamezawa <kamezawa.hiroyu@jp.fujitsu.com>
> >> CC: Eric W. Biederman <ebiederm@xmission.com>
> >> ---
> >>   include/linux/kmem_cgroup.h |   29 +++++++++++++++++++++++++++++
> >>   include/net/sock.h          |    2 ++
> >>   net/core/sock.c             |    5 ++---
> >>   3 files changed, 33 insertions(+), 3 deletions(-)
> >>
> >> diff --git a/include/linux/kmem_cgroup.h b/include/linux/kmem_cgroup.h
> >> index 0e4a74b..77076d8 100644
> >> --- a/include/linux/kmem_cgroup.h
> >> +++ b/include/linux/kmem_cgroup.h
> >> @@ -49,5 +49,34 @@ static inline struct kmem_cgroup *kcg_from_task(struct task_struct *tsk)
> >>   	return NULL;
> >>   }
> >>   #endif /* CONFIG_CGROUP_KMEM */
> >> +
> >> +#ifdef CONFIG_INET
> >
> > Will it break something if you define the helpers even if CONFIG_INET
> > is not defined?
> > It will be much cleaner. You can reuse ifdef CONFIG_CGROUP_KMEM in this
> > case.
> 
> The helpers inside CONFIG_INET are needed by the network code
> regardless of whether kmem cgroup is defined or not, not the other way around.
> 
> So I could remove CONFIG_INET, but I can't possibly move it inside
> CONFIG_CGROUP_KMEM. So this buys us nothing.

You can define empty stubs under CONFIG_CGROUP_KMEM's #else, can't you?
Like with kcg_from_cgroup()/kcg_from_task().
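
[Editorial sketch, not part of the original mail.] The layout suggested
here could look roughly like the userspace model below. The structures
are simplified stand-ins, the extra `cg` parameter replaces the
kcg_from_task(current) lookup, and alloc_free_cycle() is a hypothetical
caller added only for illustration. Building with or without
-DCONFIG_CGROUP_KMEM selects the real helpers or the empty stubs.

```c
#include <stddef.h>

/* Simplified stand-ins for the kernel structures. */
struct kmem_cgroup { int pins; };
struct sock { struct kmem_cgroup *sk_cgrp; };

#ifdef CONFIG_CGROUP_KMEM
/* Real helpers: remember the cgroup in the socket and pin it. */
static inline void sock_update_kmem_cgrp(struct sock *sk, struct kmem_cgroup *cg)
{
	sk->sk_cgrp = cg;
	cg->pins++;
}

static inline void sock_release_kmem_cgrp(struct sock *sk)
{
	sk->sk_cgrp->pins--;
}
#else
/* Empty stubs: same signatures, no effect, so callers need no #ifdef. */
static inline void sock_update_kmem_cgrp(struct sock *sk, struct kmem_cgroup *cg)
{
	(void)sk;
	(void)cg;
}

static inline void sock_release_kmem_cgrp(struct sock *sk)
{
	(void)sk;
}
#endif /* CONFIG_CGROUP_KMEM */

/* Hypothetical caller: one socket alloc/free cycle; returns the final pin count. */
static int alloc_free_cycle(struct kmem_cgroup *cg)
{
	struct sock sk = { NULL };

	sock_update_kmem_cgrp(&sk, cg);
	sock_release_kmem_cgrp(&sk);
	return cg->pins;
}
```

The trade-off in this style is that each helper signature is written
twice, in exchange for call sites and function bodies that carry no
conditional compilation.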

-- 
 Kirill A. Shutemov

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [PATCH v2 3/9] socket: initial cgroup code.
  2011-09-08  5:35         ` Kirill A. Shutemov
  (?)
@ 2011-09-08 12:41           ` Glauber Costa
  -1 siblings, 0 replies; 59+ messages in thread
From: Glauber Costa @ 2011-09-08 12:41 UTC (permalink / raw)
  To: Kirill A. Shutemov
  Cc: linux-kernel, xemul, netdev, linux-mm, Eric W. Biederman,
	containers, David S. Miller

On 09/08/2011 02:35 AM, Kirill A. Shutemov wrote:
> On Thu, Sep 08, 2011 at 01:54:03AM -0300, Glauber Costa wrote:
>> On 09/07/2011 07:17 PM, Kirill A. Shutemov wrote:
>>> On Wed, Sep 07, 2011 at 01:23:13AM -0300, Glauber Costa wrote:
>>>> We aim to control the amount of kernel memory pinned at any
>>>> time by tcp sockets. To lay the foundations for this work,
>>>> this patch adds a pointer to the kmem_cgroup to the socket
>>>> structure.
>>>>
>>>> Signed-off-by: Glauber Costa <glommer@parallels.com>
>>>> CC: David S. Miller <davem@davemloft.net>
>>>> CC: Hiroyouki Kamezawa <kamezawa.hiroyu@jp.fujitsu.com>
>>>> CC: Eric W. Biederman <ebiederm@xmission.com>
>>>> ---
>>>>    include/linux/kmem_cgroup.h |   29 +++++++++++++++++++++++++++++
>>>>    include/net/sock.h          |    2 ++
>>>>    net/core/sock.c             |    5 ++---
>>>>    3 files changed, 33 insertions(+), 3 deletions(-)
>>>>
>>>> diff --git a/include/linux/kmem_cgroup.h b/include/linux/kmem_cgroup.h
>>>> index 0e4a74b..77076d8 100644
>>>> --- a/include/linux/kmem_cgroup.h
>>>> +++ b/include/linux/kmem_cgroup.h
>>>> @@ -49,5 +49,34 @@ static inline struct kmem_cgroup *kcg_from_task(struct task_struct *tsk)
>>>>    	return NULL;
>>>>    }
>>>>    #endif /* CONFIG_CGROUP_KMEM */
>>>> +
>>>> +#ifdef CONFIG_INET
>>>
>>> Will it break something if you define the helpers even if CONFIG_INET
>>> is not defined?
>>> It will be much cleaner. You can reuse ifdef CONFIG_CGROUP_KMEM in this
>>> case.
>>
>> The helpers inside CONFIG_INET are needed by the network code
>> regardless of whether kmem cgroup is defined or not, not the other way around.
>>
>> So I could remove CONFIG_INET, but I can't possibly move it inside
>> CONFIG_CGROUP_KMEM. So this buys us nothing.
>
> You can define empty under CONFIG_CGROUP_KMEM's #else, can't you?
> Like with kcg_from_cgroup()/kcg_from_task().
>
Do you really think it is cleaner?

Why would I define as empty something that is not empty at all?
Look again. Most of those helpers would be exactly the same with or
without CONFIG_CGROUP_KMEM. The others have very few differences. If
CONFIG_INET bothers you, I can remove it altogether, making the helpers
unconditional. But moving it inside CONFIG_CGROUP_KMEM makes no sense.

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [PATCH v2 9/9] Add documentation about kmem_cgroup
  2011-09-07  4:23   ` Glauber Costa
@ 2011-09-08 17:46     ` Randy Dunlap
  -1 siblings, 0 replies; 59+ messages in thread
From: Randy Dunlap @ 2011-09-08 17:46 UTC (permalink / raw)
  To: Glauber Costa
  Cc: linux-kernel, linux-mm, containers, netdev, xemul,
	David S. Miller, Hiroyouki Kamezawa, Eric W. Biederman

On 09/06/11 21:23, Glauber Costa wrote:
> Signed-off-by: Glauber Costa <glommer@parallels.com>
> CC: David S. Miller <davem@davemloft.net>
> CC: Hiroyouki Kamezawa <kamezawa.hiroyu@jp.fujitsu.com>
> CC: Eric W. Biederman <ebiederm@xmission.com>
> CC: Randy Dunlap <rdunlap@xenotime.net>
> ---
>  Documentation/cgroups/kmem_cgroups.txt |   27 +++++++++++++++++++++++++++
>  1 files changed, 27 insertions(+), 0 deletions(-)
>  create mode 100644 Documentation/cgroups/kmem_cgroups.txt
> 
> diff --git a/Documentation/cgroups/kmem_cgroups.txt b/Documentation/cgroups/kmem_cgroups.txt
> new file mode 100644
> index 0000000..930e069
> --- /dev/null
> +++ b/Documentation/cgroups/kmem_cgroups.txt
> @@ -0,0 +1,27 @@
> +Kernel Memory Cgroup
> +====================
> +
> +This document briefly describes the kernel memory cgroup, or "kmem cgroup".
> +Unlike user memory, kernel memory cannot be swapped. This effectively means
> +that rogue processes can start operations that pin kernel objects permanently
> +into memory, exhausting resources of all other processes in the system.
> +
> +kmem_cgroup main goal is to control the amount of memory a group of processes

   kmem_cgroup's main goal

> +can pin at any given point in time. Other uses of this infrastructure are
> +expected to come up with time. Right now, the only resource effectively limited

                                                      resources

> +are tcp send and receive buffers.

or:
                                             the only resource effectively limited
  is TCP network buffers.

> +
> +TCP network buffers
> +===================
> +
> +TCP network buffers, both on the send and receive sides, can be controlled
> +by the kmem cgroup. Once a socket is created, it is attached to the cgroup of
> +the controller process, where it stays until the end of its lifetime.
> +
> +Files
> +=====
> +	kmem.tcp_maxmem: control the maximum amount in bytes that can be used by

	                 controls the maximum amount of memory in bytes ...


> +	tcp sockets inside the cgroup. 
> +
> +	kmem.tcp_current_memory: current amount in bytes used by all sockets in

	                         current amount of memory in bytes ...

> +	this cgroup
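
A minimal sketch of the capping behaviour the documentation describes,
assuming the three tcp_mem-style thresholds are kept in pages while
kmem.tcp_maxmem is given in bytes. demo_clamp_tcp_mem() and DEMO_PAGE_SIZE
are hypothetical names, and the 4 KiB page size is an assumption; the patch
may implement the cap differently.

```c
#include <assert.h>
#include <stddef.h>

#define DEMO_PAGE_SIZE 4096UL	/* assumption: 4 KiB pages */

/*
 * Clamp three thresholds (in pages) so that none of them exceeds
 * a cap expressed in bytes.
 */
static void demo_clamp_tcp_mem(long mem[3], unsigned long maxmem_bytes)
{
	long cap_pages = (long)(maxmem_bytes / DEMO_PAGE_SIZE);
	int i;

	assert(mem != NULL);
	for (i = 0; i < 3; i++)
		if (mem[i] > cap_pages)
			mem[i] = cap_pages;
}
```

With thresholds of 96, 128 and 192 pages and a cap of 100 pages' worth of
bytes, the first value is left alone and the other two are lowered to 100.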


-- 
~Randy
*** Remember to use Documentation/SubmitChecklist when testing your code ***

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [PATCH v2 1/9] per-netns ipv4 sysctl_tcp_mem
  2011-09-07  4:23   ` Glauber Costa
@ 2011-09-09  2:47     ` KAMEZAWA Hiroyuki
  -1 siblings, 0 replies; 59+ messages in thread
From: KAMEZAWA Hiroyuki @ 2011-09-09  2:47 UTC (permalink / raw)
  To: Glauber Costa
  Cc: linux-kernel, linux-mm, containers, netdev, xemul,
	David S. Miller, Eric W. Biederman

On Wed,  7 Sep 2011 01:23:11 -0300
Glauber Costa <glommer@parallels.com> wrote:

> This patch allows each namespace to independently set up
> its levels for tcp memory pressure thresholds. This patch
> alone does not buy much: we need to make these values
> per group of processes somehow. This is achieved in the
> patches that follow in this patchset.
> 
> Signed-off-by: Glauber Costa <glommer@parallels.com>
> CC: David S. Miller <davem@davemloft.net>
> CC: Hiroyouki Kamezawa <kamezawa.hiroyu@jp.fujitsu.com>
> CC: Eric W. Biederman <ebiederm@xmission.com>


Hmm, it may be better to post this patch as an independent one.

I'm not familiar with this area... but I'll try to review ;)

> ---
>  include/net/netns/ipv4.h   |    1 +
>  include/net/tcp.h          |    1 -
>  net/ipv4/sysctl_net_ipv4.c |   51 +++++++++++++++++++++++++++++++++++++------
>  net/ipv4/tcp.c             |   13 +++-------
>  4 files changed, 49 insertions(+), 17 deletions(-)
> 
> diff --git a/include/net/netns/ipv4.h b/include/net/netns/ipv4.h
> index d786b4f..bbd023a 100644
> --- a/include/net/netns/ipv4.h
> +++ b/include/net/netns/ipv4.h
> @@ -55,6 +55,7 @@ struct netns_ipv4 {
>  	int current_rt_cache_rebuild_count;
>  
>  	unsigned int sysctl_ping_group_range[2];
> +	long sysctl_tcp_mem[3];
>  
>  	atomic_t rt_genid;
>  	atomic_t dev_addr_genid;

Hmm, in the original placement, sysctl_tcp_mem[] was in the __read_mostly
area. Doesn't this placement cause many cache invalidations?



> diff --git a/include/net/tcp.h b/include/net/tcp.h
> index 149a415..6bfdd9b 100644
> --- a/include/net/tcp.h
> +++ b/include/net/tcp.h
> @@ -230,7 +230,6 @@ extern int sysctl_tcp_fack;
>  extern int sysctl_tcp_reordering;
>  extern int sysctl_tcp_ecn;
>  extern int sysctl_tcp_dsack;
> -extern long sysctl_tcp_mem[3];
>  extern int sysctl_tcp_wmem[3];
>  extern int sysctl_tcp_rmem[3];
>  extern int sysctl_tcp_app_win;
> diff --git a/net/ipv4/sysctl_net_ipv4.c b/net/ipv4/sysctl_net_ipv4.c
> index 69fd720..0d74b9d 100644
> --- a/net/ipv4/sysctl_net_ipv4.c
> +++ b/net/ipv4/sysctl_net_ipv4.c
> @@ -14,6 +14,7 @@
>  #include <linux/init.h>
>  #include <linux/slab.h>
>  #include <linux/nsproxy.h>
> +#include <linux/swap.h>
>  #include <net/snmp.h>
>  #include <net/icmp.h>
>  #include <net/ip.h>
> @@ -174,6 +175,36 @@ static int proc_allowed_congestion_control(ctl_table *ctl,
>  	return ret;
>  }
>  
> +static int ipv4_tcp_mem(ctl_table *ctl, int write,
> +			   void __user *buffer, size_t *lenp,
> +			   loff_t *ppos)
> +{
> +	int ret;
> +	unsigned long vec[3];
> +	struct net *net = current->nsproxy->net_ns;
> +	int i;
> +
> +	ctl_table tmp = {
> +		.data = &vec,
> +		.maxlen = sizeof(vec),
> +		.mode = ctl->mode,
> +	};
> +
> +	if (!write) {
> +		ctl->data = &net->ipv4.sysctl_tcp_mem;
> +		return proc_doulongvec_minmax(ctl, write, buffer, lenp, ppos);
> +	}
> +
> +	ret = proc_doulongvec_minmax(&tmp, write, buffer, lenp, ppos);
> +	if (ret)
> +		return ret;
> +
> +	for (i = 0; i < 3; i++)
> +		net->ipv4.sysctl_tcp_mem[i] = vec[i];
> +
> +	return 0;
> +}
> +
>  static struct ctl_table ipv4_table[] = {
>  	{
>  		.procname	= "tcp_timestamps",
> @@ -433,13 +464,6 @@ static struct ctl_table ipv4_table[] = {
>  		.proc_handler	= proc_dointvec
>  	},
>  	{
> -		.procname	= "tcp_mem",
> -		.data		= &sysctl_tcp_mem,
> -		.maxlen		= sizeof(sysctl_tcp_mem),
> -		.mode		= 0644,
> -		.proc_handler	= proc_doulongvec_minmax
> -	},
> -	{
>  		.procname	= "tcp_wmem",
>  		.data		= &sysctl_tcp_wmem,
>  		.maxlen		= sizeof(sysctl_tcp_wmem),
> @@ -721,6 +745,12 @@ static struct ctl_table ipv4_net_table[] = {
>  		.mode		= 0644,
>  		.proc_handler	= ipv4_ping_group_range,
>  	},
> +	{
> +		.procname	= "tcp_mem",
> +		.maxlen		= sizeof(init_net.ipv4.sysctl_tcp_mem),
> +		.mode		= 0644,
> +		.proc_handler	= ipv4_tcp_mem,
> +	},
>  	{ }
>  };
>  




> @@ -734,6 +764,7 @@ EXPORT_SYMBOL_GPL(net_ipv4_ctl_path);
>  static __net_init int ipv4_sysctl_init_net(struct net *net)
>  {
>  	struct ctl_table *table;
> +	unsigned long limit;
>  
>  	table = ipv4_net_table;
>  	if (!net_eq(net, &init_net)) {
> @@ -769,6 +800,12 @@ static __net_init int ipv4_sysctl_init_net(struct net *net)
>  
>  	net->ipv4.sysctl_rt_cache_rebuild_count = 4;
>  
> +	limit = nr_free_buffer_pages() / 8;
> +	limit = max(limit, 128UL);
> +	net->ipv4.sysctl_tcp_mem[0] = limit / 4 * 3;
> +	net->ipv4.sysctl_tcp_mem[1] = limit;
> +	net->ipv4.sysctl_tcp_mem[2] = net->ipv4.sysctl_tcp_mem[0] * 2;
> +
>  	net->ipv4.ipv4_hdr = register_net_sysctl_table(net,
>  			net_ipv4_ctl_path, table);
>  	if (net->ipv4.ipv4_hdr == NULL)
> diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
> index 46febca..f06df24 100644
> --- a/net/ipv4/tcp.c
> +++ b/net/ipv4/tcp.c
> @@ -266,6 +266,7 @@
>  #include <linux/crypto.h>
>  #include <linux/time.h>
>  #include <linux/slab.h>
> +#include <linux/nsproxy.h>
>  
>  #include <net/icmp.h>
>  #include <net/tcp.h>
> @@ -282,11 +283,9 @@ int sysctl_tcp_fin_timeout __read_mostly = TCP_FIN_TIMEOUT;
>  struct percpu_counter tcp_orphan_count;
>  EXPORT_SYMBOL_GPL(tcp_orphan_count);
>  
> -long sysctl_tcp_mem[3] __read_mostly;
>  int sysctl_tcp_wmem[3] __read_mostly;
>  int sysctl_tcp_rmem[3] __read_mostly;
>  
> -EXPORT_SYMBOL(sysctl_tcp_mem);
>  EXPORT_SYMBOL(sysctl_tcp_rmem);
>  EXPORT_SYMBOL(sysctl_tcp_wmem);
>  
> @@ -3277,14 +3276,10 @@ void __init tcp_init(void)
>  	sysctl_tcp_max_orphans = cnt / 2;
>  	sysctl_max_syn_backlog = max(128, cnt / 256);
>  
> -	limit = nr_free_buffer_pages() / 8;
> -	limit = max(limit, 128UL);
> -	sysctl_tcp_mem[0] = limit / 4 * 3;
> -	sysctl_tcp_mem[1] = limit;
> -	sysctl_tcp_mem[2] = sysctl_tcp_mem[0] * 2;
> -
>  	/* Set per-socket limits to no more than 1/128 the pressure threshold */
> -	limit = ((unsigned long)sysctl_tcp_mem[1]) << (PAGE_SHIFT - 7);
> +	limit = (unsigned long)init_net.ipv4.sysctl_tcp_mem[1];
> +	limit <<= (PAGE_SHIFT - 7);
> +

I'm not sure but... why is it defined as 'long'?
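
The arithmetic in the quoted hunks can be worked through with a small
sketch. It mirrors the three threshold assignments and the per-socket shift
from the patch, but demo_compute_tcp_limits(), struct demo_tcp_limits and
DEMO_PAGE_SHIFT are hypothetical names, and the 4 KiB page size is an
assumption.

```c
#include <assert.h>

#define DEMO_PAGE_SHIFT 12	/* assumption: 4 KiB pages */

struct demo_tcp_limits {
	long mem[3];			/* low, pressure, high; in pages */
	unsigned long max_share;	/* per-socket bound; in bytes */
};

static struct demo_tcp_limits
demo_compute_tcp_limits(unsigned long free_buffer_pages)
{
	struct demo_tcp_limits l;
	unsigned long limit = free_buffer_pages / 8;

	if (limit < 128UL)		/* same floor as max(limit, 128UL) */
		limit = 128UL;

	l.mem[0] = (long)(limit / 4 * 3);
	l.mem[1] = (long)limit;
	l.mem[2] = l.mem[0] * 2;
	assert(l.mem[0] <= l.mem[1] && l.mem[1] <= l.mem[2]);

	/* pages -> bytes is << PAGE_SHIFT; 1/128 of that is a further >> 7 */
	l.max_share = (unsigned long)l.mem[1] << (DEMO_PAGE_SHIFT - 7);
	return l;
}
```

For 1,000,000 free buffer pages this gives thresholds of 93750, 125000 and
187500 pages and a per-socket bound of 4,000,000 bytes; with no free pages
the 128-page floor gives 96, 128 and 192 pages and a bound of 4096 bytes.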



BTW, when I grep, I still see:

tcp_input.c:        atomic_long_read(&tcp_memory_allocated) < sysctl_tcp_mem[0])
tcp_input.c:    if (atomic_long_read(&tcp_memory_allocated) >= sysctl_tcp_mem[0])

Don't you need to change these as well?



Thanks,
-Kame







^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [PATCH v2 6/9] per-cgroup tcp buffers control
  2011-09-07  4:23   ` Glauber Costa
@ 2011-09-09  3:12     ` KAMEZAWA Hiroyuki
  -1 siblings, 0 replies; 59+ messages in thread
From: KAMEZAWA Hiroyuki @ 2011-09-09  3:12 UTC (permalink / raw)
  To: Glauber Costa
  Cc: linux-kernel, linux-mm, containers, netdev, xemul,
	David S. Miller, Eric W. Biederman

On Wed,  7 Sep 2011 01:23:16 -0300
Glauber Costa <glommer@parallels.com> wrote:

> With all the infrastructure in place, this patch implements
> per-cgroup control for tcp memory pressure handling.
> 
> Signed-off-by: Glauber Costa <glommer@parallels.com>
> CC: David S. Miller <davem@davemloft.net>
> CC: Hiroyouki Kamezawa <kamezawa.hiroyu@jp.fujitsu.com>
> CC: Eric W. Biederman <ebiederm@xmission.com>

Hmm, so kmem_cgroup.c is just a caller of plugins implemented
by other components?

> ---
>  include/linux/kmem_cgroup.h |    7 ++++
>  include/net/sock.h          |   10 ++++++-
>  mm/kmem_cgroup.c            |   10 ++++++-
>  net/core/sock.c             |   18 +++++++++++
>  net/ipv4/tcp.c              |   67 +++++++++++++++++++++++++++++++++++++-----
>  5 files changed, 102 insertions(+), 10 deletions(-)
> 
> diff --git a/include/linux/kmem_cgroup.h b/include/linux/kmem_cgroup.h
> index d983ba8..89ad0a1 100644
> --- a/include/linux/kmem_cgroup.h
> +++ b/include/linux/kmem_cgroup.h
> @@ -23,6 +23,13 @@
>  struct kmem_cgroup {
>  	struct cgroup_subsys_state css;
>  	struct kmem_cgroup *parent;
> +
> +#ifdef CONFIG_INET
> +	int tcp_memory_pressure;
> +	atomic_long_t tcp_memory_allocated;
> +	struct percpu_counter tcp_sockets_allocated;
> +	long tcp_prot_mem[3];
> +#endif
>  };

I think you should place 'read-mostly' values carefully.
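
One way to read this comment, as a sketch only: keep the fields that are
read on every charge but rarely written apart from the counter that is
written on every charge, so that updates to the counter do not invalidate
the cache line holding the limits. The struct below is a hypothetical
userspace stand-in (plain long instead of atomic_long_t, no percpu
counter), and the 64-byte line size is an assumption; in the kernel an
annotation such as ____cacheline_aligned_in_smp would be the usual tool.

```c
#include <assert.h>
#include <stdalign.h>
#include <stddef.h>

#define DEMO_CACHE_LINE 64	/* assumption: 64-byte cache lines */

struct demo_kmem_cgroup {
	/* Read-mostly: consulted on every charge, rarely written. */
	long tcp_prot_mem[3];
	int tcp_memory_pressure;

	/* Write-hot: updated on every charge and uncharge. */
	alignas(DEMO_CACHE_LINE) long tcp_memory_allocated;
};

/* The hot counter must start on its own cache line. */
static_assert(offsetof(struct demo_kmem_cgroup, tcp_memory_allocated)
	      % DEMO_CACHE_LINE == 0, "hot counter not line-aligned");

/* Returns non-zero if two byte offsets fall on the same cache line. */
static int demo_same_cache_line(size_t a, size_t b)
{
	return a / DEMO_CACHE_LINE == b / DEMO_CACHE_LINE;
}
```

The cost of this layout is padding: the struct grows to a multiple of the
cache line size, which may or may not be acceptable per cgroup.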


>  
> diff --git a/include/net/sock.h b/include/net/sock.h
> index ab65640..91424e3 100644
> --- a/include/net/sock.h
> +++ b/include/net/sock.h
> @@ -64,6 +64,7 @@
>  #include <net/dst.h>
>  #include <net/checksum.h>
>  
> +int sockets_populate(struct cgroup_subsys *ss, struct cgroup *cgrp);
>  /*
>   * This structure really needs to be cleaned up.
>   * Most of it is for TCP, and not used by any of
> @@ -814,7 +815,14 @@ struct proto {
>  	int			*(*memory_pressure)(struct kmem_cgroup *sg);
>  	/* Pointer to the per-cgroup version of the the sysctl_mem field */
>  	long			*(*prot_mem)(struct kmem_cgroup *sg);
> -
> +	/*
> +	 * cgroup specific initialization function. Called once for all
> +	 * protocols that implement it, from cgroups populate function.
> +	 * This function has to setup any files the protocol want to
> +	 * appear in the kmem cgroup filesystem.
> +	 */
> +	int			(*init_cgroup)(struct cgroup *cgrp,
> +					       struct cgroup_subsys *ss);
>  	int			*sysctl_wmem;
>  	int			*sysctl_rmem;
>  	int			max_header;
> diff --git a/mm/kmem_cgroup.c b/mm/kmem_cgroup.c
> index 7950e69..5e53d66 100644
> --- a/mm/kmem_cgroup.c
> +++ b/mm/kmem_cgroup.c
> @@ -17,16 +17,24 @@
>  #include <linux/cgroup.h>
>  #include <linux/slab.h>
>  #include <linux/kmem_cgroup.h>
> +#include <net/sock.h>
>  
>  static int kmem_populate(struct cgroup_subsys *ss, struct cgroup *cgrp)
>  {
> -	return 0;
> +	int ret = 0;
> +#ifdef CONFIG_NET
> +	ret = sockets_populate(ss, cgrp);
> +#endif

Shouldn't this be CONFIG_INET?

> +	return ret;
>  }
>  
>  static void
>  kmem_destroy(struct cgroup_subsys *ss, struct cgroup *cgrp)
>  {
>  	struct kmem_cgroup *cg = kcg_from_cgroup(cgrp);
> +#ifdef CONFIG_INET
> +	percpu_counter_destroy(&cg->tcp_sockets_allocated);
> +#endif
>  	kfree(cg);
>  }
>  
> diff --git a/net/core/sock.c b/net/core/sock.c
> index ead9c02..9d833cf 100644
> --- a/net/core/sock.c
> +++ b/net/core/sock.c
> @@ -134,6 +134,24 @@
>  #include <net/tcp.h>
>  #endif
>  
> +static DEFINE_RWLOCK(proto_list_lock);
> +static LIST_HEAD(proto_list);
> +
> +int sockets_populate(struct cgroup_subsys *ss, struct cgroup *cgrp)
> +{
> +	struct proto *proto;
> +	int ret = 0;
> +
> +	read_lock(&proto_list_lock);
> +	list_for_each_entry(proto, &proto_list, node) {
> +		if (proto->init_cgroup)
> +			ret |= proto->init_cgroup(cgrp, ss);
> +	}
> +	read_unlock(&proto_list_lock);
> +
> +	return ret;
> +}

Hmm, I don't understand this part but...

	ret |= ...

and no 'undo'? If there is no 'undo', ->init_cgroup() should always
succeed and no return value is required.
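
A sketch of the alternative this comment points towards: stop at the first
failure, unwind what was already set up, and return that error unchanged.
Everything here is hypothetical, including the destroy_cgroup hook (the
patch has no such callback), the array in place of proto_list, and the demo
callbacks. Note also that OR-ing negative errno values mixes them up: on
Linux, -ENOMEM | -EINVAL (that is, -12 | -22) evaluates to -2.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical per-protocol hooks; destroy_cgroup is not in the patch. */
struct demo_proto {
	int  (*init_cgroup)(void *cgrp);
	void (*destroy_cgroup)(void *cgrp);
};

/* Demo callbacks that track how many protocols are currently set up. */
static int demo_live;
static int demo_init_ok(void *cgrp)   { (void)cgrp; demo_live++; return 0; }
static int demo_init_fail(void *cgrp) { (void)cgrp; return -12; /* -ENOMEM */ }
static void demo_destroy(void *cgrp)  { (void)cgrp; demo_live--; }

static int demo_sockets_populate(struct demo_proto *protos, size_t n,
				 void *cgrp)
{
	size_t i;
	int ret = 0;

	assert(protos != NULL || n == 0);
	for (i = 0; i < n; i++) {
		if (!protos[i].init_cgroup)
			continue;
		ret = protos[i].init_cgroup(cgrp);
		if (ret)
			goto undo;
	}
	return 0;

undo:
	/* Unwind only the entries before the one that failed. */
	while (i-- > 0)
		if (protos[i].init_cgroup && protos[i].destroy_cgroup)
			protos[i].destroy_cgroup(cgrp);
	return ret;
}
```

If instead every ->init_cgroup() is guaranteed to succeed, the simpler fix
is the one suggested above: make the hook return void.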





> +
>  /*
>   * Each address family might have different locking rules, so we have
>   * one slock key per address family:
> diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
> index 76f03ed..0725dc4 100644
> --- a/net/ipv4/tcp.c
> +++ b/net/ipv4/tcp.c
> @@ -289,13 +289,6 @@ int sysctl_tcp_rmem[3] __read_mostly;
>  EXPORT_SYMBOL(sysctl_tcp_rmem);
>  EXPORT_SYMBOL(sysctl_tcp_wmem);
>  
> -atomic_long_t tcp_memory_allocated;	/* Current allocated memory. */
> -
> -/*
> - * Current number of TCP sockets.
> - */
> -struct percpu_counter tcp_sockets_allocated;
> -
>  /*
>   * TCP splice context
>   */
> @@ -305,13 +298,68 @@ struct tcp_splice_state {
>  	unsigned int flags;
>  };
>  
> +#ifdef CONFIG_CGROUP_KMEM
>  /*
>   * Pressure flag: try to collapse.
>   * Technical note: it is used by multiple contexts non atomically.
>   * All the __sk_mem_schedule() is of this nature: accounting
>   * is strict, actions are advisory and have some latency.
>   */
> -int tcp_memory_pressure __read_mostly;
> +void tcp_enter_memory_pressure(struct sock *sk)
> +{
> +	struct kmem_cgroup *sg = sk->sk_cgrp;
> +	if (!sg->tcp_memory_pressure) {
> +		NET_INC_STATS(sock_net(sk), LINUX_MIB_TCPMEMORYPRESSURES);
> +		sg->tcp_memory_pressure = 1;
> +	}
> +}
> +
> +long *tcp_sysctl_mem(struct kmem_cgroup *sg)
> +{
> +	return sg->tcp_prot_mem;
> +}
> +
> +atomic_long_t *memory_allocated_tcp(struct kmem_cgroup *sg)
> +{
> +	return &(sg->tcp_memory_allocated);
> +}
> +
> +int tcp_init_cgroup(struct cgroup *cgrp, struct cgroup_subsys *ss)
> +{
> +	struct kmem_cgroup *sg = kcg_from_cgroup(cgrp);
> +	unsigned long limit;
> +	struct net *net = current->nsproxy->net_ns;
> +
> +	sg->tcp_memory_pressure = 0;
> +	atomic_long_set(&sg->tcp_memory_allocated, 0);
> +	percpu_counter_init(&sg->tcp_sockets_allocated, 0);
> +
> +	limit = nr_free_buffer_pages() / 8;
> +	limit = max(limit, 128UL);
> +
> +	sg->tcp_prot_mem[0] = net->ipv4.sysctl_tcp_mem[0];
> +	sg->tcp_prot_mem[1] = net->ipv4.sysctl_tcp_mem[1];
> +	sg->tcp_prot_mem[2] = net->ipv4.sysctl_tcp_mem[2];
> +
> +	return 0;
> +}
> +EXPORT_SYMBOL(tcp_init_cgroup);
> +
> +int *memory_pressure_tcp(struct kmem_cgroup *sg)
> +{
> +	return &sg->tcp_memory_pressure;
> +}
> +
> +struct percpu_counter *sockets_allocated_tcp(struct kmem_cgroup *sg)
> +{
> +	return &sg->tcp_sockets_allocated;
> +}
> +#else
> +
> +/* Current number of TCP sockets. */
> +struct percpu_counter tcp_sockets_allocated;
> +atomic_long_t tcp_memory_allocated;	/* Current allocated memory. */
> +int tcp_memory_pressure;
>  

You dropped __read_mostly.

Thanks,
-kame

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [PATCH v2 6/9] per-cgroup tcp buffers control
@ 2011-09-09  3:12     ` KAMEZAWA Hiroyuki
  0 siblings, 0 replies; 59+ messages in thread
From: KAMEZAWA Hiroyuki @ 2011-09-09  3:12 UTC (permalink / raw)
  To: Glauber Costa
  Cc: linux-kernel, linux-mm, containers, netdev, xemul,
	David S. Miller, Eric W. Biederman

On Wed,  7 Sep 2011 01:23:16 -0300
Glauber Costa <glommer@parallels.com> wrote:

> With all the infrastructure in place, this patch implements
> per-cgroup control for tcp memory pressure handling.
> 
> Signed-off-by: Glauber Costa <glommer@parallels.com>
> CC: David S. Miller <davem@davemloft.net>
> CC: Hiroyouki Kamezawa <kamezawa.hiroyu@jp.fujitsu.com>
> CC: Eric W. Biederman <ebiederm@xmission.com>

Hmm, then, kmem_cgroup.c is just a caller of plugins implemented
by other components ?

> ---
>  include/linux/kmem_cgroup.h |    7 ++++
>  include/net/sock.h          |   10 ++++++-
>  mm/kmem_cgroup.c            |   10 ++++++-
>  net/core/sock.c             |   18 +++++++++++
>  net/ipv4/tcp.c              |   67 +++++++++++++++++++++++++++++++++++++-----
>  5 files changed, 102 insertions(+), 10 deletions(-)
> 
> diff --git a/include/linux/kmem_cgroup.h b/include/linux/kmem_cgroup.h
> index d983ba8..89ad0a1 100644
> --- a/include/linux/kmem_cgroup.h
> +++ b/include/linux/kmem_cgroup.h
> @@ -23,6 +23,13 @@
>  struct kmem_cgroup {
>  	struct cgroup_subsys_state css;
>  	struct kmem_cgroup *parent;
> +
> +#ifdef CONFIG_INET
> +	int tcp_memory_pressure;
> +	atomic_long_t tcp_memory_allocated;
> +	struct percpu_counter tcp_sockets_allocated;
> +	long tcp_prot_mem[3];
> +#endif
>  };

I think you should place 'read-mostly' values carefully.


>  
> diff --git a/include/net/sock.h b/include/net/sock.h
> index ab65640..91424e3 100644
> --- a/include/net/sock.h
> +++ b/include/net/sock.h
> @@ -64,6 +64,7 @@
>  #include <net/dst.h>
>  #include <net/checksum.h>
>  
> +int sockets_populate(struct cgroup_subsys *ss, struct cgroup *cgrp);
>  /*
>   * This structure really needs to be cleaned up.
>   * Most of it is for TCP, and not used by any of
> @@ -814,7 +815,14 @@ struct proto {
>  	int			*(*memory_pressure)(struct kmem_cgroup *sg);
>  	/* Pointer to the per-cgroup version of the the sysctl_mem field */
>  	long			*(*prot_mem)(struct kmem_cgroup *sg);
> -
> +	/*
> +	 * cgroup specific initialization function. Called once for all
> +	 * protocols that implement it, from cgroups populate function.
> +	 * This function has to setup any files the protocol want to
> +	 * appear in the kmem cgroup filesystem.
> +	 */
> +	int			(*init_cgroup)(struct cgroup *cgrp,
> +					       struct cgroup_subsys *ss);
>  	int			*sysctl_wmem;
>  	int			*sysctl_rmem;
>  	int			max_header;
> diff --git a/mm/kmem_cgroup.c b/mm/kmem_cgroup.c
> index 7950e69..5e53d66 100644
> --- a/mm/kmem_cgroup.c
> +++ b/mm/kmem_cgroup.c
> @@ -17,16 +17,24 @@
>  #include <linux/cgroup.h>
>  #include <linux/slab.h>
>  #include <linux/kmem_cgroup.h>
> +#include <net/sock.h>
>  
>  static int kmem_populate(struct cgroup_subsys *ss, struct cgroup *cgrp)
>  {
> -	return 0;
> +	int ret = 0;
> +#ifdef CONFIG_NET
> +	ret = sockets_populate(ss, cgrp);
> +#endif

CONFIG_INET ?

> +	return ret;
>  }
>  
>  static void
>  kmem_destroy(struct cgroup_subsys *ss, struct cgroup *cgrp)
>  {
>  	struct kmem_cgroup *cg = kcg_from_cgroup(cgrp);
> +#ifdef CONFIG_INET
> +	percpu_counter_destroy(&cg->tcp_sockets_allocated);
> +#endif
>  	kfree(cg);
>  }
>  
> diff --git a/net/core/sock.c b/net/core/sock.c
> index ead9c02..9d833cf 100644
> --- a/net/core/sock.c
> +++ b/net/core/sock.c
> @@ -134,6 +134,24 @@
>  #include <net/tcp.h>
>  #endif
>  
> +static DEFINE_RWLOCK(proto_list_lock);
> +static LIST_HEAD(proto_list);
> +
> +int sockets_populate(struct cgroup_subsys *ss, struct cgroup *cgrp)
> +{
> +	struct proto *proto;
> +	int ret = 0;
> +
> +	read_lock(&proto_list_lock);
> +	list_for_each_entry(proto, &proto_list, node) {
> +		if (proto->init_cgroup)
> +			ret |= proto->init_cgroup(cgrp, ss);
> +	}
> +	read_unlock(&proto_list_lock);
> +
> +	return ret;
> +}

Hmm, I don't fully understand this part, but...

	ret |= ...

and no 'undo'? If there is no 'undo', ->init_cgroup() should always
succeed, and then no return value is required.
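
One possible shape for the 'undo' is sketched below. This is a simplified
userspace model, not the patch's code: struct proto_ops, the
destroy_cgroup callback, and the demo protocols are hypothetical. It stops
at the first failure, tears down what was already initialised in reverse
order, and returns that one error code. OR-ing several negative errno
values together, as `ret |=` does, can produce a value that matches none
of them.

```c
#include <stddef.h>

/* Hypothetical, simplified stand-in for struct proto. */
struct proto_ops {
	int  (*init_cgroup)(void *cg);		/* returns 0 or a negative error */
	void (*destroy_cgroup)(void *cg);	/* assumed counterpart; not in the patch */
};

/*
 * Initialise every protocol in order. On the first failure, tear down the
 * ones already initialised (in reverse) and return that single error code.
 */
static int populate_all(struct proto_ops *const *protos, size_t n, void *cg)
{
	size_t i;
	int ret = 0;

	for (i = 0; i < n; i++) {
		if (!protos[i]->init_cgroup)
			continue;
		ret = protos[i]->init_cgroup(cg);
		if (ret)
			goto undo;
	}
	return 0;

undo:
	/* i indexes the entry that failed; only earlier entries are undone. */
	while (i-- > 0) {
		if (protos[i]->init_cgroup && protos[i]->destroy_cgroup)
			protos[i]->destroy_cgroup(cg);
	}
	return ret;
}

/* Demo protocols used to exercise the rollback path. */
static int live;	/* number of currently initialised demo protocols */
static int ok_init(void *cg)     { (void)cg; live++; return 0; }
static int bad_init(void *cg)    { (void)cg; return -12; /* e.g. -ENOMEM */ }
static void ok_destroy(void *cg) { (void)cg; live--; }
```

With two succeeding protocols followed by a failing one, populate_all()
returns -12 and leaves `live` at 0, i.e. nothing stays half-initialised.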





> +
>  /*
>   * Each address family might have different locking rules, so we have
>   * one slock key per address family:
> diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
> index 76f03ed..0725dc4 100644
> --- a/net/ipv4/tcp.c
> +++ b/net/ipv4/tcp.c
> @@ -289,13 +289,6 @@ int sysctl_tcp_rmem[3] __read_mostly;
>  EXPORT_SYMBOL(sysctl_tcp_rmem);
>  EXPORT_SYMBOL(sysctl_tcp_wmem);
>  
> -atomic_long_t tcp_memory_allocated;	/* Current allocated memory. */
> -
> -/*
> - * Current number of TCP sockets.
> - */
> -struct percpu_counter tcp_sockets_allocated;
> -
>  /*
>   * TCP splice context
>   */
> @@ -305,13 +298,68 @@ struct tcp_splice_state {
>  	unsigned int flags;
>  };
>  
> +#ifdef CONFIG_CGROUP_KMEM
>  /*
>   * Pressure flag: try to collapse.
>   * Technical note: it is used by multiple contexts non atomically.
>   * All the __sk_mem_schedule() is of this nature: accounting
>   * is strict, actions are advisory and have some latency.
>   */
> -int tcp_memory_pressure __read_mostly;
> +void tcp_enter_memory_pressure(struct sock *sk)
> +{
> +	struct kmem_cgroup *sg = sk->sk_cgrp;
> +	if (!sg->tcp_memory_pressure) {
> +		NET_INC_STATS(sock_net(sk), LINUX_MIB_TCPMEMORYPRESSURES);
> +		sg->tcp_memory_pressure = 1;
> +	}
> +}
> +
> +long *tcp_sysctl_mem(struct kmem_cgroup *sg)
> +{
> +	return sg->tcp_prot_mem;
> +}
> +
> +atomic_long_t *memory_allocated_tcp(struct kmem_cgroup *sg)
> +{
> +	return &(sg->tcp_memory_allocated);
> +}
> +
> +int tcp_init_cgroup(struct cgroup *cgrp, struct cgroup_subsys *ss)
> +{
> +	struct kmem_cgroup *sg = kcg_from_cgroup(cgrp);
> +	unsigned long limit;
> +	struct net *net = current->nsproxy->net_ns;
> +
> +	sg->tcp_memory_pressure = 0;
> +	atomic_long_set(&sg->tcp_memory_allocated, 0);
> +	percpu_counter_init(&sg->tcp_sockets_allocated, 0);
> +
> +	limit = nr_free_buffer_pages() / 8;
> +	limit = max(limit, 128UL);
> +
> +	sg->tcp_prot_mem[0] = net->ipv4.sysctl_tcp_mem[0];
> +	sg->tcp_prot_mem[1] = net->ipv4.sysctl_tcp_mem[1];
> +	sg->tcp_prot_mem[2] = net->ipv4.sysctl_tcp_mem[2];
> +
> +	return 0;
> +}
> +EXPORT_SYMBOL(tcp_init_cgroup);
> +
> +int *memory_pressure_tcp(struct kmem_cgroup *sg)
> +{
> +	return &sg->tcp_memory_pressure;
> +}
> +
> +struct percpu_counter *sockets_allocated_tcp(struct kmem_cgroup *sg)
> +{
> +	return &sg->tcp_sockets_allocated;
> +}
> +#else
> +
> +/* Current number of TCP sockets. */
> +struct percpu_counter tcp_sockets_allocated;
> +atomic_long_t tcp_memory_allocated;	/* Current allocated memory. */
> +int tcp_memory_pressure;
>  

you dropped __read_mostly.

Thanks,
-kame


^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [PATCH v2 1/9] per-netns ipv4 sysctl_tcp_mem
  2011-09-09  2:47     ` KAMEZAWA Hiroyuki
  (?)
@ 2011-09-09  4:19       ` Glauber Costa
  -1 siblings, 0 replies; 59+ messages in thread
From: Glauber Costa @ 2011-09-09  4:19 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki
  Cc: linux-kernel, linux-mm, containers, netdev, xemul,
	David S. Miller, Eric W. Biederman

On 09/08/2011 11:47 PM, KAMEZAWA Hiroyuki wrote:
> On Wed,  7 Sep 2011 01:23:11 -0300
> Glauber Costa<glommer@parallels.com>  wrote:
>
>> This patch allows each namespace to independently set up
>> its levels for tcp memory pressure thresholds. This patch
>> alone does not buy much: we need to make this values
>> per group of process somehow. This is achieved in the
>> patches that follows in this patchset.
>>
>> Signed-off-by: Glauber Costa<glommer@parallels.com>
>> CC: David S. Miller<davem@davemloft.net>
>> CC: Hiroyouki Kamezawa<kamezawa.hiroyu@jp.fujitsu.com>
>> CC: Eric W. Biederman<ebiederm@xmission.com>
>
>
> Hmm, it may be better to post this patch as independent one.


Maybe we can seek an ack from Eric on this one specifically
before merging, but I'd still like it to stay part of the series.
Merging this patch without the rest would leave us in a weird state.

> I'm not familiar with this area...but try review ;)
Thank you!

>
>> ---
>>   include/net/netns/ipv4.h   |    1 +
>>   include/net/tcp.h          |    1 -
>>   net/ipv4/sysctl_net_ipv4.c |   51 +++++++++++++++++++++++++++++++++++++------
>>   net/ipv4/tcp.c             |   13 +++-------
>>   4 files changed, 49 insertions(+), 17 deletions(-)
>>
>> diff --git a/include/net/netns/ipv4.h b/include/net/netns/ipv4.h
>> index d786b4f..bbd023a 100644
>> --- a/include/net/netns/ipv4.h
>> +++ b/include/net/netns/ipv4.h
>> @@ -55,6 +55,7 @@ struct netns_ipv4 {
>>   	int current_rt_cache_rebuild_count;
>>
>>   	unsigned int sysctl_ping_group_range[2];
>> +	long sysctl_tcp_mem[3];
>>
>>   	atomic_t rt_genid;
>>   	atomic_t dev_addr_genid;
>
> Hmm, in original placement, sysctl_tcp_mem[] was on __read_mostly
> area. Doesn't this placement cause many cache invalidations ?
>
Yes, you are right. I will move back to the old way of doing it.

>
>> diff --git a/include/net/tcp.h b/include/net/tcp.h
>> index 149a415..6bfdd9b 100644
>> --- a/include/net/tcp.h
>> +++ b/include/net/tcp.h
>> @@ -230,7 +230,6 @@ extern int sysctl_tcp_fack;
>>   extern int sysctl_tcp_reordering;
>>   extern int sysctl_tcp_ecn;
>>   extern int sysctl_tcp_dsack;
>> -extern long sysctl_tcp_mem[3];
>>   extern int sysctl_tcp_wmem[3];
>>   extern int sysctl_tcp_rmem[3];
>>   extern int sysctl_tcp_app_win;
>> diff --git a/net/ipv4/sysctl_net_ipv4.c b/net/ipv4/sysctl_net_ipv4.c
>> index 69fd720..0d74b9d 100644
>> --- a/net/ipv4/sysctl_net_ipv4.c
>> +++ b/net/ipv4/sysctl_net_ipv4.c
>> @@ -14,6 +14,7 @@
>>   #include<linux/init.h>
>>   #include<linux/slab.h>
>>   #include<linux/nsproxy.h>
>> +#include<linux/swap.h>
>>   #include<net/snmp.h>
>>   #include<net/icmp.h>
>>   #include<net/ip.h>
>> @@ -174,6 +175,36 @@ static int proc_allowed_congestion_control(ctl_table *ctl,
>>   	return ret;
>>   }
>>
>> +static int ipv4_tcp_mem(ctl_table *ctl, int write,
>> +			   void __user *buffer, size_t *lenp,
>> +			   loff_t *ppos)
>> +{
>> +	int ret;
>> +	unsigned long vec[3];
>> +	struct net *net = current->nsproxy->net_ns;
>> +	int i;
>> +
>> +	ctl_table tmp = {
>> +		.data =&vec,
>> +		.maxlen = sizeof(vec),
>> +		.mode = ctl->mode,
>> +	};
>> +
>> +	if (!write) {
>> +		ctl->data =&net->ipv4.sysctl_tcp_mem;
>> +		return proc_doulongvec_minmax(ctl, write, buffer, lenp, ppos);
>> +	}
>> +
>> +	ret = proc_doulongvec_minmax(&tmp, write, buffer, lenp, ppos);
>> +	if (ret)
>> +		return ret;
>> +
>> +	for (i = 0; i<  3; i++)
>> +		net->ipv4.sysctl_tcp_mem[i] = vec[i];
>> +
>> +	return 0;
>> +}
>> +
>>   static struct ctl_table ipv4_table[] = {
>>   	{
>>   		.procname	= "tcp_timestamps",
>> @@ -433,13 +464,6 @@ static struct ctl_table ipv4_table[] = {
>>   		.proc_handler	= proc_dointvec
>>   	},
>>   	{
>> -		.procname	= "tcp_mem",
>> -		.data		=&sysctl_tcp_mem,
>> -		.maxlen		= sizeof(sysctl_tcp_mem),
>> -		.mode		= 0644,
>> -		.proc_handler	= proc_doulongvec_minmax
>> -	},
>> -	{
>>   		.procname	= "tcp_wmem",
>>   		.data		=&sysctl_tcp_wmem,
>>   		.maxlen		= sizeof(sysctl_tcp_wmem),
>> @@ -721,6 +745,12 @@ static struct ctl_table ipv4_net_table[] = {
>>   		.mode		= 0644,
>>   		.proc_handler	= ipv4_ping_group_range,
>>   	},
>> +	{
>> +		.procname	= "tcp_mem",
>> +		.maxlen		= sizeof(init_net.ipv4.sysctl_tcp_mem),
>> +		.mode		= 0644,
>> +		.proc_handler	= ipv4_tcp_mem,
>> +	},
>>   	{ }
>>   };
>>
>
>
>
>
>> @@ -734,6 +764,7 @@ EXPORT_SYMBOL_GPL(net_ipv4_ctl_path);
>>   static __net_init int ipv4_sysctl_init_net(struct net *net)
>>   {
>>   	struct ctl_table *table;
>> +	unsigned long limit;
>>
>>   	table = ipv4_net_table;
>>   	if (!net_eq(net,&init_net)) {
>> @@ -769,6 +800,12 @@ static __net_init int ipv4_sysctl_init_net(struct net *net)
>>
>>   	net->ipv4.sysctl_rt_cache_rebuild_count = 4;
>>
>> +	limit = nr_free_buffer_pages() / 8;
>> +	limit = max(limit, 128UL);
>> +	net->ipv4.sysctl_tcp_mem[0] = limit / 4 * 3;
>> +	net->ipv4.sysctl_tcp_mem[1] = limit;
>> +	net->ipv4.sysctl_tcp_mem[2] = net->ipv4.sysctl_tcp_mem[0] * 2;
>> +
>>   	net->ipv4.ipv4_hdr = register_net_sysctl_table(net,
>>   			net_ipv4_ctl_path, table);
>>   	if (net->ipv4.ipv4_hdr == NULL)
>> diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
>> index 46febca..f06df24 100644
>> --- a/net/ipv4/tcp.c
>> +++ b/net/ipv4/tcp.c
>> @@ -266,6 +266,7 @@
>>   #include<linux/crypto.h>
>>   #include<linux/time.h>
>>   #include<linux/slab.h>
>> +#include<linux/nsproxy.h>
>>
>>   #include<net/icmp.h>
>>   #include<net/tcp.h>
>> @@ -282,11 +283,9 @@ int sysctl_tcp_fin_timeout __read_mostly = TCP_FIN_TIMEOUT;
>>   struct percpu_counter tcp_orphan_count;
>>   EXPORT_SYMBOL_GPL(tcp_orphan_count);
>>
>> -long sysctl_tcp_mem[3] __read_mostly;
>>   int sysctl_tcp_wmem[3] __read_mostly;
>>   int sysctl_tcp_rmem[3] __read_mostly;
>>
>> -EXPORT_SYMBOL(sysctl_tcp_mem);
>>   EXPORT_SYMBOL(sysctl_tcp_rmem);
>>   EXPORT_SYMBOL(sysctl_tcp_wmem);
>>
>> @@ -3277,14 +3276,10 @@ void __init tcp_init(void)
>>   	sysctl_tcp_max_orphans = cnt / 2;
>>   	sysctl_max_syn_backlog = max(128, cnt / 256);
>>
>> -	limit = nr_free_buffer_pages() / 8;
>> -	limit = max(limit, 128UL);
>> -	sysctl_tcp_mem[0] = limit / 4 * 3;
>> -	sysctl_tcp_mem[1] = limit;
>> -	sysctl_tcp_mem[2] = sysctl_tcp_mem[0] * 2;
>> -
>>   	/* Set per-socket limits to no more than 1/128 the pressure threshold */
>> -	limit = ((unsigned long)sysctl_tcp_mem[1])<<  (PAGE_SHIFT - 7);
>> +	limit = (unsigned long)init_net.ipv4.sysctl_tcp_mem[1];
>> +	limit<<= (PAGE_SHIFT - 7);
>> +
>
> I'm not sure but...why defined as 'long'  ?
>

It is part of the "it was there before" bundle.

It is defined as long not only for tcp, but for all of the
equivalent sysctls as well. So there is no reason to touch it, at
least not in this series =)
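
For reference, the default threshold arithmetic in the quoted hunks can be
reproduced in userspace as below. This is only a sketch: the struct and
function names are hypothetical, the page count is passed in instead of
coming from nr_free_buffer_pages(), and unsigned long is used throughout
even though the sysctl itself is declared long.

```c
/* Hypothetical container for the three tcp_mem thresholds, in pages. */
struct tcp_mem_defaults {
	unsigned long min;	/* tcp_mem[0] */
	unsigned long pressure;	/* tcp_mem[1] */
	unsigned long max;	/* tcp_mem[2] */
};

/* Mirrors: limit = pages / 8, floor of 128, then 3/4, 1x and 3/2 of limit. */
static struct tcp_mem_defaults tcp_mem_from_pages(unsigned long free_buffer_pages)
{
	struct tcp_mem_defaults d;
	unsigned long limit = free_buffer_pages / 8;

	if (limit < 128UL)
		limit = 128UL;

	d.min = limit / 4 * 3;
	d.pressure = limit;
	d.max = d.min * 2;
	return d;
}

/*
 * Per-socket cap in bytes: shifting the pressure threshold (in pages) left
 * by (page_shift - 7) is the same as pages * page_size / 128.
 */
static unsigned long per_socket_limit(unsigned long pressure_pages,
				      unsigned int page_shift)
{
	return pressure_pages << (page_shift - 7);
}
```

For example, with 1048576 free buffer pages the thresholds come out as
98304, 131072 and 196608 pages, and with 4 KiB pages (page_shift of 12)
the per-socket cap is 4194304 bytes.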

>
> BTW, when I grep,
>
> tcp_input.c:        atomic_long_read(&tcp_memory_allocated)<  sysctl_tcp_mem[0])
> tcp_input.c:    if (atomic_long_read(&tcp_memory_allocated)>= sysctl_tcp_mem[0])
>
> Don't you need to change this ?

It ended up being changed in another patch; I got the split wrong.

Thank you, I will reorder it so the change lands in the right patch.
>
>
> Thanks,
> -Kame
>
>
>
>
>
>
>

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Fight unfair telecom internet charges in Canada: sign http://stopthemeter.ca/
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [PATCH v2 6/9] per-cgroup tcp buffers control
  2011-09-09  3:12     ` KAMEZAWA Hiroyuki
  (?)
@ 2011-09-09 12:01       ` Glauber Costa
  -1 siblings, 0 replies; 59+ messages in thread
From: Glauber Costa @ 2011-09-09 12:01 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki
  Cc: linux-kernel, linux-mm, containers, netdev, xemul,
	David S. Miller, Eric W. Biederman

On 09/09/2011 12:12 AM, KAMEZAWA Hiroyuki wrote:
> On Wed,  7 Sep 2011 01:23:16 -0300
> Glauber Costa<glommer@parallels.com>  wrote:
>
>> With all the infrastructure in place, this patch implements
>> per-cgroup control for tcp memory pressure handling.
>>
>> Signed-off-by: Glauber Costa<glommer@parallels.com>
>> CC: David S. Miller<davem@davemloft.net>
>> CC: Hiroyouki Kamezawa<kamezawa.hiroyu@jp.fujitsu.com>
>> CC: Eric W. Biederman<ebiederm@xmission.com>
>
> Hmm, then, kmem_cgroup.c is just a caller of plugins implemented
> by other components ?

Kame,

Refer to my discussion with Greg. How would you feel about it being 
accounted to a single "kernel memory" limit in memcg?

Thanks!

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [PATCH v2 6/9] per-cgroup tcp buffers control
  2011-09-09 12:01       ` Glauber Costa
  (?)
@ 2011-09-12 10:31         ` KAMEZAWA Hiroyuki
  -1 siblings, 0 replies; 59+ messages in thread
From: KAMEZAWA Hiroyuki @ 2011-09-12 10:31 UTC (permalink / raw)
  To: Glauber Costa
  Cc: linux-kernel, linux-mm, containers, netdev, xemul,
	David S. Miller, Eric W. Biederman

On Fri, 9 Sep 2011 09:01:32 -0300
Glauber Costa <glommer@parallels.com> wrote:

> On 09/09/2011 12:12 AM, KAMEZAWA Hiroyuki wrote:
> > On Wed,  7 Sep 2011 01:23:16 -0300
> > Glauber Costa<glommer@parallels.com>  wrote:
> >
> >> With all the infrastructure in place, this patch implements
> >> per-cgroup control for tcp memory pressure handling.
> >>
> >> Signed-off-by: Glauber Costa<glommer@parallels.com>
> >> CC: David S. Miller<davem@davemloft.net>
> >> CC: Hiroyouki Kamezawa<kamezawa.hiroyu@jp.fujitsu.com>
> >> CC: Eric W. Biederman<ebiederm@xmission.com>
> >
> > Hmm, then, kmem_cgroup.c is just a caller of plugins implemented
> > by other components ?
> 
> Kame,
> 
> Refer to my discussion with Greg. How would you feel about it being 
> accounted to a single "kernel memory" limit in memcg?
> 

Hmm, it's argued that 'cgroup is hard to use, it's difficult!!!'.

Then, if the implementation is clean, I think it may be good to add
a kmem limit to memcg.

Your and Greg's idea is to have

	memory.kmem_limit_in_bytes 
?

Thanks,
-Kame

end of thread, other threads:[~2011-09-12 10:32 UTC | newest]

Thread overview: 59+ messages
2011-09-07  4:23 [PATCH v2 0/9] per-cgroup tcp buffers limitation Glauber Costa
2011-09-07  4:23 ` Glauber Costa
2011-09-07  4:23 ` [PATCH v2 1/9] per-netns ipv4 sysctl_tcp_mem Glauber Costa
2011-09-07  4:23   ` Glauber Costa
2011-09-09  2:47   ` KAMEZAWA Hiroyuki
2011-09-09  2:47     ` KAMEZAWA Hiroyuki
2011-09-09  4:19     ` Glauber Costa
2011-09-09  4:19       ` Glauber Costa
2011-09-09  4:19       ` Glauber Costa
2011-09-07  4:23 ` [PATCH v2 2/9] Kernel Memory cgroup Glauber Costa
2011-09-07  4:23   ` Glauber Costa
2011-09-07  5:24   ` Paul Menage
2011-09-07  5:24     ` Paul Menage
2011-09-07  5:55     ` Glauber Costa
2011-09-07  5:55       ` Glauber Costa
2011-09-07  5:55       ` Glauber Costa
2011-09-07  4:23 ` [PATCH v2 3/9] socket: initial cgroup code Glauber Costa
2011-09-07  4:23   ` Glauber Costa
2011-09-07  5:26   ` Paul Menage
2011-09-07  5:26     ` Paul Menage
2011-09-07  5:59     ` Glauber Costa
2011-09-07  5:59       ` Glauber Costa
2011-09-07  5:59       ` Glauber Costa
2011-09-07 22:17   ` Kirill A. Shutemov
2011-09-07 22:17     ` Kirill A. Shutemov
2011-09-08  4:54     ` Glauber Costa
2011-09-08  4:54       ` Glauber Costa
2011-09-08  4:54       ` Glauber Costa
2011-09-08  5:35       ` Kirill A. Shutemov
2011-09-08  5:35         ` Kirill A. Shutemov
2011-09-08 12:41         ` Glauber Costa
2011-09-08 12:41           ` Glauber Costa
2011-09-08 12:41           ` Glauber Costa
2011-09-07  4:23 ` [PATCH v2 4/9] function wrappers for upcoming socket Glauber Costa
2011-09-07  4:23   ` Glauber Costa
2011-09-07  4:23 ` [PATCH v2 5/9] foundations of per-cgroup memory pressure controlling Glauber Costa
2011-09-07  4:23   ` Glauber Costa
2011-09-07  4:23 ` [PATCH v2 6/9] per-cgroup tcp buffers control Glauber Costa
2011-09-07  4:23   ` Glauber Costa
2011-09-07  7:32   ` Li Zefan
2011-09-07 13:02     ` Glauber Costa
2011-09-07 13:02       ` Glauber Costa
2011-09-07 13:02       ` Glauber Costa
2011-09-09  3:12   ` KAMEZAWA Hiroyuki
2011-09-09  3:12     ` KAMEZAWA Hiroyuki
2011-09-09 12:01     ` Glauber Costa
2011-09-09 12:01       ` Glauber Costa
2011-09-09 12:01       ` Glauber Costa
2011-09-12 10:31       ` KAMEZAWA Hiroyuki
2011-09-12 10:31         ` KAMEZAWA Hiroyuki
2011-09-12 10:31         ` KAMEZAWA Hiroyuki
2011-09-07  4:23 ` [PATCH v2 7/9] tcp buffer limitation: per-cgroup limit Glauber Costa
2011-09-07  4:23   ` Glauber Costa
2011-09-07  4:23 ` [PATCH v2 8/9] Display current tcp memory allocation in kmem cgroup Glauber Costa
2011-09-07  4:23   ` Glauber Costa
2011-09-07  4:23 ` [PATCH v2 9/9] Add documentation about kmem_cgroup Glauber Costa
2011-09-07  4:23   ` Glauber Costa
2011-09-08 17:46   ` Randy Dunlap
2011-09-08 17:46     ` Randy Dunlap
