All of lore.kernel.org
 help / color / mirror / Atom feed
* [PATCH 0/5] Networking cgroup controller
@ 2016-08-11  0:53 Anoop Naravaram
  2016-08-11  0:53 ` [PATCH 1/5] net: create the networking " Anoop Naravaram
                   ` (6 more replies)
  0 siblings, 7 replies; 25+ messages in thread
From: Anoop Naravaram @ 2016-08-11  0:53 UTC (permalink / raw)
  To: corbet, tj, lizefan, hannes, davem, kuznet, jmorris, yoshfuji,
	kaber, linux-doc, cgroups, netdev
  Cc: edumazet, maheshb, weiwan, tom, Anoop Naravaram

This patchset introduces a cgroup controller for the networking subsystem as a
whole. As of now, this controller will be used for:

* Limiting the specific ports that a process in a cgroup is allowed to bind
  to or listen on. For example, you can say that all the processes in a
  cgroup can only bind to ports 1000-2000, and listen on ports 1000-1100, which
  guarantees that the remaining ports will be available for other processes.

* Restricting which DSCP values processes can use with their sockets. For
  example, you can say that all the processes in a cgroup can only send
  packets with a DSCP tag between 48 and 63 (corresponding to TOS values of
  192 to 255).

* Limiting the total number of udp ports that can be used by a process in a
  cgroup. For example, you can say that all the processes in one cgroup are
  allowed to use a total of up to 100 udp ports. Since the total number of udp
  ports that can be used by all processes is limited, this is useful for
  rationing out the ports to different process groups.

In the future, more networking-related properties may be added to this
controller.

Anoop Naravaram (5):
  net: create the networking cgroup controller
  net: add bind/listen ranges to net cgroup
  net: add udp limit to net cgroup
  net: add dscp ranges to net cgroup
  net: add test for net cgroup

 Documentation/cgroup-v1/net.txt   |  95 +++++
 include/linux/cgroup_subsys.h     |   4 +
 include/net/net_cgroup.h          | 103 ++++++
 net/Kconfig                       |  10 +
 net/core/Makefile                 |   1 +
 net/core/net_cgroup.c             | 706 ++++++++++++++++++++++++++++++++++++++
 net/ipv4/af_inet.c                |   8 +
 net/ipv4/inet_connection_sock.c   |   7 +
 net/ipv4/ip_sockglue.c            |  13 +
 net/ipv4/udp.c                    |   8 +
 net/ipv6/af_inet6.c               |   7 +
 net/ipv6/datagram.c               |   9 +
 net/ipv6/ipv6_sockglue.c          |   8 +
 scripts/cgroup/net_cgroup_test.py | 359 +++++++++++++++++++
 14 files changed, 1338 insertions(+)
 create mode 100644 Documentation/cgroup-v1/net.txt
 create mode 100644 include/net/net_cgroup.h
 create mode 100644 net/core/net_cgroup.c
 create mode 100755 scripts/cgroup/net_cgroup_test.py

-- 
2.8.0.rc3.226.g39d4020

^ permalink raw reply	[flat|nested] 25+ messages in thread

* [PATCH 1/5] net: create the networking cgroup controller
  2016-08-11  0:53 [PATCH 0/5] Networking cgroup controller Anoop Naravaram
@ 2016-08-11  0:53 ` Anoop Naravaram
  2016-08-11  0:53   ` Anoop Naravaram
                   ` (5 subsequent siblings)
  6 siblings, 0 replies; 25+ messages in thread
From: Anoop Naravaram @ 2016-08-11  0:53 UTC (permalink / raw)
  To: corbet, tj, lizefan, hannes, davem, kuznet, jmorris, yoshfuji,
	kaber, linux-doc, cgroups, netdev
  Cc: edumazet, maheshb, weiwan, tom, Anoop Naravaram

This is a skeleton implementation of a cgroup controller for networking
properties. It will be used for:
* limiting the specific ports that a process in a cgroup is allowed to bind
  to or listen on
* restricting which dscp values processes can use with their sockets
* limiting the total number of udp ports that can be used by a process

Also there is new documentation of this controller in
Documentation/cgroup-v1/net.txt.

Signed-off-by: Anoop Naravaram <anaravaram@google.com>
---
 Documentation/cgroup-v1/net.txt |  9 ++++++
 include/linux/cgroup_subsys.h   |  4 +++
 include/net/net_cgroup.h        | 27 ++++++++++++++++++
 net/Kconfig                     | 10 +++++++
 net/core/Makefile               |  1 +
 net/core/net_cgroup.c           | 62 +++++++++++++++++++++++++++++++++++++++++
 6 files changed, 113 insertions(+)
 create mode 100644 Documentation/cgroup-v1/net.txt
 create mode 100644 include/net/net_cgroup.h
 create mode 100644 net/core/net_cgroup.c

diff --git a/Documentation/cgroup-v1/net.txt b/Documentation/cgroup-v1/net.txt
new file mode 100644
index 0000000..580c214
--- /dev/null
+++ b/Documentation/cgroup-v1/net.txt
@@ -0,0 +1,9 @@
+Networking cgroup
+=================
+
+The net cgroup controller keeps track of the following networking related
+properties for each process group:
+* bind port ranges
+* listen port ranges
+* dscp ranges
+* udp port usage and limit
diff --git a/include/linux/cgroup_subsys.h b/include/linux/cgroup_subsys.h
index 0df0336a..81ff75b 100644
--- a/include/linux/cgroup_subsys.h
+++ b/include/linux/cgroup_subsys.h
@@ -40,6 +40,10 @@ SUBSYS(freezer)
 SUBSYS(net_cls)
 #endif
 
+#if IS_ENABLED(CONFIG_CGROUP_NET)
+SUBSYS(net)
+#endif
+
 #if IS_ENABLED(CONFIG_CGROUP_PERF)
 SUBSYS(perf_event)
 #endif
diff --git a/include/net/net_cgroup.h b/include/net/net_cgroup.h
new file mode 100644
index 0000000..8e98803
--- /dev/null
+++ b/include/net/net_cgroup.h
@@ -0,0 +1,27 @@
+/*
+ * net_cgroup.h			Networking Control Group
+ *
+ * Copyright (C) 2016 Google, Inc.
+ *
+ * Authors:	Anoop Naravaram <anaravaram@google.com>
+ *
+ * This program is free software; you can redistribute it and/or modify it
+ * under the terms of the GNU General Public License as published by the Free
+ * Software Foundation; either version 2 of the License, or (at your option)
+ * any later version.
+ *
+ */
+
+#ifndef _NET_CGROUP_H
+#define _NET_CGROUP_H
+
+#include <linux/cgroup.h>
+
+#ifdef CONFIG_CGROUP_NET
+
+struct net_cgroup {
+	struct cgroup_subsys_state	css;
+};
+
+#endif /* CONFIG_CGROUP_NET */
+#endif  /* _NET_CGROUP_H */
diff --git a/net/Kconfig b/net/Kconfig
index c2cdbce..47f68bd 100644
--- a/net/Kconfig
+++ b/net/Kconfig
@@ -278,6 +278,16 @@ config CGROUP_NET_CLASSID
 	  Cgroup subsystem for use as general purpose socket classid marker that is
 	  being used in cls_cgroup and for netfilter matching.
 
+config CGROUP_NET
+	bool "Networking cgroup"
+	depends on CGROUPS
+	---help---
+	  Cgroup subsystem for use in managing several networking properties,
+	  such as restricting which ports are available for processes to bind
+	  and listen on, restricting which dscp values processes can use with
+	  their sockets, and limiting the number of udp ports that can be
+	  acquired by processes from the cgroup.
+
 config NET_RX_BUSY_POLL
 	bool
 	default y
diff --git a/net/core/Makefile b/net/core/Makefile
index d6508c2..9dbc8b6 100644
--- a/net/core/Makefile
+++ b/net/core/Makefile
@@ -23,6 +23,7 @@ obj-$(CONFIG_NETWORK_PHY_TIMESTAMPING) += timestamping.o
 obj-$(CONFIG_NET_PTP_CLASSIFY) += ptp_classifier.o
 obj-$(CONFIG_CGROUP_NET_PRIO) += netprio_cgroup.o
 obj-$(CONFIG_CGROUP_NET_CLASSID) += netclassid_cgroup.o
+obj-$(CONFIG_CGROUP_NET) += net_cgroup.o
 obj-$(CONFIG_LWTUNNEL) += lwtunnel.o
 obj-$(CONFIG_DST_CACHE) += dst_cache.o
 obj-$(CONFIG_HWBM) += hwbm.o
diff --git a/net/core/net_cgroup.c b/net/core/net_cgroup.c
new file mode 100644
index 0000000..3a46960
--- /dev/null
+++ b/net/core/net_cgroup.c
@@ -0,0 +1,62 @@
+/*
+ * net/core/net_cgroup.c	Networking Control Group
+ *
+ * Copyright (C) 2016 Google, Inc.
+ *
+ * This program is free software; you can redistribute it and/or modify it
+ * under the terms of the GNU General Public License as published by the Free
+ * Software Foundation; either version 2 of the License, or (at your option)
+ * any later version.
+ *
+ * Authors:	Anoop Naravaram <anaravaram@google.com>
+ */
+
+#include <linux/slab.h>
+#include <net/net_cgroup.h>
+
+static struct net_cgroup *css_to_net_cgroup(struct cgroup_subsys_state *css)
+{
+	return css ? container_of(css, struct net_cgroup, css) : NULL;
+}
+
+static struct net_cgroup *task_to_net_cgroup(struct task_struct *p)
+{
+	return css_to_net_cgroup(task_css(p, net_cgrp_id));
+}
+
+static struct net_cgroup *net_cgroup_to_parent(struct net_cgroup *netcg)
+{
+	return css_to_net_cgroup(netcg->css.parent);
+}
+
+static void free_net_cgroup(struct net_cgroup *netcg)
+{
+	kfree(netcg);
+}
+
+static struct cgroup_subsys_state *
+cgrp_css_alloc(struct cgroup_subsys_state *parent_css)
+{
+	struct net_cgroup *netcg;
+
+	netcg = kzalloc(sizeof(*netcg), GFP_KERNEL);
+	if (!netcg)
+		return ERR_PTR(-ENOMEM);
+
+	return &netcg->css;
+}
+
+static void cgrp_css_free(struct cgroup_subsys_state *css)
+{
+	free_net_cgroup(css_to_net_cgroup(css));
+}
+
+static struct cftype ss_files[] = {
+	{ }	/* terminate */
+};
+
+struct cgroup_subsys net_cgrp_subsys = {
+	.css_alloc		= cgrp_css_alloc,
+	.css_free		= cgrp_css_free,
+	.legacy_cftypes		= ss_files,
+};
-- 
2.8.0.rc3.226.g39d4020


^ permalink raw reply related	[flat|nested] 25+ messages in thread

* [PATCH 2/5] net: add bind/listen ranges to net cgroup
  2016-08-11  0:53 [PATCH 0/5] Networking cgroup controller Anoop Naravaram
@ 2016-08-11  0:53   ` Anoop Naravaram
  2016-08-11  0:53   ` Anoop Naravaram
                     ` (5 subsequent siblings)
  6 siblings, 0 replies; 25+ messages in thread
From: Anoop Naravaram @ 2016-08-11  0:53 UTC (permalink / raw)
  To: corbet, tj, lizefan, hannes, davem, kuznet, jmorris, yoshfuji,
	kaber, linux-doc, cgroups, netdev
  Cc: edumazet, maheshb, weiwan, tom, Anoop Naravaram

bind port ranges
----------------
This property controls which ports the processes in a cgroup are allowed
to bind to. If a process in a cgroup tries to bind a socket to a port
that is not within the range(s) permitted by the cgroup, it will receive an
EACCES error.

>From userspace, you can get or set the bind port ranges by accessing the
'net.bind_port_ranges' file. To set the ranges of a cgroup, write the
comma-separated ranges to the file, where each range could be either a
pair of ports separated by a hyphen (-), or just an individual port. For
example, say you want to allow all the processes in a cgroup to be allowed
to bind to ports 100 through 200 (inclusive), 300 through 320 (inclusive)
and 350. Then you can write the string "100-200,300-320,350" to the
'net.bind_port_ranges' file. When reading the file, any individual ports
will be read as a "start-end" range where the start and end are equal.
The example above would be read as "100-200,300-320,350-350".

The controller imposes the invariant that the ranges of any cgroup must
be a subset (or equal set) of the ranges of its parents (i.e., processes
in a cgroup cannot be allowed to bind to any port that is not allowed
by the parent cgroup). This constraint allows you to ensure that not
only do the processes in a cgroup follow the bind range, but so do the
processes in any of the cgroup's descendants. The way this is enforced
is because of two things: 1) when a cgroup is initialized, its ranges
are inherited from its parent, and 2) when attempting to set the ranges of
a cgroup, the kernel ensures that the condition is true for the current
cgroup and all its children, or otherwise fails to change the ranges
with error EINVAL.

listen port ranges
------------------
This property controls which ports the processes in a cgroup are allowed
to listen on. If a process in a cgroup tries to listen using a socket
bound to a port that is not within the range(s) permitted by the cgroup,
it will receive an EACCES error.

Configuring this property works the same way as with bind port ranges,
except using the file 'net.listen_port_ranges' instead of
'net.bind_port_ranges'. The range subset invariant is imposed
independently for bind and listen port ranges. For now the kernel does
not enforce that the listen range must be a subset of the bind range.

Tested: Used a python unittest to set the range and try
binding/listening to ports inside and outside the range, and ensure
that an error occurred only when it should. Also, ensures that an error
occurs when trying to violate the subset condition.

Signed-off-by: Anoop Naravaram <anaravaram@google.com>
---
 Documentation/cgroup-v1/net.txt |  46 ++++++
 include/net/net_cgroup.h        |  41 +++++
 net/core/net_cgroup.c           | 341 ++++++++++++++++++++++++++++++++++++++++
 net/ipv4/af_inet.c              |   8 +
 net/ipv4/inet_connection_sock.c |   7 +
 net/ipv6/af_inet6.c             |   7 +
 6 files changed, 450 insertions(+)

diff --git a/Documentation/cgroup-v1/net.txt b/Documentation/cgroup-v1/net.txt
index 580c214..8c50c61 100644
--- a/Documentation/cgroup-v1/net.txt
+++ b/Documentation/cgroup-v1/net.txt
@@ -7,3 +7,49 @@ properties for each process group:
 * listen port ranges
 * dscp ranges
 * udp port usage and limit
+
+bind port ranges
+----------------
+This property controls which ports the processes in a cgroup are allowed
+to bind to. If a process in a cgroup tries to bind a socket to a port
+that is not within the range(s) permitted by the cgroup, it will receive an
+EACCES error.
+
+This property is exposed to userspace through the 'net.bind_port_ranges' file,
+as ranges of ports that the processes can bind to (as described in the HOW TO
+INTERACT WITH RANGES FILES section).
+
+listen port ranges
+------------------
+This property controls which ports the processes in a cgroup are allowed
+to listen on. If a process in a cgroup tries to listen using a socket
+bound to a port that is not within the range(s) permitted by the cgroup,
+it will receive an EACCES error.
+
+This property is exposed to userspace through the 'net.listen_port_ranges' file,
+as ranges of ports that the processes can listen on (as described in the HOW TO
+INTERACT WITH RANGES FILES section).
+
+HOW TO INTERACT WITH RANGES FILES
+---------------------------------
+Some cgroup properties can be expressed as ranges of allowed integers. From
+userspace, you can get or set them by accessing the cgroup file corresponding to
+the property you want to interact with. To set the ranges, write a list of
+comma-separated ranges to the file, where each range could be either a pair of
+integers separated by a hyphen (-), or just an individual integer. For example,
+say you want a cgroup to allow the integers 100 through 200 (inclusive), 300
+through 320 (inclusive) and 350. Then you can write the string
+"100-200,300-320,350" to the file. When reading the file, any individual
+integers will be read as a "start-end" range where the start and end are equal.
+The example above would be read as "100-200,300-320,350-350".
+
+The controller imposes the invariant that the ranges allowed by any cgroup must
+be a subset (or equal set) of the ranges allowed by its parent (i.e., a cgroup
+does not allow any integers not allowed by its parent cgroup). This constraint
+allows you to ensure that not only are the processes in any given cgroup
+contrained by its ranges, but so are the processes in any of the cgroup's
+descendants. The way this is enforced is by two things: 1) when a cgroup is
+initialized, its ranges are inherited from its parent, and 2) when attempting to
+set the ranges of a cgroup, the kernel ensures that the invariant is true for
+the current cgroup and all its children, or otherwise fails to change the ranges
+with error EINVAL.
diff --git a/include/net/net_cgroup.h b/include/net/net_cgroup.h
index 8e98803..6ee79d5 100644
--- a/include/net/net_cgroup.h
+++ b/include/net/net_cgroup.h
@@ -16,12 +16,53 @@
 #define _NET_CGROUP_H
 
 #include <linux/cgroup.h>
+#include <linux/types.h>
 
 #ifdef CONFIG_CGROUP_NET
+/* range type */
+enum {
+	NETCG_LISTEN_RANGES,
+	NETCG_BIND_RANGES,
+	NETCG_NUM_RANGE_TYPES
+};
+
+struct net_range {
+	u16 min_value;
+	u16 max_value;
+};
+
+struct net_ranges {
+	int			num_entries;
+	struct rcu_head		rcu;
+	struct net_range	range[0];
+};
+
+struct net_range_types {
+	struct net_ranges __rcu	*ranges;
+	u16			lower_limit;
+	u16			upper_limit;
+};
 
 struct net_cgroup {
 	struct cgroup_subsys_state	css;
+
+	/* these fields are required for bind/listen port ranges */
+	struct mutex			range_lock;
+	struct net_range_types		whitelists[NETCG_NUM_RANGE_TYPES];
 };
 
+bool net_cgroup_bind_allowed(u16 port);
+bool net_cgroup_listen_allowed(u16 port);
+
+#else /* !CONFIG_CGROUP_NET */
+static inline bool net_cgroup_bind_allowed(u16 port)
+{
+	return true;
+}
+static inline bool net_cgroup_listen_allowed(u16 port)
+{
+	return true;
+}
+
 #endif /* CONFIG_CGROUP_NET */
 #endif  /* _NET_CGROUP_H */
diff --git a/net/core/net_cgroup.c b/net/core/net_cgroup.c
index 3a46960..7e69ad5 100644
--- a/net/core/net_cgroup.c
+++ b/net/core/net_cgroup.c
@@ -12,8 +12,19 @@
  */
 
 #include <linux/slab.h>
+#include <linux/ctype.h>
 #include <net/net_cgroup.h>
 
+#define BYTES_PER_ENTRY		sizeof(struct net_range)
+#define MAX_WRITE_SIZE		4096
+
+#define MIN_PORT_VALUE		0
+#define MAX_PORT_VALUE		65535
+
+/* Deriving MAX_ENTRIES from MAX_WRITE_SIZE as a rough estimate */
+#define MAX_ENTRIES ((MAX_WRITE_SIZE - offsetof(struct net_ranges, range)) /   \
+		     BYTES_PER_ENTRY)
+
 static struct net_cgroup *css_to_net_cgroup(struct cgroup_subsys_state *css)
 {
 	return css ? container_of(css, struct net_cgroup, css) : NULL;
@@ -29,8 +40,78 @@ static struct net_cgroup *net_cgroup_to_parent(struct net_cgroup *netcg)
 	return css_to_net_cgroup(netcg->css.parent);
 }
 
+static struct net_ranges *alloc_net_ranges(int num_entries)
+{
+	struct net_ranges *ranges;
+
+	ranges = kmalloc(offsetof(struct net_ranges, range[num_entries]),
+			 GFP_KERNEL);
+	if (!ranges)
+		return NULL;
+
+	ranges->num_entries = num_entries;
+
+	return ranges;
+}
+
+static int alloc_init_net_ranges(struct net_range_types *r, int min_value,
+				 int max_value)
+{
+	struct net_ranges *ranges;
+
+	ranges = alloc_net_ranges(1);
+	if (!ranges)
+		return -ENOMEM;
+
+	ranges->range[0].min_value = min_value;
+	ranges->range[0].max_value = max_value;
+	r->lower_limit = min_value;
+	r->upper_limit = max_value;
+	rcu_assign_pointer(r->ranges, ranges);
+
+	return 0;
+}
+
+static int alloc_copy_net_ranges(struct net_range_types *r,
+				 int min_value,
+				 int max_value,
+				 struct net_range_types *parent_rt)
+{
+	struct net_ranges *ranges;
+	struct net_ranges *parent_ranges;
+	int i; /* loop counter */
+
+	parent_ranges = rcu_dereference(parent_rt->ranges);
+	ranges = alloc_net_ranges(parent_ranges->num_entries);
+	if (!ranges)
+		return -ENOMEM;
+	for (i = 0; i < parent_ranges->num_entries; i++) {
+		ranges->range[i].min_value = parent_ranges->range[i].min_value;
+		ranges->range[i].max_value = parent_ranges->range[i].max_value;
+	}
+
+	r->lower_limit = min_value;
+	r->upper_limit = max_value;
+	rcu_assign_pointer(r->ranges, ranges);
+
+	return 0;
+}
+
 static void free_net_cgroup(struct net_cgroup *netcg)
 {
+	int i;
+
+	mutex_lock(&netcg->range_lock);
+	for (i = 0; i < NETCG_NUM_RANGE_TYPES; i++) {
+		struct net_ranges *range =
+		    rcu_dereference_protected(netcg->whitelists[i].ranges,
+					      1);
+
+		if (range)
+			kfree_rcu(range, rcu);
+	}
+	mutex_unlock(&netcg->range_lock);
+
 	kfree(netcg);
 }
 
@@ -38,11 +119,43 @@ static struct cgroup_subsys_state *
 cgrp_css_alloc(struct cgroup_subsys_state *parent_css)
 {
 	struct net_cgroup *netcg;
+	struct net_cgroup *parent_netcg = css_to_net_cgroup(parent_css);
 
 	netcg = kzalloc(sizeof(*netcg), GFP_KERNEL);
 	if (!netcg)
 		return ERR_PTR(-ENOMEM);
 
+	mutex_init(&netcg->range_lock);
+
+	/* allocate the listen and bind range whitelists */
+	if (!parent_netcg) {
+		/* if root, then init ranges with full range */
+		if (alloc_init_net_ranges(
+				&netcg->whitelists[NETCG_BIND_RANGES],
+				MIN_PORT_VALUE, MAX_PORT_VALUE) ||
+		    alloc_init_net_ranges(
+				&netcg->whitelists[NETCG_LISTEN_RANGES],
+				MIN_PORT_VALUE, MAX_PORT_VALUE)) {
+			free_net_cgroup(netcg);
+			/* if any of these cause an error, return ENOMEM */
+			return ERR_PTR(-ENOMEM);
+		}
+	} else {
+		/* if not root, then, inherit ranges from parent */
+		if (alloc_copy_net_ranges(
+				&netcg->whitelists[NETCG_BIND_RANGES],
+				MIN_PORT_VALUE, MAX_PORT_VALUE,
+				&parent_netcg->whitelists[NETCG_BIND_RANGES]) ||
+		    alloc_copy_net_ranges(
+				&netcg->whitelists[NETCG_LISTEN_RANGES],
+				MIN_PORT_VALUE, MAX_PORT_VALUE,
+				&parent_netcg->whitelists[NETCG_LISTEN_RANGES])) {
+			free_net_cgroup(netcg);
+			/* if any of these cause an error, return ENOMEM */
+			return ERR_PTR(-ENOMEM);
+		}
+	}
+
 	return &netcg->css;
 }
 
@@ -51,7 +164,235 @@ static void cgrp_css_free(struct cgroup_subsys_state *css)
 	free_net_cgroup(css_to_net_cgroup(css));
 }
 
+static bool value_in_range(struct net_range_types *r, u16 val)
+{
+	int i;
+	struct net_ranges *ranges;
+
+	ranges = rcu_dereference(r->ranges);
+	for (i = 0; i < ranges->num_entries; i++) {
+		if (val >= ranges->range[i].min_value &&
+		    val <= ranges->range[i].max_value)
+			return true;
+	}
+
+	return false;
+}
+
+static bool net_cgroup_value_allowed(u16 value, int type)
+{
+	struct net_cgroup *netcg;
+	bool retval;
+
+	rcu_read_lock();
+	netcg = task_to_net_cgroup(current);
+	retval = value_in_range(&netcg->whitelists[type], value);
+	rcu_read_unlock();
+	return retval;
+}
+
+bool net_cgroup_bind_allowed(u16 port)
+{
+	return net_cgroup_value_allowed(port, NETCG_BIND_RANGES);
+}
+EXPORT_SYMBOL_GPL(net_cgroup_bind_allowed);
+
+bool net_cgroup_listen_allowed(u16 port)
+{
+	return net_cgroup_value_allowed(port, NETCG_LISTEN_RANGES);
+}
+EXPORT_SYMBOL_GPL(net_cgroup_listen_allowed);
+
+/* Returns true if the range r is a subset of at least one of the ranges in
+ * rs, and returns false otherwise.
+ */
+static bool range_in_ranges(struct net_range *r, struct net_ranges *rs)
+{
+	int ri;
+
+	for (ri = 0; ri < rs->num_entries; ri++)
+		if (r->min_value >= rs->range[ri].min_value &&
+		    r->max_value <= rs->range[ri].max_value)
+			return true;
+
+	return false;
+}
+
+/* Returns true if all the ranges in rs1 are subsets of at least one of the
+ * ranges in rs2, ans returns false otherwise.
+ */
+static bool ranges_in_ranges(struct net_ranges *rs1, struct net_ranges *rs2)
+{
+	int ri;
+
+	for (ri = 0; ri < rs1->num_entries; ri++)
+		if (!range_in_ranges(&rs1->range[ri], rs2))
+			return false;
+
+	return true;
+}
+
+static ssize_t update_ranges(struct net_cgroup *netcg, int type,
+			     const char *bp)
+{
+	unsigned int a, b;
+	int curr_index = 0;
+	ssize_t retval = 0;
+	struct net_ranges *ranges, *new, *old, *parent_ranges, *child_ranges;
+	struct cgroup_subsys_state *child_pos;
+	struct net_cgroup *child_netcg;
+
+	ranges = alloc_net_ranges(MAX_ENTRIES);
+	if (!ranges)
+		return -ENOMEM;
+
+	while (*bp != '\0' && *bp != '\n' && curr_index < MAX_ENTRIES) {
+		if (!isdigit(*bp)) {
+			retval = -EINVAL;
+			goto out;
+		}
+
+		a = simple_strtoul(bp, (char **)&bp, 10);
+		b = a;
+		if (*bp == '-') {
+			bp++;
+			if (!isdigit(*bp)) {
+				retval = -EINVAL;
+				goto out;
+			}
+			b = simple_strtoul(bp, (char **)&bp, 10);
+		}
+
+		if (!(a <= b)) {
+			retval = -EINVAL;
+			goto out;
+		}
+
+		if (a < netcg->whitelists[type].lower_limit ||
+		    b > netcg->whitelists[type].upper_limit) {
+			retval = -EINVAL;
+			goto out;
+		}
+
+		ranges->range[curr_index].min_value = a;
+		ranges->range[curr_index].max_value = b;
+
+		if (*bp == ',')
+			bp++;
+
+		curr_index++;
+	}
+
+	if (curr_index == MAX_ENTRIES) {
+		retval = -E2BIG;
+		goto out;
+	}
+
+	new = alloc_net_ranges(curr_index);
+	if (!new) {
+		retval = -ENOMEM;
+		goto out;
+	}
+
+	memcpy(new->range, ranges->range,
+	       sizeof(struct net_range) * curr_index);
+
+	/* make sure this cgroup is still a subset of its parent's */
+	parent_ranges = rcu_dereference(
+			net_cgroup_to_parent(netcg)->whitelists[type].ranges);
+	if (!ranges_in_ranges(new, parent_ranges)) {
+		retval = -EINVAL;
+		goto out;
+	}
+
+	/* make sure children's ranges are still subsets of this cgroup's */
+	css_for_each_child(child_pos, &netcg->css) {
+		child_netcg = css_to_net_cgroup(child_pos);
+		child_ranges = rcu_dereference(
+				child_netcg->whitelists[type].ranges);
+		if (!ranges_in_ranges(child_ranges, new)) {
+			retval = -EINVAL;
+			goto out;
+		}
+	}
+
+	mutex_lock(&netcg->range_lock);
+	old = rcu_dereference_protected(netcg->whitelists[type].ranges, 1);
+	rcu_assign_pointer(netcg->whitelists[type].ranges, new);
+	mutex_unlock(&netcg->range_lock);
+
+	kfree_rcu(old, rcu);
+out:
+	kfree(ranges);
+	return retval;
+}
+
+static ssize_t net_write_ranges(struct kernfs_open_file *of,
+				char *buf, size_t nbytes, loff_t off)
+{
+	struct net_cgroup *netcg = css_to_net_cgroup(of_css(of));
+	int type = of_cft(of)->private;
+
+	return update_ranges(netcg, type, buf) ?: nbytes;
+}
+
+static void net_seq_printf_list(struct seq_file *s, struct net_range_types *r)
+{
+	int i;
+	struct net_ranges *ranges;
+
+	ranges = rcu_dereference(r->ranges);
+
+	for (i = 0; i < ranges->num_entries; i++) {
+		if (i)
+			seq_puts(s, ",");
+		seq_printf(s, "%d-%d", ranges->range[i].min_value,
+			   ranges->range[i].max_value);
+	}
+	seq_puts(s, "\n");
+}
+
+static int net_read_ranges(struct seq_file *sf, void *v)
+{
+	struct net_cgroup *netcg = css_to_net_cgroup(seq_css(sf));
+	int type = seq_cft(sf)->private;
+
+	rcu_read_lock();
+	net_seq_printf_list(sf, &netcg->whitelists[type]);
+	rcu_read_unlock();
+
+	return 0;
+}
+
 static struct cftype ss_files[] = {
+	{
+		.name		= "listen_port_ranges",
+		.flags		= CFTYPE_ONLY_ON_ROOT,
+		.seq_show	= net_read_ranges,
+		.private	= NETCG_LISTEN_RANGES,
+	},
+	{
+		.name		= "listen_port_ranges",
+		.flags		= CFTYPE_NOT_ON_ROOT,
+		.seq_show	= net_read_ranges,
+		.write		= net_write_ranges,
+		.private	= NETCG_LISTEN_RANGES,
+		.max_write_len	= MAX_WRITE_SIZE,
+	},
+	{
+		.name		= "bind_port_ranges",
+		.flags		= CFTYPE_ONLY_ON_ROOT,
+		.seq_show	= net_read_ranges,
+		.private	= NETCG_BIND_RANGES,
+	},
+	{
+		.name		= "bind_port_ranges",
+		.flags		= CFTYPE_NOT_ON_ROOT,
+		.seq_show	= net_read_ranges,
+		.write		= net_write_ranges,
+		.private	= NETCG_BIND_RANGES,
+		.max_write_len	= MAX_WRITE_SIZE,
+	},
 	{ }	/* terminate */
 };
 
diff --git a/net/ipv4/af_inet.c b/net/ipv4/af_inet.c
index 55513e6..c3160ad 100644
--- a/net/ipv4/af_inet.c
+++ b/net/ipv4/af_inet.c
@@ -120,6 +120,7 @@
 #include <linux/mroute.h>
 #endif
 #include <net/l3mdev.h>
+#include <net/net_cgroup.h>
 
 
 /* The inetsw table contains everything that inet_create needs to
@@ -497,6 +498,13 @@ int inet_bind(struct socket *sock, struct sockaddr *uaddr, int addr_len)
 		inet->inet_saddr = 0;  /* Use device */
 
 	/* Make sure we are allowed to bind here. */
+	if (!net_cgroup_bind_allowed(snum)) {
+		inet->inet_saddr = 0;
+		inet->inet_rcv_saddr = 0;
+		err = -EACCES;
+		goto out_release_sock;
+	}
+
 	if ((snum || !inet->bind_address_no_port) &&
 	    sk->sk_prot->get_port(sk, snum)) {
 		inet->inet_saddr = inet->inet_rcv_saddr = 0;
diff --git a/net/ipv4/inet_connection_sock.c b/net/ipv4/inet_connection_sock.c
index 61a9dee..4fc3bd1 100644
--- a/net/ipv4/inet_connection_sock.c
+++ b/net/ipv4/inet_connection_sock.c
@@ -25,6 +25,7 @@
 #include <net/xfrm.h>
 #include <net/tcp.h>
 #include <net/sock_reuseport.h>
+#include <net/net_cgroup.h>
 
 #ifdef INET_CSK_DEBUG
 const char inet_csk_timer_bug_msg[] = "inet_csk BUG: unknown timer value\n";
@@ -743,6 +744,11 @@ int inet_csk_listen_start(struct sock *sk, int backlog)
 	sk->sk_ack_backlog = 0;
 	inet_csk_delack_init(sk);
 
+	if (!net_cgroup_listen_allowed(inet->inet_num)) {
+		err = -EACCES;
+		goto out;
+	}
+
 	/* There is race window here: we announce ourselves listening,
 	 * but this transition is still not validated by get_port().
 	 * It is OK, because this socket enters to hash table only
@@ -759,6 +765,7 @@ int inet_csk_listen_start(struct sock *sk, int backlog)
 			return 0;
 	}
 
+out:
 	sk->sk_state = TCP_CLOSE;
 	return err;
 }
diff --git a/net/ipv6/af_inet6.c b/net/ipv6/af_inet6.c
index 2076c21..9328240 100644
--- a/net/ipv6/af_inet6.c
+++ b/net/ipv6/af_inet6.c
@@ -65,6 +65,7 @@
 #include <linux/mroute6.h>
 
 #include "ip6_offload.h"
+#include <net/net_cgroup.h>
 
 MODULE_AUTHOR("Cast of dozens");
 MODULE_DESCRIPTION("IPv6 protocol stack for Linux");
@@ -379,6 +380,12 @@ int inet6_bind(struct socket *sock, struct sockaddr *uaddr, int addr_len)
 		np->saddr = addr->sin6_addr;
 
 	/* Make sure we are allowed to bind here. */
+	if (!net_cgroup_bind_allowed(snum)) {
+		inet_reset_saddr(sk);
+		err = -EACCES;
+		goto out;
+	}
+
 	if ((snum || !inet->bind_address_no_port) &&
 	    sk->sk_prot->get_port(sk, snum)) {
 		inet_reset_saddr(sk);
-- 
2.8.0.rc3.226.g39d4020

^ permalink raw reply related	[flat|nested] 25+ messages in thread

* [PATCH 2/5] net: add bind/listen ranges to net cgroup
@ 2016-08-11  0:53   ` Anoop Naravaram
  0 siblings, 0 replies; 25+ messages in thread
From: Anoop Naravaram @ 2016-08-11  0:53 UTC (permalink / raw)
  To: corbet, tj, lizefan, hannes, davem, kuznet, jmorris, yoshfuji,
	kaber, linux-doc, cgroups, netdev
  Cc: edumazet, maheshb, weiwan, tom, Anoop Naravaram

bind port ranges
----------------
This property controls which ports the processes in a cgroup are allowed
to bind to. If a process in a cgroup tries to bind a socket to a port
that is not within the range(s) permitted by the cgroup, it will receive an
EACCES error.

From userspace, you can get or set the bind port ranges by accessing the
'net.bind_port_ranges' file. To set the ranges of a cgroup, write the
comma-separated ranges to the file, where each range could be either a
pair of ports separated by a hyphen (-), or just an individual port. For
example, say you want to allow all the processes in a cgroup to be allowed
to bind to ports 100 through 200 (inclusive), 300 through 320 (inclusive)
and 350. Then you can write the string "100-200,300-320,350" to the
'net.bind_port_ranges' file. When reading the file, any individual ports
will be read as a "start-end" range where the start and end are equal.
The example above would be read as "100-200,300-320,350-350".

The controller imposes the invariant that the ranges of any cgroup must
be a subset (or equal set) of the ranges of its parents (i.e., processes
in a cgroup cannot be allowed to bind to any port that is not allowed
by the parent cgroup). This constraint allows you to ensure that not
only do the processes in a cgroup follow the bind range, but so do the
processes in any of the cgroup's descendants. The way this is enforced
is because of two things: 1) when a cgroup is initialized, its ranges
are inherited from its parent, and 2) when attempting to set the ranges of
a cgroup, the kernel ensures that the condition is true for the current
cgroup and all its children, or otherwise fails to change the ranges
with error EINVAL.

listen port ranges
------------------
This property controls which ports the processes in a cgroup are allowed
to listen on. If a process in a cgroup tries to listen using a socket
bound to a port that is not within the range(s) permitted by the cgroup,
it will receive an EACCES error.

Configuring this property works the same way as with bind port ranges,
except using the file 'net.listen_port_ranges' instead of
'net.bind_port_ranges'. The range subset invariant is imposed
independently for bind and listen port ranges. For now the kernel does
not enforce that the listen range must be a subset of the bind range.

Tested: Used a python unittest to set the range and try
binding/listening to ports inside and outside the range, and ensure
that an error occurred only when it should. Also, ensures that an error
occurs when trying to violate the subset condition.

Signed-off-by: Anoop Naravaram <anaravaram@google.com>
---
 Documentation/cgroup-v1/net.txt |  46 ++++++
 include/net/net_cgroup.h        |  41 +++++
 net/core/net_cgroup.c           | 341 ++++++++++++++++++++++++++++++++++++++++
 net/ipv4/af_inet.c              |   8 +
 net/ipv4/inet_connection_sock.c |   7 +
 net/ipv6/af_inet6.c             |   7 +
 6 files changed, 450 insertions(+)

diff --git a/Documentation/cgroup-v1/net.txt b/Documentation/cgroup-v1/net.txt
index 580c214..8c50c61 100644
--- a/Documentation/cgroup-v1/net.txt
+++ b/Documentation/cgroup-v1/net.txt
@@ -7,3 +7,49 @@ properties for each process group:
 * listen port ranges
 * dscp ranges
 * udp port usage and limit
+
+bind port ranges
+----------------
+This property controls which ports the processes in a cgroup are allowed
+to bind to. If a process in a cgroup tries to bind a socket to a port
+that is not within the range(s) permitted by the cgroup, it will receive an
+EACCES error.
+
+This property is exposed to userspace through the 'net.bind_port_ranges' file,
+as ranges of ports that the processes can bind to (as described in the HOW TO
+INTERACT WITH RANGES FILES section).
+
+listen port ranges
+------------------
+This property controls which ports the processes in a cgroup are allowed
+to listen on. If a process in a cgroup tries to listen using a socket
+bound to a port that is not within the range(s) permitted by the cgroup,
+it will receive an EACCES error.
+
+This property is exposed to userspace through the 'net.listen_port_ranges' file,
+as ranges of ports that the processes can listen on (as described in the HOW TO
+INTERACT WITH RANGES FILES section).
+
+HOW TO INTERACT WITH RANGES FILES
+---------------------------------
+Some cgroup properties can be expressed as ranges of allowed integers. From
+userspace, you can get or set them by accessing the cgroup file corresponding to
+the property you want to interact with. To set the ranges, write a list of
+comma-separated ranges to the file, where each range could be either a pair of
+integers separated by a hyphen (-), or just an individual integer. For example,
+say you want a cgroup to allow the integers 100 through 200 (inclusive), 300
+through 320 (inclusive) and 350. Then you can write the string
+"100-200,300-320,350" to the file. When reading the file, any individual
+integers will be read as a "start-end" range where the start and end are equal.
+The example above would be read as "100-200,300-320,350-350".
+
+The controller imposes the invariant that the ranges allowed by any cgroup must
+be a subset (or equal set) of the ranges allowed by its parent (i.e., a cgroup
+does not allow any integers not allowed by its parent cgroup). This constraint
+allows you to ensure that not only are the processes in any given cgroup
+contrained by its ranges, but so are the processes in any of the cgroup's
+descendants. The way this is enforced is by two things: 1) when a cgroup is
+initialized, its ranges are inherited from its parent, and 2) when attempting to
+set the ranges of a cgroup, the kernel ensures that the invariant is true for
+the current cgroup and all its children, or otherwise fails to change the ranges
+with error EINVAL.
diff --git a/include/net/net_cgroup.h b/include/net/net_cgroup.h
index 8e98803..6ee79d5 100644
--- a/include/net/net_cgroup.h
+++ b/include/net/net_cgroup.h
@@ -16,12 +16,53 @@
 #define _NET_CGROUP_H
 
 #include <linux/cgroup.h>
+#include <linux/types.h>
 
 #ifdef CONFIG_CGROUP_NET
+/* range type */
+enum {
+	NETCG_LISTEN_RANGES,
+	NETCG_BIND_RANGES,
+	NETCG_NUM_RANGE_TYPES
+};
+
+struct net_range {
+	u16 min_value;
+	u16 max_value;
+};
+
+struct net_ranges {
+	int			num_entries;
+	struct rcu_head		rcu;
+	struct net_range	range[0];
+};
+
+struct net_range_types {
+	struct net_ranges __rcu	*ranges;
+	u16			lower_limit;
+	u16			upper_limit;
+};
 
 struct net_cgroup {
 	struct cgroup_subsys_state	css;
+
+	/* these fields are required for bind/listen port ranges */
+	struct mutex			range_lock;
+	struct net_range_types		whitelists[NETCG_NUM_RANGE_TYPES];
 };
 
+bool net_cgroup_bind_allowed(u16 port);
+bool net_cgroup_listen_allowed(u16 port);
+
+#else /* !CONFIG_CGROUP_NET */
+static inline bool net_cgroup_bind_allowed(u16 port)
+{
+	return true;
+}
+static inline bool net_cgroup_listen_allowed(u16 port)
+{
+	return true;
+}
+
 #endif /* CONFIG_CGROUP_NET */
 #endif  /* _NET_CGROUP_H */
diff --git a/net/core/net_cgroup.c b/net/core/net_cgroup.c
index 3a46960..7e69ad5 100644
--- a/net/core/net_cgroup.c
+++ b/net/core/net_cgroup.c
@@ -12,8 +12,19 @@
  */
 
 #include <linux/slab.h>
+#include <linux/ctype.h>
 #include <net/net_cgroup.h>
 
+#define BYTES_PER_ENTRY		sizeof(struct net_range)
+#define MAX_WRITE_SIZE		4096
+
+#define MIN_PORT_VALUE		0
+#define MAX_PORT_VALUE		65535
+
+/* Deriving MAX_ENTRIES from MAX_WRITE_SIZE as a rough estimate */
+#define MAX_ENTRIES ((MAX_WRITE_SIZE - offsetof(struct net_ranges, range)) /   \
+		     BYTES_PER_ENTRY)
+
 static struct net_cgroup *css_to_net_cgroup(struct cgroup_subsys_state *css)
 {
 	return css ? container_of(css, struct net_cgroup, css) : NULL;
@@ -29,8 +40,78 @@ static struct net_cgroup *net_cgroup_to_parent(struct net_cgroup *netcg)
 	return css_to_net_cgroup(netcg->css.parent);
 }
 
+static struct net_ranges *alloc_net_ranges(int num_entries)
+{
+	struct net_ranges *ranges;
+
+	ranges = kmalloc(offsetof(struct net_ranges, range[num_entries]),
+			 GFP_KERNEL);
+	if (!ranges)
+		return NULL;
+
+	ranges->num_entries = num_entries;
+
+	return ranges;
+}
+
+static int alloc_init_net_ranges(struct net_range_types *r, int min_value,
+				 int max_value)
+{
+	struct net_ranges *ranges;
+
+	ranges = alloc_net_ranges(1);
+	if (!ranges)
+		return -ENOMEM;
+
+	ranges->range[0].min_value = min_value;
+	ranges->range[0].max_value = max_value;
+	r->lower_limit = min_value;
+	r->upper_limit = max_value;
+	rcu_assign_pointer(r->ranges, ranges);
+
+	return 0;
+}
+
+static int alloc_copy_net_ranges(struct net_range_types *r,
+				 int min_value,
+				 int max_value,
+				 struct net_range_types *parent_rt)
+{
+	struct net_ranges *ranges;
+	struct net_ranges *parent_ranges;
+	int i; /* loop counter */
+
+	parent_ranges = rcu_dereference(parent_rt->ranges);
+	ranges = alloc_net_ranges(parent_ranges->num_entries);
+	if (!ranges)
+		return -ENOMEM;
+	for (i = 0; i < parent_ranges->num_entries; i++) {
+		ranges->range[i].min_value = parent_ranges->range[i].min_value;
+		ranges->range[i].max_value = parent_ranges->range[i].max_value;
+	}
+
+	r->lower_limit = min_value;
+	r->upper_limit = max_value;
+	rcu_assign_pointer(r->ranges, ranges);
+
+	return 0;
+}
+
 static void free_net_cgroup(struct net_cgroup *netcg)
 {
+	int i;
+
+	mutex_lock(&netcg->range_lock);
+	for (i = 0; i < NETCG_NUM_RANGE_TYPES; i++) {
+		struct net_ranges *range =
+		    rcu_dereference_protected(netcg->whitelists[i].ranges,
+					      1);
+
+		if (range)
+			kfree_rcu(range, rcu);
+	}
+	mutex_unlock(&netcg->range_lock);
+
 	kfree(netcg);
 }
 
@@ -38,11 +119,43 @@ static struct cgroup_subsys_state *
 cgrp_css_alloc(struct cgroup_subsys_state *parent_css)
 {
 	struct net_cgroup *netcg;
+	struct net_cgroup *parent_netcg = css_to_net_cgroup(parent_css);
 
 	netcg = kzalloc(sizeof(*netcg), GFP_KERNEL);
 	if (!netcg)
 		return ERR_PTR(-ENOMEM);
 
+	mutex_init(&netcg->range_lock);
+
+	/* allocate the listen and bind range whitelists */
+	if (!parent_netcg) {
+		/* if root, then init ranges with full range */
+		if (alloc_init_net_ranges(
+				&netcg->whitelists[NETCG_BIND_RANGES],
+				MIN_PORT_VALUE, MAX_PORT_VALUE) ||
+		    alloc_init_net_ranges(
+				&netcg->whitelists[NETCG_LISTEN_RANGES],
+				MIN_PORT_VALUE, MAX_PORT_VALUE)) {
+			free_net_cgroup(netcg);
+			/* if any of these cause an error, return ENOMEM */
+			return ERR_PTR(-ENOMEM);
+		}
+	} else {
+		/* if not root, then, inherit ranges from parent */
+		if (alloc_copy_net_ranges(
+				&netcg->whitelists[NETCG_BIND_RANGES],
+				MIN_PORT_VALUE, MAX_PORT_VALUE,
+				&parent_netcg->whitelists[NETCG_BIND_RANGES]) ||
+		    alloc_copy_net_ranges(
+				&netcg->whitelists[NETCG_LISTEN_RANGES],
+				MIN_PORT_VALUE, MAX_PORT_VALUE,
+				&parent_netcg->whitelists[NETCG_LISTEN_RANGES])) {
+			free_net_cgroup(netcg);
+			/* if any of these cause an error, return ENOMEM */
+			return ERR_PTR(-ENOMEM);
+		}
+	}
+
 	return &netcg->css;
 }
 
@@ -51,7 +164,235 @@ static void cgrp_css_free(struct cgroup_subsys_state *css)
 	free_net_cgroup(css_to_net_cgroup(css));
 }
 
+static bool value_in_range(struct net_range_types *r, u16 val)
+{
+	int i;
+	struct net_ranges *ranges;
+
+	ranges = rcu_dereference(r->ranges);
+	for (i = 0; i < ranges->num_entries; i++) {
+		if (val >= ranges->range[i].min_value &&
+		    val <= ranges->range[i].max_value)
+			return true;
+	}
+
+	return false;
+}
+
+static bool net_cgroup_value_allowed(u16 value, int type)
+{
+	struct net_cgroup *netcg;
+	bool retval;
+
+	rcu_read_lock();
+	netcg = task_to_net_cgroup(current);
+	retval = value_in_range(&netcg->whitelists[type], value);
+	rcu_read_unlock();
+	return retval;
+}
+
+bool net_cgroup_bind_allowed(u16 port)
+{
+	return net_cgroup_value_allowed(port, NETCG_BIND_RANGES);
+}
+EXPORT_SYMBOL_GPL(net_cgroup_bind_allowed);
+
+bool net_cgroup_listen_allowed(u16 port)
+{
+	return net_cgroup_value_allowed(port, NETCG_LISTEN_RANGES);
+}
+EXPORT_SYMBOL_GPL(net_cgroup_listen_allowed);
+
+/* Returns true if the range r is a subset of at least one of the ranges in
+ * rs, and returns false otherwise.
+ */
+static bool range_in_ranges(struct net_range *r, struct net_ranges *rs)
+{
+	int ri;
+
+	for (ri = 0; ri < rs->num_entries; ri++)
+		if (r->min_value >= rs->range[ri].min_value &&
+		    r->max_value <= rs->range[ri].max_value)
+			return true;
+
+	return false;
+}
+
+/* Returns true if all the ranges in rs1 are subsets of at least one of the
+ * ranges in rs2, ans returns false otherwise.
+ */
+static bool ranges_in_ranges(struct net_ranges *rs1, struct net_ranges *rs2)
+{
+	int ri;
+
+	for (ri = 0; ri < rs1->num_entries; ri++)
+		if (!range_in_ranges(&rs1->range[ri], rs2))
+			return false;
+
+	return true;
+}
+
+static ssize_t update_ranges(struct net_cgroup *netcg, int type,
+			     const char *bp)
+{
+	unsigned int a, b;
+	int curr_index = 0;
+	ssize_t retval = 0;
+	struct net_ranges *ranges, *new, *old, *parent_ranges, *child_ranges;
+	struct cgroup_subsys_state *child_pos;
+	struct net_cgroup *child_netcg;
+
+	ranges = alloc_net_ranges(MAX_ENTRIES);
+	if (!ranges)
+		return -ENOMEM;
+
+	while (*bp != '\0' && *bp != '\n' && curr_index < MAX_ENTRIES) {
+		if (!isdigit(*bp)) {
+			retval = -EINVAL;
+			goto out;
+		}
+
+		a = simple_strtoul(bp, (char **)&bp, 10);
+		b = a;
+		if (*bp == '-') {
+			bp++;
+			if (!isdigit(*bp)) {
+				retval = -EINVAL;
+				goto out;
+			}
+			b = simple_strtoul(bp, (char **)&bp, 10);
+		}
+
+		if (!(a <= b)) {
+			retval = -EINVAL;
+			goto out;
+		}
+
+		if (a < netcg->whitelists[type].lower_limit ||
+		    b > netcg->whitelists[type].upper_limit) {
+			retval = -EINVAL;
+			goto out;
+		}
+
+		ranges->range[curr_index].min_value = a;
+		ranges->range[curr_index].max_value = b;
+
+		if (*bp == ',')
+			bp++;
+
+		curr_index++;
+	}
+
+	if (curr_index == MAX_ENTRIES) {
+		retval = -E2BIG;
+		goto out;
+	}
+
+	new = alloc_net_ranges(curr_index);
+	if (!new) {
+		retval = -ENOMEM;
+		goto out;
+	}
+
+	memcpy(new->range, ranges->range,
+	       sizeof(struct net_range) * curr_index);
+
+	/* make sure this cgroup is still a subset of its parent's */
+	parent_ranges = rcu_dereference(
+			net_cgroup_to_parent(netcg)->whitelists[type].ranges);
+	if (!ranges_in_ranges(new, parent_ranges)) {
+		retval = -EINVAL;
+		goto out;
+	}
+
+	/* make sure children's ranges are still subsets of this cgroup's */
+	css_for_each_child(child_pos, &netcg->css) {
+		child_netcg = css_to_net_cgroup(child_pos);
+		child_ranges = rcu_dereference(
+				child_netcg->whitelists[type].ranges);
+		if (!ranges_in_ranges(child_ranges, new)) {
+			retval = -EINVAL;
+			goto out;
+		}
+	}
+
+	mutex_lock(&netcg->range_lock);
+	old = rcu_dereference_protected(netcg->whitelists[type].ranges, 1);
+	rcu_assign_pointer(netcg->whitelists[type].ranges, new);
+	mutex_unlock(&netcg->range_lock);
+
+	kfree_rcu(old, rcu);
+out:
+	kfree(ranges);
+	return retval;
+}
+
+static ssize_t net_write_ranges(struct kernfs_open_file *of,
+				char *buf, size_t nbytes, loff_t off)
+{
+	struct net_cgroup *netcg = css_to_net_cgroup(of_css(of));
+	int type = of_cft(of)->private;
+
+	return update_ranges(netcg, type, buf) ?: nbytes;
+}
+
+static void net_seq_printf_list(struct seq_file *s, struct net_range_types *r)
+{
+	int i;
+	struct net_ranges *ranges;
+
+	ranges = rcu_dereference(r->ranges);
+
+	for (i = 0; i < ranges->num_entries; i++) {
+		if (i)
+			seq_puts(s, ",");
+		seq_printf(s, "%d-%d", ranges->range[i].min_value,
+			   ranges->range[i].max_value);
+	}
+	seq_puts(s, "\n");
+}
+
+static int net_read_ranges(struct seq_file *sf, void *v)
+{
+	struct net_cgroup *netcg = css_to_net_cgroup(seq_css(sf));
+	int type = seq_cft(sf)->private;
+
+	rcu_read_lock();
+	net_seq_printf_list(sf, &netcg->whitelists[type]);
+	rcu_read_unlock();
+
+	return 0;
+}
+
 static struct cftype ss_files[] = {
+	{
+		.name		= "listen_port_ranges",
+		.flags		= CFTYPE_ONLY_ON_ROOT,
+		.seq_show	= net_read_ranges,
+		.private	= NETCG_LISTEN_RANGES,
+	},
+	{
+		.name		= "listen_port_ranges",
+		.flags		= CFTYPE_NOT_ON_ROOT,
+		.seq_show	= net_read_ranges,
+		.write		= net_write_ranges,
+		.private	= NETCG_LISTEN_RANGES,
+		.max_write_len	= MAX_WRITE_SIZE,
+	},
+	{
+		.name		= "bind_port_ranges",
+		.flags		= CFTYPE_ONLY_ON_ROOT,
+		.seq_show	= net_read_ranges,
+		.private	= NETCG_BIND_RANGES,
+	},
+	{
+		.name		= "bind_port_ranges",
+		.flags		= CFTYPE_NOT_ON_ROOT,
+		.seq_show	= net_read_ranges,
+		.write		= net_write_ranges,
+		.private	= NETCG_BIND_RANGES,
+		.max_write_len	= MAX_WRITE_SIZE,
+	},
 	{ }	/* terminate */
 };
 
diff --git a/net/ipv4/af_inet.c b/net/ipv4/af_inet.c
index 55513e6..c3160ad 100644
--- a/net/ipv4/af_inet.c
+++ b/net/ipv4/af_inet.c
@@ -120,6 +120,7 @@
 #include <linux/mroute.h>
 #endif
 #include <net/l3mdev.h>
+#include <net/net_cgroup.h>
 
 
 /* The inetsw table contains everything that inet_create needs to
@@ -497,6 +498,13 @@ int inet_bind(struct socket *sock, struct sockaddr *uaddr, int addr_len)
 		inet->inet_saddr = 0;  /* Use device */
 
 	/* Make sure we are allowed to bind here. */
+	if (!net_cgroup_bind_allowed(snum)) {
+		inet->inet_saddr = 0;
+		inet->inet_rcv_saddr = 0;
+		err = -EACCES;
+		goto out_release_sock;
+	}
+
 	if ((snum || !inet->bind_address_no_port) &&
 	    sk->sk_prot->get_port(sk, snum)) {
 		inet->inet_saddr = inet->inet_rcv_saddr = 0;
diff --git a/net/ipv4/inet_connection_sock.c b/net/ipv4/inet_connection_sock.c
index 61a9dee..4fc3bd1 100644
--- a/net/ipv4/inet_connection_sock.c
+++ b/net/ipv4/inet_connection_sock.c
@@ -25,6 +25,7 @@
 #include <net/xfrm.h>
 #include <net/tcp.h>
 #include <net/sock_reuseport.h>
+#include <net/net_cgroup.h>
 
 #ifdef INET_CSK_DEBUG
 const char inet_csk_timer_bug_msg[] = "inet_csk BUG: unknown timer value\n";
@@ -743,6 +744,11 @@ int inet_csk_listen_start(struct sock *sk, int backlog)
 	sk->sk_ack_backlog = 0;
 	inet_csk_delack_init(sk);
 
+	if (!net_cgroup_listen_allowed(inet->inet_num)) {
+		err = -EACCES;
+		goto out;
+	}
+
 	/* There is race window here: we announce ourselves listening,
 	 * but this transition is still not validated by get_port().
 	 * It is OK, because this socket enters to hash table only
@@ -759,6 +765,7 @@ int inet_csk_listen_start(struct sock *sk, int backlog)
 			return 0;
 	}
 
+out:
 	sk->sk_state = TCP_CLOSE;
 	return err;
 }
diff --git a/net/ipv6/af_inet6.c b/net/ipv6/af_inet6.c
index 2076c21..9328240 100644
--- a/net/ipv6/af_inet6.c
+++ b/net/ipv6/af_inet6.c
@@ -65,6 +65,7 @@
 #include <linux/mroute6.h>
 
 #include "ip6_offload.h"
+#include <net/net_cgroup.h>
 
 MODULE_AUTHOR("Cast of dozens");
 MODULE_DESCRIPTION("IPv6 protocol stack for Linux");
@@ -379,6 +380,12 @@ int inet6_bind(struct socket *sock, struct sockaddr *uaddr, int addr_len)
 		np->saddr = addr->sin6_addr;
 
 	/* Make sure we are allowed to bind here. */
+	if (!net_cgroup_bind_allowed(snum)) {
+		inet_reset_saddr(sk);
+		err = -EACCES;
+		goto out;
+	}
+
 	if ((snum || !inet->bind_address_no_port) &&
 	    sk->sk_prot->get_port(sk, snum)) {
 		inet_reset_saddr(sk);
-- 
2.8.0.rc3.226.g39d4020


^ permalink raw reply related	[flat|nested] 25+ messages in thread

* [PATCH 3/5] net: add udp limit to net cgroup
       [not found] ` <1470876798-4024-1-git-send-email-anaravaram-hpIqsD4AKlfQT0dZR+AlfA@public.gmane.org>
@ 2016-08-11  0:53   ` Anoop Naravaram
       [not found]     ` <1470876798-4024-4-git-send-email-anaravaram-hpIqsD4AKlfQT0dZR+AlfA@public.gmane.org>
  2016-08-23  7:37   ` [PATCH 0/5] Networking cgroup controller Parav Pandit
  1 sibling, 1 reply; 25+ messages in thread
From: Anoop Naravaram @ 2016-08-11  0:53 UTC (permalink / raw)
  To: corbet-T1hC0tSOHrs, tj-DgEjT+Ai2ygdnm+yROfE0A,
	lizefan-hv44wF8Li93QT0dZR+AlfA, hannes-druUgvl0LCNAfugRpC6u6w,
	davem-fT/PcQaiUtIeIZ0/mPfg9Q, kuznet-v/Mj1YrvjDBInbfyfbPRSQ,
	jmorris-gx6/JNMH7DfYtjvyW6yDsg, yoshfuji-VfPWfsRibaP+Ru+s062T9g,
	kaber-dcUjhNyLwpNeoWH0uzbU5w, linux-doc-u79uwXL29TY76Z2rM5mHXA,
	cgroups-u79uwXL29TY76Z2rM5mHXA, netdev-u79uwXL29TY76Z2rM5mHXA
  Cc: edumazet-hpIqsD4AKlfQT0dZR+AlfA, maheshb-hpIqsD4AKlfQT0dZR+AlfA,
	weiwan-hpIqsD4AKlfQT0dZR+AlfA, tom-BjP2VixgY4xUbtYUoyoikg,
	Anoop Naravaram

udp port limit
--------------
This property controls the limit of udp ports that can be used by the
processes in a cgroup. The controller manages udp statistics (usage,
limit, etc) for each cgroup. Every cgroup also keeps track of the udp
ports acquired by its descendants. If a process tries to acquire a port
when its cgroup has already reached its limit, it will fail with error
EACCES. It will also fail if one of the cgroup's ancestors has reached its
limit. There are 5 files exposed to userspace to configure this property:

* 'net.udp_usage': Reading this file gives the number of udp ports used by
processes in this cgroup and all its descendants.
* 'net.udp_limit': Writing this file sets the total number of udp ports
that can be used by processes in this cgroup and all
its descendants. This file can also be read.
* 'net.udp_maxusage': Reading this file gives the highest value of
net.udp_usage that has been seen for this cgroup.
* 'net.udp_failcnt': Reading this file gives the number of times a
process in this cgroup or one of its descendants has attempted to acquire
a udp port but failed because the limit of this cgroup was reached.
* 'net.udp_underflowcnt': Reading this file gives the number of times a
process in this cgroup or one of its descendants released a udp port when
the usage value of this cgroup was 0.

When a new cgroup is created, its udp limit is copied from its parent.

Tested: Set the udp limit, then used python to use several udp ports,
ensuring that it is successful up until the limit, after which there
should be an error. Also tried different limits at different levels of the
hierarchy.

Signed-off-by: Anoop Naravaram <anaravaram-hpIqsD4AKlfQT0dZR+AlfA@public.gmane.org>
---
 Documentation/cgroup-v1/net.txt |  26 ++++
 include/net/net_cgroup.h        |  29 +++++
 net/core/net_cgroup.c           | 273 ++++++++++++++++++++++++++++++++++++++++
 net/ipv4/udp.c                  |   8 ++
 4 files changed, 336 insertions(+)

diff --git a/Documentation/cgroup-v1/net.txt b/Documentation/cgroup-v1/net.txt
index 8c50c61..a14fd1c 100644
--- a/Documentation/cgroup-v1/net.txt
+++ b/Documentation/cgroup-v1/net.txt
@@ -30,6 +30,32 @@ This property is exposed to userspace through the 'net.listen_port_ranges' file,
 as ranges of ports that the processes can listen on (as described in the HOW TO
 INTERACT WITH RANGES FILES section).
 
+udp port usage and limit
+------------------------
+This property controls the limit of udp ports that can be used by the
+processes in a cgroup. The controller manages udp statistics (usage, limit, etc)
+for each cgroup. Every cgroup also keeps track of the udp ports acquired by its
+descendants. If a process tries to acquire a port when its cgroup has
+already reached its limit, it will fail with error EACCES. It will also fail if
+one of the cgroup's ancestors has reached its limit. There are 5 files
+exposed to userspace to configure this property:
+
+* 'net.udp_usage': Reading this file gives the number of udp ports used by
+processes in this cgroup and all its descendants.
+* 'net.udp_limit': Writing this file sets the total number of udp ports
+that can be used by processes in this cgroup and all
+its descendants. This file can also be read.
+* 'net.udp_maxusage': Reading this file gives the highest value of
+net.udp_usage that has been seen for this cgroup.
+* 'net.udp_failcnt': Reading this file gives the number of times a
+process in this cgroup or one of its descendants has attempted to acquire a
+udp port but failed because the limit of this cgroup was reached.
+* 'net.udp_underflowcnt': Reading this file gives the number of times a
+process in this cgroup or one of its descendants released a udp port when the
+usage value of this cgroup was 0.
+
+When a new cgroup is created, its udp limit is copied from its parent.
+
 HOW TO INTERACT WITH RANGES FILES
 ---------------------------------
 Some cgroup properties can be expressed as ranges of allowed integers. From
diff --git a/include/net/net_cgroup.h b/include/net/net_cgroup.h
index 6ee79d5..25a9def 100644
--- a/include/net/net_cgroup.h
+++ b/include/net/net_cgroup.h
@@ -26,6 +26,16 @@ enum {
 	NETCG_NUM_RANGE_TYPES
 };
 
+/* udp statistic type */
+enum {
+	NETCG_LIMIT_UDP,
+	NETCG_USAGE_UDP,
+	NETCG_MAXUSAGE_UDP,
+	NETCG_FAILCNT_UDP,
+	NETCG_UNDERFLOWCNT_UDP,
+	NETCG_NUM_UDP_STATS
+};
+
 struct net_range {
 	u16 min_value;
 	u16 max_value;
@@ -43,9 +53,19 @@ struct net_range_types {
 	u16			upper_limit;
 };
 
+struct cgroup_udp_stats {
+	/* Use atomics to protect against multiple writers */
+	atomic64_t	udp_limitandusage; /* 32MSB => limit, 32LSB => usage */
+	atomic_t	udp_maxusage;
+	atomic_t	udp_failcnt;
+	atomic_t	udp_underflowcnt;
+};
+
 struct net_cgroup {
 	struct cgroup_subsys_state	css;
 
+	struct cgroup_udp_stats		udp_stats;
+
 	/* these fields are required for bind/listen port ranges */
 	struct mutex			range_lock;
 	struct net_range_types		whitelists[NETCG_NUM_RANGE_TYPES];
@@ -53,6 +73,8 @@ struct net_cgroup {
 
 bool net_cgroup_bind_allowed(u16 port);
 bool net_cgroup_listen_allowed(u16 port);
+bool net_cgroup_acquire_udp_port(void);
+void net_cgroup_release_udp_port(void);
 
 #else /* !CONFIG_CGROUP_NET */
 static inline bool net_cgroup_bind_allowed(u16 port)
@@ -63,6 +85,13 @@ static inline bool net_cgroup_listen_allowed(u16 port)
 {
 	return true;
 }
+static inline bool net_cgroup_acquire_udp_port(void)
+{
+	return true;
+}
+static inline void net_cgroup_release_udp_port(void)
+{
+}
 
 #endif /* CONFIG_CGROUP_NET */
 #endif  /* _NET_CGROUP_H */
diff --git a/net/core/net_cgroup.c b/net/core/net_cgroup.c
index 7e69ad5..2f58e13 100644
--- a/net/core/net_cgroup.c
+++ b/net/core/net_cgroup.c
@@ -25,6 +25,31 @@
 #define MAX_ENTRIES ((MAX_WRITE_SIZE - offsetof(struct net_ranges, range)) /   \
 		     BYTES_PER_ENTRY)
 
+#define DEFAULT_UDP_LIMIT	-1
+#define UDP_FBITS		32
+#define UDP_FMASK		(((u64)1 << UDP_FBITS) - 1)
+
+/* Upper 32 bits are 'limit' and lower 32 bits are 'usage' */
+static s32 get_udp_limit(u64 limitandusage)
+{
+	return (s32)(limitandusage >> UDP_FBITS);
+}
+
+static s32 get_udp_usage(u64 limitandusage)
+{
+	return (s32)(limitandusage & UDP_FMASK);
+}
+
+static u64 set_udp_usage(u64 limitandusage, s32 usage)
+{
+	return (u64)((limitandusage & ~UDP_FMASK) | usage);
+}
+
+static u64 set_udp_limit_usage(s32 limit, s32 usage)
+{
+	return (u64)(((u64)limit << UDP_FBITS) | usage);
+}
+
 static struct net_cgroup *css_to_net_cgroup(struct cgroup_subsys_state *css)
 {
 	return css ? container_of(css, struct net_cgroup, css) : NULL;
@@ -120,6 +145,7 @@ cgrp_css_alloc(struct cgroup_subsys_state *parent_css)
 {
 	struct net_cgroup *netcg;
 	struct net_cgroup *parent_netcg = css_to_net_cgroup(parent_css);
+	s32 parent_udp_limit;
 
 	netcg = kzalloc(sizeof(*netcg), GFP_KERNEL);
 	if (!netcg)
@@ -140,6 +166,9 @@ cgrp_css_alloc(struct cgroup_subsys_state *parent_css)
 			/* if any of these cause an error, return ENOMEM */
 			return ERR_PTR(-ENOMEM);
 		}
+		/* and set no limit on udp ports */
+		atomic64_set(&netcg->udp_stats.udp_limitandusage,
+			     set_udp_limit_usage(DEFAULT_UDP_LIMIT, 0));
 	} else {
 		/* if not root, then, inherit ranges from parent */
 		if (alloc_copy_net_ranges(
@@ -154,6 +183,11 @@ cgrp_css_alloc(struct cgroup_subsys_state *parent_css)
 			/* if any of these cause an error, return ENOMEM */
 			return ERR_PTR(-ENOMEM);
 		}
+		/* and inherit udp port limit from parent */
+		parent_udp_limit = get_udp_limit(atomic64_read(
+			&parent_netcg->udp_stats.udp_limitandusage));
+		atomic64_set(&netcg->udp_stats.udp_limitandusage,
+			     set_udp_limit_usage(parent_udp_limit, 0));
 	}
 
 	return &netcg->css;
@@ -203,6 +237,212 @@ bool net_cgroup_listen_allowed(u16 port)
 }
 EXPORT_SYMBOL_GPL(net_cgroup_listen_allowed);
 
+static s64 net_udp_read_s64(struct cgroup_subsys_state *css, struct cftype *cft)
+{
+	struct  net_cgroup *netcg = css_to_net_cgroup(css);
+	s32 value = 0;
+
+	switch (cft->private) {
+	case NETCG_LIMIT_UDP:
+		value = get_udp_limit(
+			atomic64_read(&netcg->udp_stats.udp_limitandusage));
+		break;
+	case NETCG_USAGE_UDP:
+		value = get_udp_usage(
+			atomic64_read(&netcg->udp_stats.udp_limitandusage));
+		break;
+	case NETCG_MAXUSAGE_UDP:
+		value = atomic_read(&netcg->udp_stats.udp_maxusage);
+		break;
+	case NETCG_FAILCNT_UDP:
+		value = atomic_read(&netcg->udp_stats.udp_failcnt);
+		break;
+	case NETCG_UNDERFLOWCNT_UDP:
+		value = atomic_read(&netcg->udp_stats.udp_underflowcnt);
+		break;
+	default:
+		value = 0; /* this shouldn't happen */
+		break;
+	}
+
+	return (s64)value;
+}
+
+static int net_udp_write_s64(struct cgroup_subsys_state *css,
+			     struct cftype *cft, s64 val)
+{
+	struct net_cgroup *netcg = css_to_net_cgroup(css);
+	s32 oldlimit, usage;
+	u64     limitandusage, oldbits, newbits;
+
+	/* Make sure that 'val' is a 32 bit int */
+	if (val != (s32)val || val < -1)
+		return -EINVAL;
+
+	limitandusage = atomic64_read(&netcg->udp_stats.udp_limitandusage);
+	for (;;) {
+		oldlimit = get_udp_limit(limitandusage);
+		usage = get_udp_usage(limitandusage);
+		if (unlikely(oldlimit == (s32)val))
+			break;
+		newbits = set_udp_limit_usage((s32)val, usage);
+		oldbits = atomic64_cmpxchg(&netcg->udp_stats.udp_limitandusage,
+					   limitandusage, newbits);
+		if (likely(oldbits == limitandusage))
+			break;
+		limitandusage = oldbits;
+	}
+
+	return 0;
+}
+
+static bool try_inc_udp_usage(struct net_cgroup *netcg)
+{
+	s32 limit, usage, oldusage, maxusage;
+	u64 limitandusage, newbits, oldbits;
+
+	limitandusage = atomic64_read(&netcg->udp_stats.udp_limitandusage);
+	for (;;) {
+		usage = get_udp_usage(limitandusage);
+		limit = get_udp_limit(limitandusage);
+		/* Default indicates no restriction. */
+		if ((limit != DEFAULT_UDP_LIMIT) && unlikely(usage >= limit)) {
+			atomic_inc(&netcg->udp_stats.udp_failcnt);
+			return false;
+		}
+		/* Increment the usage irrespective of the fact that there is
+		 * limit set or not to record the usage.
+		 */
+		++usage;
+		newbits = set_udp_usage(limitandusage, usage);
+		oldbits = atomic64_cmpxchg(&netcg->udp_stats.udp_limitandusage,
+					   limitandusage, newbits);
+		if (likely(oldbits == limitandusage))
+			break;
+		limitandusage = oldbits;
+	}
+
+	maxusage = atomic_read(&netcg->udp_stats.udp_maxusage);
+
+	while (usage > maxusage) {
+		oldusage = atomic_cmpxchg(&netcg->udp_stats.udp_maxusage,
+					  maxusage, usage);
+		if (likely(oldusage == maxusage))
+			break;
+		maxusage = oldusage;
+	}
+	return true;
+}
+
+static bool try_dec_udp_usage(struct net_cgroup *netcg)
+{
+	s32 usage;
+	u64 limitandusage, newbits, oldbits;
+
+	limitandusage = atomic64_read(&netcg->udp_stats.udp_limitandusage);
+	for (;;) {
+		usage = get_udp_usage(limitandusage);
+		if (unlikely(usage <= 0)) {
+			atomic_inc(&netcg->udp_stats.udp_underflowcnt);
+			return false;
+		}
+		--usage;
+		newbits = set_udp_usage(limitandusage, usage);
+		oldbits = atomic64_cmpxchg(&netcg->udp_stats.udp_limitandusage,
+					   limitandusage, newbits);
+		if (likely(oldbits == limitandusage))
+			break;
+		limitandusage = oldbits;
+	}
+
+	return true;
+}
+
+/* The feature exposes following values through the cgroup interface:
+ * (1) udp_limit: Maximum number of UDP ports that processes from this
+ *                container can use.
+ * (2) udp_usage: Current usage of UDP ports by processes in this container.
+ * (3) udp_maxusage: The peak usage of UDP ports since container creation.
+ * (4) udp_failcnt: Number of port allocation requests failed because of
+ *                  ports depletion for this container.
+ * (5) udp_underflowcnt: Number of port release requests that would have
+ *                       pushed the usage below zero (see description below).
+ *
+ * Caveats:
+ *    If a process is moved to a different container, the udp sockets are not
+ * accounted for that process in the destination container. When that process
+ * finishes; that much credit will not be returned to the source container.
+ * This will create some sort of discrepancy at both source and destination
+ * containers. This transfer would create a pseudo-permanent transfer of port
+ * credits to the destination container. The trasfer effect will be nullified
+ * only if all the processes in this destination container stop using the udp
+ * ports momentarily and total usage drops to ZERO (since we do not allow the
+ * usage count to go negative, the pseudo-permanent transfer gets nullified
+ * at the destination container). It's assumed that this kind of process
+ * migration is minimal.
+ *    The limitandusage (64 bit) field holds limit in upper 32 bits and current
+ * usage in lower 32 bits. Since this is one field, the check and update atomic
+ * operations eliminate possible mismatch on multi-core systems.
+ */
+bool net_cgroup_acquire_udp_port(void)
+{
+	struct net_cgroup *netcg;
+	struct net_cgroup *curr;
+	struct net_cgroup *curr2;
+	bool success = true;
+
+	rcu_read_lock();
+	netcg = task_to_net_cgroup(current);
+
+	/* iterate this net_cgroup and its ancestors, attempting increment the
+	 * usage at each step
+	 */
+	for (curr = netcg;
+	     net_cgroup_to_parent(curr);
+	     curr = net_cgroup_to_parent(curr)) {
+		if (!try_inc_udp_usage(curr)) {
+			/* get out if any one ancestor fails */
+			success = false;
+			break;
+		}
+	}
+
+	if (!success) {
+		/* one of the ancestors failed to increment its usage. now, we
+		 * need to undo all the increments we did
+		 */
+		for (curr2 = netcg;
+		     curr2 != curr;
+		     curr2 = net_cgroup_to_parent(curr2)) {
+			try_dec_udp_usage(curr2);
+		}
+	}
+
+	rcu_read_unlock();
+	return success;
+}
+EXPORT_SYMBOL_GPL(net_cgroup_acquire_udp_port);
+
+void net_cgroup_release_udp_port(void)
+{
+	struct net_cgroup *netcg;
+	struct net_cgroup *curr;
+
+	rcu_read_lock();
+	netcg = task_to_net_cgroup(current);
+
+	/* iterate this net_cgroup and its ancestors, attempting decrement the
+	 * usage at each step
+	 */
+	for (curr = netcg;
+	     net_cgroup_to_parent(curr);
+	     curr = net_cgroup_to_parent(curr)) {
+		try_dec_udp_usage(curr);
+	}
+	rcu_read_unlock();
+}
+EXPORT_SYMBOL_GPL(net_cgroup_release_udp_port);
+
 /* Returns true if the range r is a subset of at least one of the ranges in
  * rs, and returns false otherwise.
  */
@@ -393,6 +633,39 @@ static struct cftype ss_files[] = {
 		.private	= NETCG_BIND_RANGES,
 		.max_write_len	= MAX_WRITE_SIZE,
 	},
+	{
+		.name		= "udp_limit",
+		.flags		= CFTYPE_ONLY_ON_ROOT,
+		.read_s64	= net_udp_read_s64,
+		.private	= NETCG_LIMIT_UDP,
+	},
+	{
+		.name		= "udp_limit",
+		.flags		= CFTYPE_NOT_ON_ROOT,
+		.read_s64	= net_udp_read_s64,
+		.write_s64	= net_udp_write_s64,
+		.private	= NETCG_LIMIT_UDP,
+	},
+	{
+		.name		= "udp_usage",
+		.read_s64	= net_udp_read_s64,
+		.private	= NETCG_USAGE_UDP,
+	},
+	{
+		.name		= "udp_maxusage",
+		.read_s64	= net_udp_read_s64,
+		.private	= NETCG_MAXUSAGE_UDP,
+	},
+	{
+		.name		= "udp_failcnt",
+		.read_s64	= net_udp_read_s64,
+		.private	= NETCG_FAILCNT_UDP,
+	},
+	{
+		.name		= "udp_underflowcnt",
+		.read_s64	= net_udp_read_s64,
+		.private	= NETCG_UNDERFLOWCNT_UDP,
+	},
 	{ }	/* terminate */
 };
 
diff --git a/net/ipv4/udp.c b/net/ipv4/udp.c
index e61f7cd..fe588cf 100644
--- a/net/ipv4/udp.c
+++ b/net/ipv4/udp.c
@@ -114,6 +114,7 @@
 #include <net/busy_poll.h>
 #include "udp_impl.h"
 #include <net/sock_reuseport.h>
+#include <net/net_cgroup.h>
 
 struct udp_table udp_table __read_mostly;
 EXPORT_SYMBOL(udp_table);
@@ -318,6 +319,11 @@ scan_primary_hash:
 			goto fail_unlock;
 	}
 found:
+	/* Make sure it's UDP (and not UDP-lite) and it's unhashed! */
+	if (sk->sk_protocol == IPPROTO_UDP && sk_unhashed(sk) &&
+	    !net_cgroup_acquire_udp_port())
+		goto fail_unlock;
+
 	inet_sk(sk)->inet_num = snum;
 	udp_sk(sk)->udp_port_hash = snum;
 	udp_sk(sk)->udp_portaddr_hash ^= snum;
@@ -1388,6 +1394,8 @@ void udp_lib_unhash(struct sock *sk)
 		if (rcu_access_pointer(sk->sk_reuseport_cb))
 			reuseport_detach_sock(sk);
 		if (sk_del_node_init_rcu(sk)) {
+			if (sk->sk_protocol == IPPROTO_UDP)
+				net_cgroup_release_udp_port();
 			hslot->count--;
 			inet_sk(sk)->inet_num = 0;
 			sock_prot_inuse_add(sock_net(sk), sk->sk_prot, -1);
-- 
2.8.0.rc3.226.g39d4020

^ permalink raw reply related	[flat|nested] 25+ messages in thread

* [PATCH 4/5] net: add dscp ranges to net cgroup
  2016-08-11  0:53 [PATCH 0/5] Networking cgroup controller Anoop Naravaram
                   ` (2 preceding siblings ...)
       [not found] ` <1470876798-4024-1-git-send-email-anaravaram-hpIqsD4AKlfQT0dZR+AlfA@public.gmane.org>
@ 2016-08-11  0:53 ` Anoop Naravaram
  2016-08-11  0:53 ` [PATCH 5/5] net: add test for " Anoop Naravaram
                   ` (2 subsequent siblings)
  6 siblings, 0 replies; 25+ messages in thread
From: Anoop Naravaram @ 2016-08-11  0:53 UTC (permalink / raw)
  To: corbet, tj, lizefan, hannes, davem, kuznet, jmorris, yoshfuji,
	kaber, linux-doc, cgroups, netdev
  Cc: edumazet, maheshb, weiwan, tom, Anoop Naravaram

dscp ranges
----------
This property controls which dscp values the processes in a cgroup are
allowed to use. A process in a cgroup will receive an EACCES error if it
tries to do any of these things:
* set a socket's IP_TOS option to a value whose dscp field (bits 7:2) is
  outside the range
* use a socket to send a message in which the IP_TOS ancillary data is
  set to a value whose dscp field is outside the range

This property is exposed to userspace through the 'net.dscp_ranges' file,
similar to the bind and listen port ranges.

Tested: wrote python to attempt to setsockopt the IP_TOS option to a
value with an out-of-range dscp field, and expect a failure

Signed-off-by: Anoop Naravaram <anaravaram@google.com>
---
 Documentation/cgroup-v1/net.txt | 14 ++++++++++++++
 include/net/net_cgroup.h        |  6 ++++++
 net/core/net_cgroup.c           | 34 ++++++++++++++++++++++++++++++++--
 net/ipv4/ip_sockglue.c          | 13 +++++++++++++
 net/ipv6/datagram.c             |  9 +++++++++
 net/ipv6/ipv6_sockglue.c        |  8 ++++++++
 6 files changed, 82 insertions(+), 2 deletions(-)

diff --git a/Documentation/cgroup-v1/net.txt b/Documentation/cgroup-v1/net.txt
index a14fd1c..ea2f1db 100644
--- a/Documentation/cgroup-v1/net.txt
+++ b/Documentation/cgroup-v1/net.txt
@@ -30,6 +30,20 @@ This property is exposed to userspace through the 'net.listen_port_ranges' file,
 as ranges of ports that the processes can listen on (as described in the HOW TO
 INTERACT WITH RANGES FILES section).
 
+dscp ranges
+-----------
+This property controls which dscp values the processes in a cgroup are
+allowed to use. A process in a cgroup will receive an EACCES error if it
+tries to do any of these things:
+* set a socket's IP_TOS option to a value whose dscp field (bits 7:2) is
+  outside the range
+* use a socket to send a message in which the IP_TOS ancillary data is
+  set to a value whose dscp field is outside the range
+
+This property is exposed to userspace through the 'net.dscp_ranges' file, as
+ranges of dscp values that the process can use (as described in the HOW TO
+INTERACT WITH RANGES FILES section).
+
 udp port usage and limit
 ------------------------
 This property controls the limit of udp ports that can be used by the
diff --git a/include/net/net_cgroup.h b/include/net/net_cgroup.h
index 25a9def..d89e98d 100644
--- a/include/net/net_cgroup.h
+++ b/include/net/net_cgroup.h
@@ -23,6 +23,7 @@
 enum {
 	NETCG_LISTEN_RANGES,
 	NETCG_BIND_RANGES,
+	NETCG_DSCP_RANGES,
 	NETCG_NUM_RANGE_TYPES
 };
 
@@ -73,6 +74,7 @@ struct net_cgroup {
 
 bool net_cgroup_bind_allowed(u16 port);
 bool net_cgroup_listen_allowed(u16 port);
+bool net_cgroup_dscp_allowed(u8 dscp);
 bool net_cgroup_acquire_udp_port(void);
 void net_cgroup_release_udp_port(void);
 
@@ -85,6 +87,10 @@ static inline bool net_cgroup_listen_allowed(u16 port)
 {
 	return true;
 }
+static inline bool net_cgroup_dscp_allowed(u8 dscp)
+{
+	return true;
+}
 static inline bool net_cgroup_acquire_udp_port(void)
 {
 	return true;
diff --git a/net/core/net_cgroup.c b/net/core/net_cgroup.c
index 2f58e13..73dc5e7 100644
--- a/net/core/net_cgroup.c
+++ b/net/core/net_cgroup.c
@@ -21,6 +21,9 @@
 #define MIN_PORT_VALUE		0
 #define MAX_PORT_VALUE		65535
 
+#define MIN_DSCP_VALUE		0
+#define MAX_DSCP_VALUE		63
+
 /* Deriving MAX_ENTRIES from MAX_WRITE_SIZE as a rough estimate */
 #define MAX_ENTRIES ((MAX_WRITE_SIZE - offsetof(struct net_ranges, range)) /   \
 		     BYTES_PER_ENTRY)
@@ -161,7 +164,10 @@ cgrp_css_alloc(struct cgroup_subsys_state *parent_css)
 				MIN_PORT_VALUE, MAX_PORT_VALUE) ||
 		    alloc_init_net_ranges(
 				&netcg->whitelists[NETCG_LISTEN_RANGES],
-				MIN_PORT_VALUE, MAX_PORT_VALUE)) {
+				MIN_PORT_VALUE, MAX_PORT_VALUE) ||
+		    alloc_init_net_ranges(
+				&netcg->whitelists[NETCG_DSCP_RANGES],
+				MIN_DSCP_VALUE, MAX_DSCP_VALUE)) {
 			free_net_cgroup(netcg);
 			/* if any of these cause an error, return ENOMEM */
 			return ERR_PTR(-ENOMEM);
@@ -178,7 +184,11 @@ cgrp_css_alloc(struct cgroup_subsys_state *parent_css)
 		    alloc_copy_net_ranges(
 				&netcg->whitelists[NETCG_LISTEN_RANGES],
 				MIN_PORT_VALUE, MAX_PORT_VALUE,
-				&parent_netcg->whitelists[NETCG_LISTEN_RANGES])) {
+				&parent_netcg->whitelists[NETCG_LISTEN_RANGES]) ||
+		    alloc_copy_net_ranges(
+				&netcg->whitelists[NETCG_DSCP_RANGES],
+				MIN_DSCP_VALUE, MAX_DSCP_VALUE,
+				&parent_netcg->whitelists[NETCG_DSCP_RANGES])) {
 			free_net_cgroup(netcg);
 			/* if any of these cause an error, return ENOMEM */
 			return ERR_PTR(-ENOMEM);
@@ -237,6 +247,12 @@ bool net_cgroup_listen_allowed(u16 port)
 }
 EXPORT_SYMBOL_GPL(net_cgroup_listen_allowed);
 
+bool net_cgroup_dscp_allowed(u8 dscp)
+{
+	return net_cgroup_value_allowed(dscp, NETCG_DSCP_RANGES);
+}
+EXPORT_SYMBOL_GPL(net_cgroup_dscp_allowed);
+
 static s64 net_udp_read_s64(struct cgroup_subsys_state *css, struct cftype *cft)
 {
 	struct  net_cgroup *netcg = css_to_net_cgroup(css);
@@ -634,6 +650,20 @@ static struct cftype ss_files[] = {
 		.max_write_len	= MAX_WRITE_SIZE,
 	},
 	{
+		.name		= "dscp_ranges",
+		.flags		= CFTYPE_ONLY_ON_ROOT,
+		.seq_show	= net_read_ranges,
+		.private	= NETCG_DSCP_RANGES,
+	},
+	{
+		.name		= "dscp_ranges",
+		.flags		= CFTYPE_NOT_ON_ROOT,
+		.seq_show	= net_read_ranges,
+		.write		= net_write_ranges,
+		.private	= NETCG_DSCP_RANGES,
+		.max_write_len	= MAX_WRITE_SIZE,
+	},
+	{
 		.name		= "udp_limit",
 		.flags		= CFTYPE_ONLY_ON_ROOT,
 		.read_s64	= net_udp_read_s64,
diff --git a/net/ipv4/ip_sockglue.c b/net/ipv4/ip_sockglue.c
index 71a52f4d..71a4297 100644
--- a/net/ipv4/ip_sockglue.c
+++ b/net/ipv4/ip_sockglue.c
@@ -42,6 +42,7 @@
 #include <net/transp_v6.h>
 #endif
 #include <net/ip_fib.h>
+#include <net/net_cgroup.h>
 
 #include <linux/errqueue.h>
 #include <asm/uaccess.h>
@@ -289,6 +290,11 @@ int ip_cmsg_send(struct sock *sk, struct msghdr *msg, struct ipcm_cookie *ipc,
 			val = *(int *)CMSG_DATA(cmsg);
 			if (val < 0 || val > 255)
 				return -EINVAL;
+			/* val is 8-bit tos, we need to rightshift 2 to get the
+			 * 6-bit dscp field
+			 */
+			if (!net_cgroup_dscp_allowed(val >> 2))
+				return -EACCES;
 			ipc->tos = val;
 			ipc->priority = rt_tos2priority(ipc->tos);
 			break;
@@ -727,6 +733,13 @@ static int do_ip_setsockopt(struct sock *sk, int level,
 			val &= ~INET_ECN_MASK;
 			val |= inet->tos & INET_ECN_MASK;
 		}
+		/* val is 8-bit tos, we need to rightshift 2 to get the
+		 * 6-bit dscp field
+		 */
+		if (!net_cgroup_dscp_allowed(val >> 2)) {
+			err = -EACCES;
+			break;
+		}
 		if (inet->tos != val) {
 			inet->tos = val;
 			sk->sk_priority = rt_tos2priority(val);
diff --git a/net/ipv6/datagram.c b/net/ipv6/datagram.c
index 37874e2..9053b83 100644
--- a/net/ipv6/datagram.c
+++ b/net/ipv6/datagram.c
@@ -34,6 +34,7 @@
 
 #include <linux/errqueue.h>
 #include <asm/uaccess.h>
+#include <net/net_cgroup.h>
 
 static bool ipv6_mapped_addr_any(const struct in6_addr *a)
 {
@@ -973,6 +974,14 @@ int ip6_datagram_send_ctl(struct net *net, struct sock *sk,
 			if (tc < -1 || tc > 0xff)
 				goto exit_f;
 
+			/* tc is 8-bit tclass, we need to rightshift 2 to get
+			 * the 6-bit dscp field
+			 */
+			if (!net_cgroup_dscp_allowed(tc >> 2)) {
+				err = -EACCES;
+				goto exit_f;
+			}
+
 			err = 0;
 			ipc6->tclass = tc;
 
diff --git a/net/ipv6/ipv6_sockglue.c b/net/ipv6/ipv6_sockglue.c
index a9895e1..eac3f88 100644
--- a/net/ipv6/ipv6_sockglue.c
+++ b/net/ipv6/ipv6_sockglue.c
@@ -52,6 +52,7 @@
 #include <net/udplite.h>
 #include <net/xfrm.h>
 #include <net/compat.h>
+#include <net/net_cgroup.h>
 
 #include <asm/uaccess.h>
 
@@ -339,6 +340,13 @@ static int do_ipv6_setsockopt(struct sock *sk, int level, int optname,
 		/* RFC 3542, 6.5: default traffic class of 0x0 */
 		if (val == -1)
 			val = 0;
+		/* val is 8-bit tclass, we need to rightshift 2 to get the 6-bit
+		 * dscp field
+		 */
+		if (!net_cgroup_dscp_allowed(val >> 2)) {
+			retv = -EACCES;
+			break;
+		}
 		np->tclass = val;
 		retv = 0;
 		break;
-- 
2.8.0.rc3.226.g39d4020

^ permalink raw reply related	[flat|nested] 25+ messages in thread

* [PATCH 5/5] net: add test for net cgroup
  2016-08-11  0:53 [PATCH 0/5] Networking cgroup controller Anoop Naravaram
                   ` (3 preceding siblings ...)
  2016-08-11  0:53 ` [PATCH 4/5] net: add dscp ranges to net cgroup Anoop Naravaram
@ 2016-08-11  0:53 ` Anoop Naravaram
  2016-08-13  4:38   ` Alexei Starovoitov
  2016-08-23  8:49 ` [PATCH 0/5] Networking cgroup controller Parav Pandit
  2016-08-24 21:03 ` Tejun Heo
  6 siblings, 1 reply; 25+ messages in thread
From: Anoop Naravaram @ 2016-08-11  0:53 UTC (permalink / raw)
  To: corbet, tj, lizefan, hannes, davem, kuznet, jmorris, yoshfuji,
	kaber, linux-doc, cgroups, netdev
  Cc: edumazet, maheshb, weiwan, tom, Anoop Naravaram

Created a file scripts/cgroup/net_cgroup_test.py that tests the
functionality of the net cgroup as described in previous commit logs.

Signed-off-by: Anoop Naravaram <anaravaram@google.com>
---
 scripts/cgroup/net_cgroup_test.py | 359 ++++++++++++++++++++++++++++++++++++++
 1 file changed, 359 insertions(+)
 create mode 100755 scripts/cgroup/net_cgroup_test.py

diff --git a/scripts/cgroup/net_cgroup_test.py b/scripts/cgroup/net_cgroup_test.py
new file mode 100755
index 0000000..604f662
--- /dev/null
+++ b/scripts/cgroup/net_cgroup_test.py
@@ -0,0 +1,359 @@
+#!/usr/grte/v4/bin/python2.7
+import unittest
+import os
+import socket
+import shutil
+import multiprocessing
+
+cgroup_net_root = '/dev/cgroup/net'
+
+def create_cgroup(name):
+    '''
+    Creates a cgroup with the given name. The name should also include the names
+    of all ancestors separated by slashes, such as 'a/b/c'. Returns a path to
+    the directory of the newly created cgroup.
+    '''
+    cgroup_dir = os.path.join(cgroup_net_root, name)
+    while True:
+        try:
+            os.mkdir(cgroup_dir)
+            break
+        except OSError as e:
+            # remove it if it already exists, then try to create again
+            # there will be errors when rmtree tries to remove the cgroup files,
+            # but these errors should be ignored because we only care about
+            # rmdir'ing the directories, which will automatically get rid of the
+            # files inside them
+            shutil.rmtree(cgroup_dir, ignore_errors=True)
+
+    return cgroup_dir
+
+
+def parse_ranges(ranges_str):
+    '''
+    Converts a range string like "100-200,300-400" into a set of 2-tuples like
+    {(100,200),(300,400)}.
+    '''
+    return set(tuple(int(l) for l in r.strip().split('-'))
+               for r in ranges_str.split(','))
+
+def acquire_udp_ports(cgroup_dir, e2, n, addr, numfailq):
+    '''
+    Waits for the event e1, attempts to acquire n udp ports connected to addr,
+    and then puts the number of failures on the queue and waits for e2. Then,
+    all sockets are closed. (Intended to be called as a subprocess.)
+
+    While waiting for e1, the parent process can set this process's cgroup.
+    While waiting for e2, the parent process can read the udp statistics while
+    this process is still alive.
+    '''
+
+    with open(os.path.join(cgroup_dir, 'tasks'), 'w') as f:
+        # add proc1 to cgroup a/b
+        f.write(str(os.getpid()))
+
+    socketset = set()
+    numfail = 0
+    for _ in xrange(n):
+        s = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
+        try:
+            s.connect(addr)
+        except socket.error as e:
+            numfail += 1
+        socketset.add(s)
+
+    if numfailq is not None:
+        numfailq.put(numfail)
+
+    if e2 is not None:
+        e2.wait()
+
+    for s in socketset:
+        s.close()
+
+
+class NetCgroupTest(unittest.TestCase):
+
+    def test_port_range_check(self):
+        '''
+        Test that the kernel is correctly checking that a port is in the
+        relevant range when a process in a net cgroup is trying to bind a socket
+        to it, or trying to listen on a socket bound to it.
+        '''
+
+        # create a new cgroup a
+        cgroup_a_dir = create_cgroup('a')
+
+        # add current process to cgroup a
+        with open(os.path.join(cgroup_a_dir, 'tasks'), 'w') as f:
+            f.write(str(os.getpid()))
+
+        # set bind and listen range of cgroup a
+        with open(os.path.join(cgroup_a_dir, 'net.bind_port_ranges'), 'w') as f:
+            f.write('300-400,500')
+        with open(os.path.join(cgroup_a_dir, 'net.listen_port_ranges'), 'w') as f:
+            f.write('350-400')
+
+        # try binding and listening on various ports, and check if they succeed
+        # or fail appropriately
+        s = socket.socket()
+        try:
+            s.bind(('0.0.0.0', 350)) # should bind and listen successfully
+            try:
+                s.listen(5)
+            except socket.error:
+                self.fail('unexpectedly failed to listen')
+        except socket.error:
+            self.fail('unexpectedly failed to bind')
+        s.close()
+
+        s = socket.socket()
+        try:
+            s.bind(('0.0.0.0', 370)) # should bind and listen successfully
+            try:
+                s.listen(5)
+            except socket.error:
+                self.fail('unexpectedly failed to listen')
+        except socket.error:
+            self.fail('unexpectedly failed to bind')
+        s.close()
+
+        s = socket.socket()
+        try:
+            s.bind(('0.0.0.0', 320)) # should bind successfully but fail to listen
+            with self.assertRaises(socket.error):
+                s.listen(5)
+        except socket.error:
+            self.fail('unexpectedly failed to bind')
+        s.close()
+
+        s = socket.socket()
+        try:
+            s.bind(('0.0.0.0', 500)) # should bind successfully but fail to listen
+            with self.assertRaises(socket.error):
+                s.listen(5)
+        except socket.error:
+            self.fail('unexpectedly failed to bind')
+        s.close()
+
+        s = socket.socket()
+        with self.assertRaises(socket.error):
+            s.bind(('0.0.0.0', 200)) # should fail to bind
+        s.close()
+
+        s = socket.socket()
+        with self.assertRaises(socket.error):
+            s.bind(('0.0.0.0', 401)) # should fail to bind
+        s.close()
+
+        # remove current process from cgroup a (by adding it to root)
+        with open(os.path.join(cgroup_net_root, 'tasks'), 'w') as f:
+            f.write(str(os.getpid()))
+
+
+    def test_range_inheritance(self):
+        '''
+        Test that the kernel copies the ranges from parent when a net cgroup is
+        created.
+        '''
+        cgroup_a_dir = create_cgroup('a')
+
+        # set ranges of parent
+        with open(os.path.join(cgroup_a_dir, 'net.bind_port_ranges'), 'w') as f:
+            f.write('100-200,300-400,500')
+        with open(os.path.join(cgroup_a_dir, 'net.listen_port_ranges'), 'w') as f:
+            f.write('150,300')
+
+        cgroup_b_dir = create_cgroup('a/b')
+
+        # check that bind range is the same in both a and a/b
+        with open(os.path.join(cgroup_a_dir, 'net.bind_port_ranges')) as fa, \
+             open(os.path.join(cgroup_b_dir, 'net.bind_port_ranges')) as fb:
+            ranges_a_str = fa.read()
+            ranges_b_str = fb.read()
+            ranges_a = parse_ranges(ranges_a_str)
+            ranges_b = parse_ranges(ranges_b_str)
+            self.assertEqual(ranges_a, ranges_b)
+
+        # check that listen range is the same in both a and a/b
+        with open(os.path.join(cgroup_a_dir, 'net.listen_port_ranges')) as fa, \
+             open(os.path.join(cgroup_b_dir, 'net.listen_port_ranges')) as fb:
+            ranges_a_str = fa.read()
+            ranges_b_str = fb.read()
+            ranges_a = parse_ranges(ranges_a_str)
+            ranges_b = parse_ranges(ranges_b_str)
+            self.assertEqual(ranges_a, ranges_b)
+
+    def test_enforce_subset(self):
+        '''
+        Test that the kernel enforces the rule that a cgroup cannot have a port
+        within its range that isn't in its parent's range.
+        '''
+        cgroup_a_dir = create_cgroup('a')
+
+        # set ranges of parent
+        with open(os.path.join(cgroup_a_dir, 'net.bind_port_ranges'), 'w') as f:
+            f.write('100-200,300-400,500')
+        with open(os.path.join(cgroup_a_dir, 'net.listen_port_ranges'), 'w') as f:
+            f.write('150,300')
+
+        cgroup_b_dir = create_cgroup('a/b')
+
+        # try to set a/b ranges to various things
+
+        with open(os.path.join(cgroup_b_dir, 'net.bind_port_ranges'), 'w') as f:
+            try:
+                f.write('130-160,500') # should succeed
+            except IOError as e:
+                self.fail('unexpectedly failed to set ranges')
+        with open(os.path.join(cgroup_b_dir, 'net.listen_port_ranges'), 'w') as f:
+            try:
+                f.write('150') # should succeed
+            except IOError as e:
+                self.fail('unexpectedly failed to set ranges')
+
+        with open(os.path.join(cgroup_b_dir, 'net.bind_port_ranges'), 'w') as f:
+            with self.assertRaises(IOError):
+                f.write('200-300,350,360-370') # should fail
+        with open(os.path.join(cgroup_b_dir, 'net.listen_port_ranges'), 'w') as f:
+            with self.assertRaises(IOError):
+                f.write('200-300') # should fail
+
+        with open(os.path.join(cgroup_b_dir, 'net.bind_port_ranges'), 'w') as f:
+            with self.assertRaises(IOError):
+                f.write('210,220,230-240') # should fail
+        with open(os.path.join(cgroup_b_dir, 'net.listen_port_ranges'), 'w') as f:
+            with self.assertRaises(IOError):
+                f.write('210,220,230-240') # should fail
+
+    def test_udp_usage(self):
+        '''
+        Tests if the kernel counts udp usage in a hierarchical manner.
+        '''
+
+        # create a new cgroups a, a/b, a/c
+        cgroup_a_dir = create_cgroup('a')
+        cgroup_b_dir = create_cgroup('a/b')
+        cgroup_c_dir = create_cgroup('a/c')
+
+        # event for synchronizing subprocesses
+        e3 = multiprocessing.Event()
+
+        # create a server socket so that subprocesses can connect to it
+        addr = ('0.0.0.0', 3000)
+        serversocket = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
+        serversocket.bind(addr)
+
+        numfailq1 = multiprocessing.Queue()
+        numfailq2 = multiprocessing.Queue()
+
+        # create 2 subprocesses, one going into a/b, other going into a/c
+        proc1 = multiprocessing.Process(name='proc1', target=acquire_udp_ports,
+                                        args=(cgroup_b_dir, e3, 4, addr, numfailq1))
+        proc2 = multiprocessing.Process(name='proc2', target=acquire_udp_ports,
+                                        args=(cgroup_c_dir, e3, 3, addr, numfailq2))
+
+        # proc1 will acquire 4 ports in cgroup a/b
+        proc1.start()
+        # make sure none failed
+        self.assertEqual(numfailq1.get(), 0)
+
+        # proc2 will acquire 3 ports in cgroup a/c
+        proc2.start()
+        # make sure none failed
+        self.assertEqual(numfailq2.get(), 0)
+
+        # check if the usage count is correct
+        with open(os.path.join(cgroup_a_dir, 'net.udp_usage')) as f:
+            # a/net.udp_usage should be 7, because it counts its children's too
+            self.assertEqual(int(f.read()), 7)
+        with open(os.path.join(cgroup_b_dir, 'net.udp_usage')) as f:
+            self.assertEqual(int(f.read()), 4)
+        with open(os.path.join(cgroup_c_dir, 'net.udp_usage')) as f:
+            self.assertEqual(int(f.read()), 3)
+
+        # signal the subprocesses that they can close their sockets now
+        e3.set()
+        serversocket.close()
+
+    def test_udp_limit(self):
+        '''
+        Test if the kernel only allows processes to use a limited number of udp
+        ports.
+        '''
+        # create a new cgroup a
+        cgroup_a_dir = create_cgroup('a')
+
+        # event for synchronizing subprocess
+        e2 = multiprocessing.Event()
+
+        # create a server socket so that subprocesses can connect to it
+        addr = ('0.0.0.0', 3000)
+        serversocket = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
+        serversocket.bind(addr)
+
+        # queue for getting fail count
+        numfailq = multiprocessing.Queue()
+
+        # create and start a subprocess
+        proc = multiprocessing.Process(name='proc', target=acquire_udp_ports,
+                                       args=(cgroup_a_dir, e2, 8, addr, numfailq))
+
+        # set the udp limit to 5
+        with open(os.path.join(cgroup_a_dir, 'net.udp_limit'), 'w') as f:
+            f.write('5')
+
+        # proc will attempt to acquire 8 udp port (only 5 will be successful)
+        proc.start()
+        # make sure that there were indeed 3 failures
+        self.assertEqual(numfailq.get(), 3)
+
+        # check if the usage count and fail count in the files is correct
+        with open(os.path.join(cgroup_a_dir, 'net.udp_usage')) as f:
+            self.assertEqual(int(f.read()), 5)
+        with open(os.path.join(cgroup_a_dir, 'net.udp_failcnt')) as f:
+            self.assertEqual(int(f.read()), 3)
+
+        # signal the subprocess that it can close its sockets now
+        e2.set()
+        serversocket.close()
+
+    def test_dscp_range_check(self):
+        # create a new cgroup a
+        cgroup_a_dir = create_cgroup('a')
+
+        # add current process to cgroup a
+        with open(os.path.join(cgroup_a_dir, 'tasks'), 'w') as f:
+            f.write(str(os.getpid()))
+
+        # set dscp range of cgroup a
+        with open(os.path.join(cgroup_a_dir, 'net.dscp_ranges'), 'w') as f:
+            f.write('20-40')
+
+        # try setting the IP_TOS option to various things
+        s = socket.socket()
+        try:
+            # dscp = 30 (in range)
+            s.setsockopt(socket.IPPROTO_IP, socket.IP_TOS, 30 << 2 | 1)
+        except socket.error:
+            self.fail('unexpectedly failed to setsockopt')
+        s.close()
+
+        s = socket.socket()
+        with self.assertRaises(socket.error):
+            # dscp = 50 (out of range)
+            s.setsockopt(socket.IPPROTO_IP, socket.IP_TOS, 50 << 2 | 2)
+        s.close()
+
+        s = socket.socket()
+        with self.assertRaises(socket.error):
+            # dscp = 10 (out of range)
+            s.setsockopt(socket.IPPROTO_IP, socket.IP_TOS, 10 << 2 | 3)
+        s.close()
+
+        # remove current process from cgroup a (by adding it to root)
+        with open(os.path.join(cgroup_net_root, 'tasks'), 'w') as f:
+            f.write(str(os.getpid()))
+
+if __name__ == '__main__':
+    unittest.main()
-- 
2.8.0.rc3.226.g39d4020

^ permalink raw reply related	[flat|nested] 25+ messages in thread

* Re: [PATCH 2/5] net: add bind/listen ranges to net cgroup
  2016-08-11  0:53   ` Anoop Naravaram
  (?)
@ 2016-08-13  4:29   ` Alexei Starovoitov
  2016-08-15 19:05     ` Mahesh Bandewar
  -1 siblings, 1 reply; 25+ messages in thread
From: Alexei Starovoitov @ 2016-08-13  4:29 UTC (permalink / raw)
  To: Anoop Naravaram
  Cc: corbet, tj, lizefan, hannes, davem, kuznet, jmorris, yoshfuji,
	kaber, linux-doc, cgroups, netdev, edumazet, maheshb, weiwan,
	tom

On Wed, Aug 10, 2016 at 05:53:15PM -0700, Anoop Naravaram wrote:
> bind port ranges
> ----------------
> This property controls which ports the processes in a cgroup are allowed
> to bind to. If a process in a cgroup tries to bind a socket to a port
> that is not within the range(s) permitted by the cgroup, it will receive an
> EACCES error.
> 
> From userspace, you can get or set the bind port ranges by accessing the
> 'net.bind_port_ranges' file. To set the ranges of a cgroup, write the
> comma-separated ranges to the file, where each range could be either a
> pair of ports separated by a hyphen (-), or just an individual port. For
> example, say you want to allow all the processes in a cgroup to be allowed
> to bind to ports 100 through 200 (inclusive), 300 through 320 (inclusive)
> and 350. Then you can write the string "100-200,300-320,350" to the
> 'net.bind_port_ranges' file. When reading the file, any individual ports
> will be read as a "start-end" range where the start and end are equal.
> The example above would be read as "100-200,300-320,350-350".
...
> +/* Returns true if the range r is a subset of at least one of the ranges in
> + * rs, and returns false otherwise.
> + */
> +static bool range_in_ranges(struct net_range *r, struct net_ranges *rs)
> +{
> +	int ri;
> +
> +	for (ri = 0; ri < rs->num_entries; ri++)
> +		if (r->min_value >= rs->range[ri].min_value &&
> +		    r->max_value <= rs->range[ri].max_value)
> +			return true;
> +
> +	return false;
> +}
> +
> +/* Returns true if all the ranges in rs1 are subsets of at least one of the
> + * ranges in rs2, ans returns false otherwise.
> + */
> +static bool ranges_in_ranges(struct net_ranges *rs1, struct net_ranges *rs2)
> +{
> +	int ri;
> +
> +	for (ri = 0; ri < rs1->num_entries; ri++)
> +		if (!range_in_ranges(&rs1->range[ri], rs2))
> +			return false;
> +
> +	return true;
> +}

This is not a scalable implementation. The next step would be to
do a binary search, hash table or something else here?
I think the root of the problem is in hard coded checks
that quickly become inefficient and inadequate as applicability
of the feature grows.
We already have BPF that can suite this purpose much better
without bloating the kernel code with parsing and matching logic.


^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH 3/5] net: add udp limit to net cgroup
       [not found]     ` <1470876798-4024-4-git-send-email-anaravaram-hpIqsD4AKlfQT0dZR+AlfA@public.gmane.org>
@ 2016-08-13  4:35       ` Alexei Starovoitov
  2016-08-15 19:16           ` Mahesh Bandewar
  0 siblings, 1 reply; 25+ messages in thread
From: Alexei Starovoitov @ 2016-08-13  4:35 UTC (permalink / raw)
  To: Anoop Naravaram
  Cc: corbet-T1hC0tSOHrs, tj-DgEjT+Ai2ygdnm+yROfE0A,
	lizefan-hv44wF8Li93QT0dZR+AlfA, hannes-druUgvl0LCNAfugRpC6u6w,
	davem-fT/PcQaiUtIeIZ0/mPfg9Q, kuznet-v/Mj1YrvjDBInbfyfbPRSQ,
	jmorris-gx6/JNMH7DfYtjvyW6yDsg, yoshfuji-VfPWfsRibaP+Ru+s062T9g,
	kaber-dcUjhNyLwpNeoWH0uzbU5w, linux-doc-u79uwXL29TY76Z2rM5mHXA,
	cgroups-u79uwXL29TY76Z2rM5mHXA, netdev-u79uwXL29TY76Z2rM5mHXA,
	edumazet-hpIqsD4AKlfQT0dZR+AlfA, maheshb-hpIqsD4AKlfQT0dZR+AlfA,
	weiwan-hpIqsD4AKlfQT0dZR+AlfA, tom-BjP2VixgY4xUbtYUoyoikg

On Wed, Aug 10, 2016 at 05:53:16PM -0700, Anoop Naravaram wrote:
> udp port limit
> --------------
> This property controls the limit of udp ports that can be used by the
> processes in a cgroup. The controller manages udp statistics (usage,
> limit, etc) for each cgroup. Every cgroup also keeps track of the udp
> ports acquired by its descendants. If a process tries to acquire a port
> when its cgroup has already reached its limit, it will fail with error
> EACCES. It will also fail if one of the cgroup's ancestors has reached its
> limit. There are 5 files exposed to userspace to configure this property:
> 
> * 'net.udp_usage': Reading this file gives the number of udp ports used by
> processes in this cgroup and all its descendants.
> * 'net.udp_limit': Writing this file sets the total number of udp ports
> that can be used by processes in this cgroup and all
> its descendants. This file can also be read.
> * 'net.udp_maxusage': Reading this file gives the highest value of
> net.udp_usage that has been seen for this cgroup.
> * 'net.udp_failcnt': Reading this file gives the number of times a
> process in this cgroup or one of its descendants has attempted to acquire
> a udp port but failed because the limit of this cgroup was reached.
> * 'net.udp_underflowcnt': Reading this file gives the number of times a
> process in this cgroup or one of its descendants released a udp port when
> the usage value of this cgroup was 0.

I have similar concern here. I don't think we should bloat the kernel by
trying implement every possible port restriction combination and count.
For the same reasons we don't do all possible tcp stats that research
community finds useful. This additional code is a maintenance headache.

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH 5/5] net: add test for net cgroup
  2016-08-11  0:53 ` [PATCH 5/5] net: add test for " Anoop Naravaram
@ 2016-08-13  4:38   ` Alexei Starovoitov
  0 siblings, 0 replies; 25+ messages in thread
From: Alexei Starovoitov @ 2016-08-13  4:38 UTC (permalink / raw)
  To: Anoop Naravaram
  Cc: corbet, tj, lizefan, hannes, davem, kuznet, jmorris, yoshfuji,
	kaber, linux-doc, cgroups, netdev, edumazet, maheshb, weiwan,
	tom

On Wed, Aug 10, 2016 at 05:53:18PM -0700, Anoop Naravaram wrote:
> Created a file scripts/cgroup/net_cgroup_test.py that tests the
> functionality of the net cgroup as described in previous commit logs.
> 
> Signed-off-by: Anoop Naravaram <anaravaram@google.com>
> ---
>  scripts/cgroup/net_cgroup_test.py | 359 ++++++++++++++++++++++++++++++++++++++

thanks for the test. It's great to have one, but it is concerning that
the test doesn't cover all the knobs the previous patches introduce
which underlines the point of my concern for previous patches.

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH 2/5] net: add bind/listen ranges to net cgroup
  2016-08-13  4:29   ` Alexei Starovoitov
@ 2016-08-15 19:05     ` Mahesh Bandewar
  2016-08-17  3:02       ` Alexei Starovoitov
  0 siblings, 1 reply; 25+ messages in thread
From: Mahesh Bandewar @ 2016-08-15 19:05 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: Anoop Naravaram, corbet, tj, lizefan, hannes, David Miller,
	kuznet, jmorris, yoshfuji, kaber, linux-doc, cgroups,
	linux-netdev, Eric Dumazet, Wei Wang, tom

On Fri, Aug 12, 2016 at 9:29 PM, Alexei Starovoitov
<alexei.starovoitov@gmail.com> wrote:
[...]
>> +static bool range_in_ranges(struct net_range *r, struct net_ranges *rs)
>> +{
>> +     int ri;
>> +
>> +     for (ri = 0; ri < rs->num_entries; ri++)
>> +             if (r->min_value >= rs->range[ri].min_value &&
>> +                 r->max_value <= rs->range[ri].max_value)
>> +                     return true;
>> +
>> +     return false;
>> +}
>> +
>> +/* Returns true if all the ranges in rs1 are subsets of at least one of the
>> + * ranges in rs2, ans returns false otherwise.
>> + */
>> +static bool ranges_in_ranges(struct net_ranges *rs1, struct net_ranges *rs2)
>> +{
>> +     int ri;
>> +
>> +     for (ri = 0; ri < rs1->num_entries; ri++)
>> +             if (!range_in_ranges(&rs1->range[ri], rs2))
>> +                     return false;
>> +
>> +     return true;
>> +}
>
> This is not a scalable implementation. The next step would be to
> do a binary search, hash table or something else here?
> I think the root of the problem is in hard coded checks
> that quickly become inefficient and inadequate as applicability
> of the feature grows.
> We already have BPF that can suite this purpose much better
> without bloating the kernel code with parsing and matching logic.
>
This is not a per packet cost and mostly paid once when you are
establishing the channel. Having said that I do agree with you that
the check can be optimized by doing something like a hash-table etc.

This is an ABI extension to the cgroup so having the code where it is
now makes sense to maintain that ABI compatibility. Implementing using
something else (e.g. eBPF etc.) would just mean that this code needs
to be just moved at a different place than the current place so the
net gain would be nothing.

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH 3/5] net: add udp limit to net cgroup
  2016-08-13  4:35       ` Alexei Starovoitov
@ 2016-08-15 19:16           ` Mahesh Bandewar
  0 siblings, 0 replies; 25+ messages in thread
From: Mahesh Bandewar @ 2016-08-15 19:16 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: Anoop Naravaram, corbet, tj, lizefan, hannes, David Miller,
	kuznet, jmorris, yoshfuji, kaber, linux-doc, cgroups,
	linux-netdev, Eric Dumazet, Wei Wang, tom

On Fri, Aug 12, 2016 at 9:35 PM, Alexei Starovoitov
<alexei.starovoitov@gmail.com> wrote:
[...]
>> * 'net.udp_usage': Reading this file gives the number of udp ports used by
>> processes in this cgroup and all its descendants.
>> * 'net.udp_limit': Writing this file sets the total number of udp ports
>> that can be used by processes in this cgroup and all
>> its descendants. This file can also be read.
>> * 'net.udp_maxusage': Reading this file gives the highest value of
>> net.udp_usage that has been seen for this cgroup.
>> * 'net.udp_failcnt': Reading this file gives the number of times a
>> process in this cgroup or one of its descendants has attempted to acquire
>> a udp port but failed because the limit of this cgroup was reached.
>> * 'net.udp_underflowcnt': Reading this file gives the number of times a
>> process in this cgroup or one of its descendants released a udp port when
>> the usage value of this cgroup was 0.
>
> I have similar concern here. I don't think we should bloat the kernel by
> trying implement every possible port restriction combination and count.
> For the same reasons we don't do all possible tcp stats that research
> community finds useful. This additional code is a maintenance headache.
>
At this moment if a rampant process decides to just open a datagram
socket and just binds (does not listen) and does this for all
available ports, you are guaranteed to have machine becoming useless
for some (totally unrelated) legitimate process trying to do useful
things.

>From kernel we can't control such user-space behavior, but at least
restrict the damage to "a specific set of processes". This code
provides that feature to restrict the damage.

Additional code is always a maintenance headache but that doesn't stop
us from adding new code, isn't it? :)

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH 3/5] net: add udp limit to net cgroup
@ 2016-08-15 19:16           ` Mahesh Bandewar
  0 siblings, 0 replies; 25+ messages in thread
From: Mahesh Bandewar @ 2016-08-15 19:16 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: Anoop Naravaram, corbet, tj, lizefan, hannes, David Miller,
	kuznet, jmorris, yoshfuji, kaber, linux-doc, cgroups,
	linux-netdev, Eric Dumazet, Wei Wang, tom

On Fri, Aug 12, 2016 at 9:35 PM, Alexei Starovoitov
<alexei.starovoitov@gmail.com> wrote:
[...]
>> * 'net.udp_usage': Reading this file gives the number of udp ports used by
>> processes in this cgroup and all its descendants.
>> * 'net.udp_limit': Writing this file sets the total number of udp ports
>> that can be used by processes in this cgroup and all
>> its descendants. This file can also be read.
>> * 'net.udp_maxusage': Reading this file gives the highest value of
>> net.udp_usage that has been seen for this cgroup.
>> * 'net.udp_failcnt': Reading this file gives the number of times a
>> process in this cgroup or one of its descendants has attempted to acquire
>> a udp port but failed because the limit of this cgroup was reached.
>> * 'net.udp_underflowcnt': Reading this file gives the number of times a
>> process in this cgroup or one of its descendants released a udp port when
>> the usage value of this cgroup was 0.
>
> I have similar concern here. I don't think we should bloat the kernel by
> trying implement every possible port restriction combination and count.
> For the same reasons we don't do all possible tcp stats that research
> community finds useful. This additional code is a maintenance headache.
>
At this moment if a rampant process decides to just open a datagram
socket and just binds (does not listen) and does this for all
available ports, you are guaranteed to have machine becoming useless
for some (totally unrelated) legitimate process trying to do useful
things.

From kernel we can't control such user-space behavior, but at least
restrict the damage to "a specific set of processes". This code
provides that feature to restrict the damage.

Additional code is always a maintenance headache but that doesn't stop
us from adding new code, isn't it? :)

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH 2/5] net: add bind/listen ranges to net cgroup
  2016-08-15 19:05     ` Mahesh Bandewar
@ 2016-08-17  3:02       ` Alexei Starovoitov
  0 siblings, 0 replies; 25+ messages in thread
From: Alexei Starovoitov @ 2016-08-17  3:02 UTC (permalink / raw)
  To: Mahesh Bandewar
  Cc: Anoop Naravaram, corbet, tj, lizefan, hannes, David Miller,
	kuznet, jmorris, yoshfuji, kaber, linux-doc, cgroups,
	linux-netdev, Eric Dumazet, Wei Wang, tom, Daniel Mack

On Mon, Aug 15, 2016 at 12:05:40PM -0700, Mahesh Bandewar wrote:
> On Fri, Aug 12, 2016 at 9:29 PM, Alexei Starovoitov
> <alexei.starovoitov@gmail.com> wrote:
> [...]
> >> +static bool range_in_ranges(struct net_range *r, struct net_ranges *rs)
> >> +{
> >> +     int ri;
> >> +
> >> +     for (ri = 0; ri < rs->num_entries; ri++)
> >> +             if (r->min_value >= rs->range[ri].min_value &&
> >> +                 r->max_value <= rs->range[ri].max_value)
> >> +                     return true;
> >> +
> >> +     return false;
> >> +}
> >> +
> >> +/* Returns true if all the ranges in rs1 are subsets of at least one of the
> >> + * ranges in rs2, ans returns false otherwise.
> >> + */
> >> +static bool ranges_in_ranges(struct net_ranges *rs1, struct net_ranges *rs2)
> >> +{
> >> +     int ri;
> >> +
> >> +     for (ri = 0; ri < rs1->num_entries; ri++)
> >> +             if (!range_in_ranges(&rs1->range[ri], rs2))
> >> +                     return false;
> >> +
> >> +     return true;
> >> +}
> >
> > This is not a scalable implementation. The next step would be to
> > do a binary search, hash table or something else here?
> > I think the root of the problem is in hard coded checks
> > that quickly become inefficient and inadequate as applicability
> > of the feature grows.
> > We already have BPF that can suite this purpose much better
> > without bloating the kernel code with parsing and matching logic.
> >
> This is not a per packet cost and mostly paid once when you are
> establishing the channel. Having said that I do agree with you that
> the check can be optimized by doing something like a hash-table etc.
> 
> This is an ABI extension to the cgroup so having the code where it is
> now makes sense to maintain that ABI compatibility. Implementing using
> something else (e.g. eBPF etc.) would just mean that this code needs
> to be just moved at a different place than the current place so the
> net gain would be nothing.

In cgroup+bpf approach we don't need to extend abi every time new 
port range style or new statistics needs to be added. These hundreds
lines of hard coded checks can be avoided.
Daniel Mack is working on cgroup+bpf patches that will allow
attaching bpf to a cgroup. It will be new networking controller.
The same approach should be used here. I'm proposing to do
the same hooks as net_cgroup_bind_allowed() in inet_bind() and
net_cgroup_listen_allowed() in inet_csk_listen_start(), but the
code that does the check should be a bpf program that decides whether
port permitted or not.
New bpf program type can be trivially introduced that takes single
'portno' argument. The user space will attach it to a cgroup and then
free to do whatever port filtering logic and statistics by changing
the program without ever touching abi. Including hash tables that
are already part of bpf.

Daniel's slightly different approach to cgroup+bpf can do the same
in more generic way by hooking next to sk_filter() and parsing
the packet (it will work for any protocol), but it has
per-packet cost which I understand you want to avoid here by
adding hooks to inet_bind() and inet_csk_listen_start()
which makes sense.


^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH 0/5] Networking cgroup controller
       [not found] ` <1470876798-4024-1-git-send-email-anaravaram-hpIqsD4AKlfQT0dZR+AlfA@public.gmane.org>
  2016-08-11  0:53   ` [PATCH 3/5] net: add udp limit " Anoop Naravaram
@ 2016-08-23  7:37   ` Parav Pandit
  1 sibling, 0 replies; 25+ messages in thread
From: Parav Pandit @ 2016-08-23  7:37 UTC (permalink / raw)
  To: Anoop Naravaram
  Cc: Jonathan Corbet, Tejun Heo, lizefan-hv44wF8Li93QT0dZR+AlfA,
	Johannes Weiner, davem-fT/PcQaiUtIeIZ0/mPfg9Q,
	kuznet-v/Mj1YrvjDBInbfyfbPRSQ, jmorris-gx6/JNMH7DfYtjvyW6yDsg,
	yoshfuji-VfPWfsRibaP+Ru+s062T9g, kaber-dcUjhNyLwpNeoWH0uzbU5w,
	linux-doc-u79uwXL29TY76Z2rM5mHXA, cgroups-u79uwXL29TY76Z2rM5mHXA,
	netdev-u79uwXL29TY76Z2rM5mHXA, edumazet-hpIqsD4AKlfQT0dZR+AlfA,
	maheshb-hpIqsD4AKlfQT0dZR+AlfA, weiwan-hpIqsD4AKlfQT0dZR+AlfA,
	tom-BjP2VixgY4xUbtYUoyoikg

Hi Anoop,


On Thu, Aug 11, 2016 at 6:23 AM, Anoop Naravaram <anaravaram-hpIqsD4AKlfQT0dZR+AlfA@public.gmane.org> wrote:
> This patchset introduces a cgroup controller for the networking subsystem as a
> whole. As of now, this controller will be used for:
>
> * Limiting the specific ports that a process in a cgroup is allowed to bind
>   to or listen on. For example, you can say that all the processes in a
>   cgroup can only bind to ports 1000-2000, and listen on ports 1000-1100, which
>   guarantees that the remaining ports will be available for other processes.
>
> * Restricting which DSCP values processes can use with their sockets. For
>   example, you can say that all the processes in a cgroup can only send
>   packets with a DSCP tag between 48 and 63 (corresponding to TOS values of
>   192 to 255).
>
> * Limiting the total number of udp ports that can be used by a process in a
>   cgroup. For example, you can say that all the processes in one cgroup are
>   allowed to use a total of up to 100 udp ports. Since the total number of udp
>   ports that can be used by all processes is limited, this is useful for
>   rationing out the ports to different process groups.
>
> In the future, more networking-related properties may be added to this
> controller.
>

Since network namespace allows process in each namespace to listen to
same port range in their own namespace.
What is the rationale or use case to limit certain process to view
certain port range?

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH 0/5] Networking cgroup controller
  2016-08-11  0:53 [PATCH 0/5] Networking cgroup controller Anoop Naravaram
                   ` (4 preceding siblings ...)
  2016-08-11  0:53 ` [PATCH 5/5] net: add test for " Anoop Naravaram
@ 2016-08-23  8:49 ` Parav Pandit
  2016-08-25  0:28   ` Mahesh Bandewar (महेश बंडेवार)
  2016-08-24 21:03 ` Tejun Heo
  6 siblings, 1 reply; 25+ messages in thread
From: Parav Pandit @ 2016-08-23  8:49 UTC (permalink / raw)
  To: Anoop Naravaram
  Cc: Jonathan Corbet, Tejun Heo, lizefan, Johannes Weiner, davem,
	kuznet, jmorris, yoshfuji, kaber, linux-doc, cgroups, netdev,
	edumazet, maheshb, weiwan, tom

Hi Anoop,

Regardless of usecase, I think this functionality is best handled as
LSM functionality instead of cgroup.

Tasks which are proposed in this patch are related to access control checks.
LSM already has required hooks for socket operations such as bind(),
listen() as few small examples.

Refer to security_socket_listen() which invokes LSM specific hooks.
This is invoked in source/net/socket.c as part of listen() system call.
LSM hook callback can check whether a given a process can listen to
requested UDP port or not.

Parav


On Thu, Aug 11, 2016 at 6:23 AM, Anoop Naravaram <anaravaram@google.com> wrote:
> This patchset introduces a cgroup controller for the networking subsystem as a
> whole. As of now, this controller will be used for:
>
> * Limiting the specific ports that a process in a cgroup is allowed to bind
>   to or listen on. For example, you can say that all the processes in a
>   cgroup can only bind to ports 1000-2000, and listen on ports 1000-1100, which
>   guarantees that the remaining ports will be available for other processes.
>
> * Restricting which DSCP values processes can use with their sockets. For
>   example, you can say that all the processes in a cgroup can only send
>   packets with a DSCP tag between 48 and 63 (corresponding to TOS values of
>   192 to 255).
>
> * Limiting the total number of udp ports that can be used by a process in a
>   cgroup. For example, you can say that all the processes in one cgroup are
>   allowed to use a total of up to 100 udp ports. Since the total number of udp
>   ports that can be used by all processes is limited, this is useful for
>   rationing out the ports to different process groups.
>
> In the future, more networking-related properties may be added to this
> controller.
>
> Anoop Naravaram (5):
>   net: create the networking cgroup controller
>   net: add bind/listen ranges to net cgroup
>   net: add udp limit to net cgroup
>   net: add dscp ranges to net cgroup
>   net: add test for net cgroup
>
>  Documentation/cgroup-v1/net.txt   |  95 +++++
>  include/linux/cgroup_subsys.h     |   4 +
>  include/net/net_cgroup.h          | 103 ++++++
>  net/Kconfig                       |  10 +
>  net/core/Makefile                 |   1 +
>  net/core/net_cgroup.c             | 706 ++++++++++++++++++++++++++++++++++++++
>  net/ipv4/af_inet.c                |   8 +
>  net/ipv4/inet_connection_sock.c   |   7 +
>  net/ipv4/ip_sockglue.c            |  13 +
>  net/ipv4/udp.c                    |   8 +
>  net/ipv6/af_inet6.c               |   7 +
>  net/ipv6/datagram.c               |   9 +
>  net/ipv6/ipv6_sockglue.c          |   8 +
>  scripts/cgroup/net_cgroup_test.py | 359 +++++++++++++++++++
>  14 files changed, 1338 insertions(+)
>  create mode 100644 Documentation/cgroup-v1/net.txt
>  create mode 100644 include/net/net_cgroup.h
>  create mode 100644 net/core/net_cgroup.c
>  create mode 100755 scripts/cgroup/net_cgroup_test.py
>
> --
> 2.8.0.rc3.226.g39d4020
>
> --
> To unsubscribe from this list: send the line "unsubscribe cgroups" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH 0/5] Networking cgroup controller
  2016-08-11  0:53 [PATCH 0/5] Networking cgroup controller Anoop Naravaram
                   ` (5 preceding siblings ...)
  2016-08-23  8:49 ` [PATCH 0/5] Networking cgroup controller Parav Pandit
@ 2016-08-24 21:03 ` Tejun Heo
  2016-08-25 15:54   ` Mahesh Bandewar (महेश बंडेवार)
  6 siblings, 1 reply; 25+ messages in thread
From: Tejun Heo @ 2016-08-24 21:03 UTC (permalink / raw)
  To: Anoop Naravaram
  Cc: corbet, lizefan, hannes, davem, kuznet, jmorris, yoshfuji, kaber,
	linux-doc, cgroups, netdev, edumazet, maheshb, weiwan, tom

Hello, Anoop.

On Wed, Aug 10, 2016 at 05:53:13PM -0700, Anoop Naravaram wrote:
> This patchset introduces a cgroup controller for the networking subsystem as a
> whole. As of now, this controller will be used for:
> 
> * Limiting the specific ports that a process in a cgroup is allowed to bind
>   to or listen on. For example, you can say that all the processes in a
>   cgroup can only bind to ports 1000-2000, and listen on ports 1000-1100, which
>   guarantees that the remaining ports will be available for other processes.
> 
> * Restricting which DSCP values processes can use with their sockets. For
>   example, you can say that all the processes in a cgroup can only send
>   packets with a DSCP tag between 48 and 63 (corresponding to TOS values of
>   192 to 255).
> 
> * Limiting the total number of udp ports that can be used by a process in a
>   cgroup. For example, you can say that all the processes in one cgroup are
>   allowed to use a total of up to 100 udp ports. Since the total number of udp
>   ports that can be used by all processes is limited, this is useful for
>   rationing out the ports to different process groups.
> 
> In the future, more networking-related properties may be added to this
> controller.

Thanks for working on this; however, I share the sentiment expressed
by others that this looks like too piecemeal an approach.  If there
are no alternatives, we surely should consider this but it at least
*looks* like bpf should be able to cover the same functionalities
without having to revise and extend in-kernel capabilities constantly.

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH 0/5] Networking cgroup controller
  2016-08-23  8:49 ` [PATCH 0/5] Networking cgroup controller Parav Pandit
@ 2016-08-25  0:28   ` Mahesh Bandewar (महेश बंडेवार)
  0 siblings, 0 replies; 25+ messages in thread
From: Mahesh Bandewar (महेश बंडेवार) @ 2016-08-25  0:28 UTC (permalink / raw)
  To: Parav Pandit
  Cc: Anoop Naravaram, Jonathan Corbet, Tejun Heo, lizefan,
	Johannes Weiner, David Miller, kuznet, jmorris, yoshfuji, kaber,
	linux-doc, cgroups, linux-netdev, Eric Dumazet, Wei Wang, tom

On Tue, Aug 23, 2016 at 1:49 AM, Parav Pandit <pandit.parav@gmail.com> wrote:
> Hi Anoop,
>
> Regardless of usecase, I think this functionality is best handled as
> LSM functionality instead of cgroup.
>
I'm not so sure about that. Cgroup APIs are useful and this is just an
extension to it.


> Tasks which are proposed in this patch are related to access control checks.
> LSM already has required hooks for socket operations such as bind(),
> listen() as few small examples.
>
> Refer to security_socket_listen() which invokes LSM specific hooks.
> This is invoked in source/net/socket.c as part of listen() system call.
> LSM hook callback can check whether a given a process can listen to
> requested UDP port or not.
>
This has administrative overhead that is not addressed. The underlying
cgroup infrastructure takes care of it in this (current)
implementation.

> Parav
>
>
[...]

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH 0/5] Networking cgroup controller
  2016-08-24 21:03 ` Tejun Heo
@ 2016-08-25 15:54   ` Mahesh Bandewar (महेश बंडेवार)
       [not found]     ` <CAF2d9jjXxjgJU1VWDERKkv4LRvmhpjpyx7Dx76egQx43aaXtmQ-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
  2016-08-25 18:04     ` Alexei Starovoitov
  0 siblings, 2 replies; 25+ messages in thread
From: Mahesh Bandewar (महेश बंडेवार) @ 2016-08-25 15:54 UTC (permalink / raw)
  To: Tejun Heo
  Cc: corbet, lizefan, hannes, David Miller, kuznet, jmorris, yoshfuji,
	kaber, linux-doc, cgroups, linux-netdev, Eric Dumazet, Wei Wang,
	tom

On Wed, Aug 24, 2016 at 2:03 PM, Tejun Heo <tj@kernel.org> wrote:
> Hello, Anoop.
>
> On Wed, Aug 10, 2016 at 05:53:13PM -0700, Anoop Naravaram wrote:
>> This patchset introduces a cgroup controller for the networking subsystem as a
>> whole. As of now, this controller will be used for:
>>
>> * Limiting the specific ports that a process in a cgroup is allowed to bind
>>   to or listen on. For example, you can say that all the processes in a
>>   cgroup can only bind to ports 1000-2000, and listen on ports 1000-1100, which
>>   guarantees that the remaining ports will be available for other processes.
>>
>> * Restricting which DSCP values processes can use with their sockets. For
>>   example, you can say that all the processes in a cgroup can only send
>>   packets with a DSCP tag between 48 and 63 (corresponding to TOS values of
>>   192 to 255).
>>
>> * Limiting the total number of udp ports that can be used by a process in a
>>   cgroup. For example, you can say that all the processes in one cgroup are
>>   allowed to use a total of up to 100 udp ports. Since the total number of udp
>>   ports that can be used by all processes is limited, this is useful for
>>   rationing out the ports to different process groups.
>>
>> In the future, more networking-related properties may be added to this
>> controller.
>
> Thanks for working on this; however, I share the sentiment expressed
> by others that this looks like too piecemeal an approach.  If there
> are no alternatives, we surely should consider this but it at least
> *looks* like bpf should be able to cover the same functionalities
> without having to revise and extend in-kernel capabilities constantly.
>
My primary concern is the cost that need to be paid to get this functionality.
(a) The suggested alternatives eBPF either can't solve the problem in
the current form or need substantial work to get it done. e.g.
udp-port-limit since there is no notion of "maintaining
counters-per-group-of-processes". This is solved by the cgroup infra.
(b) Also the hooks implemented are mostly with a per packet cost vs.
once when you are establishing the channel. Also not sure if the LSM
approach will allow some privileged user to over-ride the filters
attached and thus override the limits imposed. This is on top of the
administrative costs that currently don't have solution for and you
get it for free with cgroup infra.

In short most of the associated problems are handled by the
cgroup-infra / APIs while all that need separate solution in
alternatives.  Tejun, feels like I'm advocating cgroup approach to you
;)

Thanks,
--mahesh..


> Thanks.
>
> --
> tejun

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH 0/5] Networking cgroup controller
       [not found]     ` <CAF2d9jjXxjgJU1VWDERKkv4LRvmhpjpyx7Dx76egQx43aaXtmQ-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
@ 2016-08-25 16:09       ` Tejun Heo
  2016-08-25 16:11         ` Tejun Heo
  2016-08-25 18:53         ` Mahesh Bandewar (महेश बंडेवार)
  0 siblings, 2 replies; 25+ messages in thread
From: Tejun Heo @ 2016-08-25 16:09 UTC (permalink / raw)
  To: Mahesh Bandewar (महेश
	बंडेवार)
  Cc: corbet-T1hC0tSOHrs, lizefan-hv44wF8Li93QT0dZR+AlfA,
	hannes-druUgvl0LCNAfugRpC6u6w, David Miller,
	kuznet-v/Mj1YrvjDBInbfyfbPRSQ, jmorris-gx6/JNMH7DfYtjvyW6yDsg,
	yoshfuji-VfPWfsRibaP+Ru+s062T9g, kaber-dcUjhNyLwpNeoWH0uzbU5w,
	linux-doc-u79uwXL29TY76Z2rM5mHXA, cgroups-u79uwXL29TY76Z2rM5mHXA,
	linux-netdev, Eric Dumazet, Wei Wang, tom-BjP2VixgY4xUbtYUoyoikg

Hello, Mahesh.

On Thu, Aug 25, 2016 at 08:54:19AM -0700, Mahesh Bandewar (महेश बंडेवार) wrote:
> In short most of the associated problems are handled by the
> cgroup-infra / APIs while all that need separate solution in
> alternatives.  Tejun, feels like I'm advocating cgroup approach to you
> ;)

My concern here is that the proposed fixed mechanism isn't gonna be
enough.  Port range matching wouldn't scale, so we'd need some hashmap
style thing which may be too expensive for simple matches so either we
do something adaptive or have different interfaces for the two and so
on.  IOW, I think this approach is likely to replicate what iptables
have been doing with its extensions.  I don't doubt that it is one of
the workable approaches but hardly an attractive one especially at
this point.

ebpf approach does have its shortcomings for sure but mending them
seems a lot more manageable and future-proof than going with fixed but
constantly expanding set of operations.  e.g. We can add per-cgroup
bpf programs which are called only on socket creation or other major
events, or just let bpf programs which get called on bind(2), and add
some per-cgroup state variables which are maintained by cgroup code
which can be used from these bpf programs.

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH 0/5] Networking cgroup controller
  2016-08-25 16:09       ` Tejun Heo
@ 2016-08-25 16:11         ` Tejun Heo
  2016-08-25 18:53         ` Mahesh Bandewar (महेश बंडेवार)
  1 sibling, 0 replies; 25+ messages in thread
From: Tejun Heo @ 2016-08-25 16:11 UTC (permalink / raw)
  To: Mahesh Bandewar (महेश
	बंडेवार)
  Cc: corbet, lizefan, hannes, David Miller, kuznet, jmorris, yoshfuji,
	kaber, linux-doc, cgroups, linux-netdev, Eric Dumazet, Wei Wang,
	tom

On Thu, Aug 25, 2016 at 12:09:20PM -0400, Tejun Heo wrote:
> ebpf approach does have its shortcomings for sure but mending them
> seems a lot more manageable and future-proof than going with fixed but
> constantly expanding set of operations.  e.g. We can add per-cgroup
> bpf programs which are called only on socket creation or other major
> events, or just let bpf programs which get called on bind(2), and add
        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
	please ignore this part. half-assed edit.

-- 
tejun

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH 0/5] Networking cgroup controller
  2016-08-25 15:54   ` Mahesh Bandewar (महेश बंडेवार)
       [not found]     ` <CAF2d9jjXxjgJU1VWDERKkv4LRvmhpjpyx7Dx76egQx43aaXtmQ-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
@ 2016-08-25 18:04     ` Alexei Starovoitov
  2016-08-25 18:56       ` Mahesh Bandewar (महेश बंडेवार)
  1 sibling, 1 reply; 25+ messages in thread
From: Alexei Starovoitov @ 2016-08-25 18:04 UTC (permalink / raw)
  To: Mahesh Bandewar (महेश
	बंडेवार)
  Cc: Tejun Heo, corbet, lizefan, hannes, David Miller, kuznet,
	jmorris, yoshfuji, kaber, linux-doc, cgroups, linux-netdev,
	Eric Dumazet, Wei Wang, tom

On Thu, Aug 25, 2016 at 08:54:19AM -0700, Mahesh Bandewar (महेश बंडेवार) wrote:
> On Wed, Aug 24, 2016 at 2:03 PM, Tejun Heo <tj@kernel.org> wrote:
> > Hello, Anoop.
> >
> > On Wed, Aug 10, 2016 at 05:53:13PM -0700, Anoop Naravaram wrote:
> >> This patchset introduces a cgroup controller for the networking subsystem as a
> >> whole. As of now, this controller will be used for:
> >>
> >> * Limiting the specific ports that a process in a cgroup is allowed to bind
> >>   to or listen on. For example, you can say that all the processes in a
> >>   cgroup can only bind to ports 1000-2000, and listen on ports 1000-1100, which
> >>   guarantees that the remaining ports will be available for other processes.
> >>
> >> * Restricting which DSCP values processes can use with their sockets. For
> >>   example, you can say that all the processes in a cgroup can only send
> >>   packets with a DSCP tag between 48 and 63 (corresponding to TOS values of
> >>   192 to 255).
> >>
> >> * Limiting the total number of udp ports that can be used by a process in a
> >>   cgroup. For example, you can say that all the processes in one cgroup are
> >>   allowed to use a total of up to 100 udp ports. Since the total number of udp
> >>   ports that can be used by all processes is limited, this is useful for
> >>   rationing out the ports to different process groups.
> >>
> >> In the future, more networking-related properties may be added to this
> >> controller.
> >
> > Thanks for working on this; however, I share the sentiment expressed
> > by others that this looks like too piecemeal an approach.  If there
> > are no alternatives, we surely should consider this but it at least
> > *looks* like bpf should be able to cover the same functionalities
> > without having to revise and extend in-kernel capabilities constantly.
> >
> My primary concern is the cost that need to be paid to get this functionality.
> (a) The suggested alternatives eBPF either can't solve the problem in
> the current form or need substantial work to get it done. e.g.
> udp-port-limit since there is no notion of "maintaining
> counters-per-group-of-processes". This is solved by the cgroup infra.

what is specifically missing?
there are several ways to do counters in bpf and as soon as bpf program
is attachable to a cgroup, all of these counter features come for free.
Counting bytes or packets or port bind failures or anything else per cgroup
with bpf is trivial. No extra code is needed.

> (b) Also the hooks implemented are mostly with a per packet cost vs.
> once when you are establishing the channel. Also not sure if the LSM
> approach will allow some privileged user to over-ride the filters
> attached and thus override the limits imposed. This is on top of the
> administrative costs that currently don't have solution for and you
> get it for free with cgroup infra.
> 
> In short most of the associated problems are handled by the
> cgroup-infra / APIs while all that need separate solution in
> alternatives.  Tejun, feels like I'm advocating cgroup approach to you
> ;)
> 
> Thanks,
> --mahesh..
> 
> 
> > Thanks.
> >
> > --
> > tejun

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH 0/5] Networking cgroup controller
  2016-08-25 16:09       ` Tejun Heo
  2016-08-25 16:11         ` Tejun Heo
@ 2016-08-25 18:53         ` Mahesh Bandewar (महेश बंडेवार)
  1 sibling, 0 replies; 25+ messages in thread
From: Mahesh Bandewar (महेश बंडेवार) @ 2016-08-25 18:53 UTC (permalink / raw)
  To: Tejun Heo
  Cc: corbet, lizefan, hannes, David Miller, kuznet, jmorris, yoshfuji,
	kaber, linux-doc, cgroups, linux-netdev, Eric Dumazet, Wei Wang,
	tom

On Thu, Aug 25, 2016 at 9:09 AM, Tejun Heo <tj@kernel.org> wrote:
> Hello, Mahesh.
>
> On Thu, Aug 25, 2016 at 08:54:19AM -0700, Mahesh Bandewar (महेश बंडेवार) wrote:
>> In short most of the associated problems are handled by the
>> cgroup-infra / APIs while all that need separate solution in
>> alternatives.  Tejun, feels like I'm advocating cgroup approach to you
>> ;)
>
> My concern here is that the proposed fixed mechanism isn't gonna be
> enough.  Port range matching wouldn't scale, so we'd need some hashmap
> style thing which may be too expensive for simple matches so either we
> do something adaptive or have different interfaces for the two and so
> on.  IOW, I think this approach is likely to replicate what iptables
> have been doing with its extensions.  I don't doubt that it is one of
> the workable approaches but hardly an attractive one especially at
> this point.
>
> ebpf approach does have its shortcomings for sure but mending them
> seems a lot more manageable and future-proof than going with fixed but
> constantly expanding set of operations.  e.g. We can add per-cgroup
> bpf programs which are called only on socket creation or other major
> events, or just let bpf programs which get called on bind(2), and add
> some per-cgroup state variables which are maintained by cgroup code
> which can be used from these bpf programs.
>
Well, I haven't seen any of these yet (please point me the right place
if I missed) Especially the hooks that allows users to add per-cgroup
bpf programs that can be used in control-path (I think Daniel's recent
patches allow in data-path).

> Thanks.
>
> --
> tejun

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH 0/5] Networking cgroup controller
  2016-08-25 18:04     ` Alexei Starovoitov
@ 2016-08-25 18:56       ` Mahesh Bandewar (महेश बंडेवार)
  2016-08-25 21:29         ` Alexei Starovoitov
  0 siblings, 1 reply; 25+ messages in thread
From: Mahesh Bandewar (महेश बंडेवार) @ 2016-08-25 18:56 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: Tejun Heo, corbet, lizefan, hannes, David Miller, kuznet,
	jmorris, yoshfuji, kaber, linux-doc, cgroups, linux-netdev,
	Eric Dumazet, Wei Wang, tom

On Thu, Aug 25, 2016 at 11:04 AM, Alexei Starovoitov
<alexei.starovoitov@gmail.com> wrote:
> On Thu, Aug 25, 2016 at 08:54:19AM -0700, Mahesh Bandewar (महेश बंडेवार) wrote:
>> On Wed, Aug 24, 2016 at 2:03 PM, Tejun Heo <tj@kernel.org> wrote:
>> > Hello, Anoop.
>> >
>> > On Wed, Aug 10, 2016 at 05:53:13PM -0700, Anoop Naravaram wrote:
>> >> This patchset introduces a cgroup controller for the networking subsystem as a
>> >> whole. As of now, this controller will be used for:
>> >>
>> >> * Limiting the specific ports that a process in a cgroup is allowed to bind
>> >>   to or listen on. For example, you can say that all the processes in a
>> >>   cgroup can only bind to ports 1000-2000, and listen on ports 1000-1100, which
>> >>   guarantees that the remaining ports will be available for other processes.
>> >>
>> >> * Restricting which DSCP values processes can use with their sockets. For
>> >>   example, you can say that all the processes in a cgroup can only send
>> >>   packets with a DSCP tag between 48 and 63 (corresponding to TOS values of
>> >>   192 to 255).
>> >>
>> >> * Limiting the total number of udp ports that can be used by a process in a
>> >>   cgroup. For example, you can say that all the processes in one cgroup are
>> >>   allowed to use a total of up to 100 udp ports. Since the total number of udp
>> >>   ports that can be used by all processes is limited, this is useful for
>> >>   rationing out the ports to different process groups.
>> >>
>> >> In the future, more networking-related properties may be added to this
>> >> controller.
>> >
>> > Thanks for working on this; however, I share the sentiment expressed
>> > by others that this looks like too piecemeal an approach.  If there
>> > are no alternatives, we surely should consider this but it at least
>> > *looks* like bpf should be able to cover the same functionalities
>> > without having to revise and extend in-kernel capabilities constantly.
>> >
>> My primary concern is the cost that need to be paid to get this functionality.
>> (a) The suggested alternatives eBPF either can't solve the problem in
>> the current form or need substantial work to get it done. e.g.
>> udp-port-limit since there is no notion of "maintaining
>> counters-per-group-of-processes". This is solved by the cgroup infra.
>
> what is specifically missing?
> there are several ways to do counters in bpf and as soon as bpf program
> is attachable to a cgroup, all of these counter features come for free.
> Counting bytes or packets or port bind failures or anything else per cgroup
> with bpf is trivial. No extra code is needed.
>
Alexei, I was referring to the association of eBPF to the cgroup. Lack
of it makes anyone wants to use it invest into additional
administrative infra that you are currently getting with cgroup-infra.

>> (b) Also the hooks implemented are mostly with a per packet cost vs.
>> once when you are establishing the channel. Also not sure if the LSM
>> approach will allow some privileged user to over-ride the filters
>> attached and thus override the limits imposed. This is on top of the
>> administrative costs that currently don't have solution for and you
>> get it for free with cgroup infra.
>>
>> In short most of the associated problems are handled by the
>> cgroup-infra / APIs while all that need separate solution in
>> alternatives.  Tejun, feels like I'm advocating cgroup approach to you
>> ;)
>>
>> Thanks,
>> --mahesh..
>>
>>
>> > Thanks.
>> >
>> > --
>> > tejun

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH 0/5] Networking cgroup controller
  2016-08-25 18:56       ` Mahesh Bandewar (महेश बंडेवार)
@ 2016-08-25 21:29         ` Alexei Starovoitov
  0 siblings, 0 replies; 25+ messages in thread
From: Alexei Starovoitov @ 2016-08-25 21:29 UTC (permalink / raw)
  To: Mahesh Bandewar (महेश
	बंडेवार)
  Cc: Tejun Heo, corbet, lizefan, hannes, David Miller, kuznet,
	jmorris, yoshfuji, kaber, linux-doc, cgroups, linux-netdev,
	Eric Dumazet, Wei Wang, tom

On Thu, Aug 25, 2016 at 11:56:27AM -0700, Mahesh Bandewar (महेश बंडेवार) wrote:
> On Thu, Aug 25, 2016 at 11:04 AM, Alexei Starovoitov
> <alexei.starovoitov@gmail.com> wrote:
> > On Thu, Aug 25, 2016 at 08:54:19AM -0700, Mahesh Bandewar (महेश बंडेवार) wrote:
> >> On Wed, Aug 24, 2016 at 2:03 PM, Tejun Heo <tj@kernel.org> wrote:
> >> > Hello, Anoop.
> >> >
> >> > On Wed, Aug 10, 2016 at 05:53:13PM -0700, Anoop Naravaram wrote:
> >> >> This patchset introduces a cgroup controller for the networking subsystem as a
> >> >> whole. As of now, this controller will be used for:
> >> >>
> >> >> * Limiting the specific ports that a process in a cgroup is allowed to bind
> >> >>   to or listen on. For example, you can say that all the processes in a
> >> >>   cgroup can only bind to ports 1000-2000, and listen on ports 1000-1100, which
> >> >>   guarantees that the remaining ports will be available for other processes.
> >> >>
> >> >> * Restricting which DSCP values processes can use with their sockets. For
> >> >>   example, you can say that all the processes in a cgroup can only send
> >> >>   packets with a DSCP tag between 48 and 63 (corresponding to TOS values of
> >> >>   192 to 255).
> >> >>
> >> >> * Limiting the total number of udp ports that can be used by a process in a
> >> >>   cgroup. For example, you can say that all the processes in one cgroup are
> >> >>   allowed to use a total of up to 100 udp ports. Since the total number of udp
> >> >>   ports that can be used by all processes is limited, this is useful for
> >> >>   rationing out the ports to different process groups.
> >> >>
> >> >> In the future, more networking-related properties may be added to this
> >> >> controller.
> >> >
> >> > Thanks for working on this; however, I share the sentiment expressed
> >> > by others that this looks like too piecemeal an approach.  If there
> >> > are no alternatives, we surely should consider this but it at least
> >> > *looks* like bpf should be able to cover the same functionalities
> >> > without having to revise and extend in-kernel capabilities constantly.
> >> >
> >> My primary concern is the cost that need to be paid to get this functionality.
> >> (a) The suggested alternatives eBPF either can't solve the problem in
> >> the current form or need substantial work to get it done. e.g.
> >> udp-port-limit since there is no notion of "maintaining
> >> counters-per-group-of-processes". This is solved by the cgroup infra.
> >
> > what is specifically missing?
> > there are several ways to do counters in bpf and as soon as bpf program
> > is attachable to a cgroup, all of these counter features come for free.
> > Counting bytes or packets or port bind failures or anything else per cgroup
> > with bpf is trivial. No extra code is needed.
> >
> Alexei, I was referring to the association of eBPF to the cgroup. Lack
> of it makes anyone wants to use it invest into additional
> administrative infra that you are currently getting with cgroup-infra.

Please look at Daniel's patches. They have been circulating in different
forms for quite some time now. Your bind port filter use case can be
easily added on top. Then the end result is additional ten lines of code
instead of hundreds.
Another alternative is to go cgroup+lsm+bpf route that Sargun and Mickael
are proposing. I think it will also work for your use case.
The goal we all should have is to have common infra that solves the
largest number of use cases.


^ permalink raw reply	[flat|nested] 25+ messages in thread

end of thread, other threads:[~2016-08-25 21:29 UTC | newest]

Thread overview: 25+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2016-08-11  0:53 [PATCH 0/5] Networking cgroup controller Anoop Naravaram
2016-08-11  0:53 ` [PATCH 1/5] net: create the networking " Anoop Naravaram
2016-08-11  0:53 ` [PATCH 2/5] net: add bind/listen ranges to net cgroup Anoop Naravaram
2016-08-11  0:53   ` Anoop Naravaram
2016-08-13  4:29   ` Alexei Starovoitov
2016-08-15 19:05     ` Mahesh Bandewar
2016-08-17  3:02       ` Alexei Starovoitov
     [not found] ` <1470876798-4024-1-git-send-email-anaravaram-hpIqsD4AKlfQT0dZR+AlfA@public.gmane.org>
2016-08-11  0:53   ` [PATCH 3/5] net: add udp limit " Anoop Naravaram
     [not found]     ` <1470876798-4024-4-git-send-email-anaravaram-hpIqsD4AKlfQT0dZR+AlfA@public.gmane.org>
2016-08-13  4:35       ` Alexei Starovoitov
2016-08-15 19:16         ` Mahesh Bandewar
2016-08-15 19:16           ` Mahesh Bandewar
2016-08-23  7:37   ` [PATCH 0/5] Networking cgroup controller Parav Pandit
2016-08-11  0:53 ` [PATCH 4/5] net: add dscp ranges to net cgroup Anoop Naravaram
2016-08-11  0:53 ` [PATCH 5/5] net: add test for " Anoop Naravaram
2016-08-13  4:38   ` Alexei Starovoitov
2016-08-23  8:49 ` [PATCH 0/5] Networking cgroup controller Parav Pandit
2016-08-25  0:28   ` Mahesh Bandewar (महेश बंडेवार)
2016-08-24 21:03 ` Tejun Heo
2016-08-25 15:54   ` Mahesh Bandewar (महेश बंडेवार)
     [not found]     ` <CAF2d9jjXxjgJU1VWDERKkv4LRvmhpjpyx7Dx76egQx43aaXtmQ-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
2016-08-25 16:09       ` Tejun Heo
2016-08-25 16:11         ` Tejun Heo
2016-08-25 18:53         ` Mahesh Bandewar (महेश बंडेवार)
2016-08-25 18:04     ` Alexei Starovoitov
2016-08-25 18:56       ` Mahesh Bandewar (महेश बंडेवार)
2016-08-25 21:29         ` Alexei Starovoitov

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.