[PATCH net-next 0/4] net: allow setting congctl via routing table

All of lore.kernel.org
 help / color / mirror / Atom feed

* [PATCH net-next 0/4] net: allow setting congctl via routing table
@ 2014-12-05 15:24 Daniel Borkmann
  2014-12-05 15:24 ` [PATCH net-next 1/4] net: tcp: refactor reinitialization of congestion control Daniel Borkmann
                   ` (4 more replies)
  0 siblings, 5 replies; 9+ messages in thread
From: Daniel Borkmann @ 2014-12-05 15:24 UTC (permalink / raw)
  To: davem; +Cc: hannes, fw, netdev

This is the second part of our work and allows for setting the congestion
control algorithm via routing table. For details, please see individual
patches.

Joint work with Florian Westphal, suggested by Hannes Frederic Sowa.

Thanks!

Daniel Borkmann (4):
  net: tcp: refactor reinitialization of congestion control
  net: tcp: add key management to congestion control
  net: tcp: add RTAX_CC_ALGO fib handling
  net: tcp: add per route congestion control

 include/net/inet_connection_sock.h |  3 +-
 include/net/tcp.h                  | 22 +++++++++-
 include/uapi/linux/rtnetlink.h     |  2 +
 net/core/rtnetlink.c               | 15 ++++++-
 net/decnet/dn_fib.c                |  3 +-
 net/decnet/dn_table.c              |  3 +-
 net/ipv4/fib_semantics.c           | 14 +++++-
 net/ipv4/tcp_cong.c                | 87 ++++++++++++++++++++++++++++++++------
 net/ipv4/tcp_ipv4.c                |  2 +
 net/ipv4/tcp_minisocks.c           | 30 +++++++++++--
 net/ipv4/tcp_output.c              | 21 +++++++++
 net/ipv6/ip6_fib.c                 | 15 ++++++-
 net/ipv6/route.c                   |  3 +-
 net/ipv6/tcp_ipv6.c                |  2 +
 14 files changed, 196 insertions(+), 26 deletions(-)

-- 
1.7.11.7

^ permalink raw reply	[flat|nested] 9+ messages in thread

* [PATCH net-next 1/4] net: tcp: refactor reinitialization of congestion control
  2014-12-05 15:24 [PATCH net-next 0/4] net: allow setting congctl via routing table Daniel Borkmann
@ 2014-12-05 15:24 ` Daniel Borkmann
  2014-12-05 15:24 ` [PATCH net-next 2/4] net: tcp: add key management to " Daniel Borkmann
                   ` (3 subsequent siblings)
  4 siblings, 0 replies; 9+ messages in thread
From: Daniel Borkmann @ 2014-12-05 15:24 UTC (permalink / raw)
  To: davem; +Cc: hannes, fw, netdev

We can just move this to an extra function and make the code
a bit more readable, no functional change.

Joint work with Florian Westphal.

Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Daniel Borkmann <dborkman@redhat.com>
---
 net/ipv4/tcp_cong.c | 24 ++++++++++++++----------
 1 file changed, 14 insertions(+), 10 deletions(-)

diff --git a/net/ipv4/tcp_cong.c b/net/ipv4/tcp_cong.c
index 27ead0d..38f2f8a 100644
--- a/net/ipv4/tcp_cong.c
+++ b/net/ipv4/tcp_cong.c
@@ -107,6 +107,18 @@ void tcp_init_congestion_control(struct sock *sk)
 		icsk->icsk_ca_ops->init(sk);
 }
 
+static void tcp_reinit_congestion_control(struct sock *sk,
+					  const struct tcp_congestion_ops *ca)
+{
+	struct inet_connection_sock *icsk = inet_csk(sk);
+
+	tcp_cleanup_congestion_control(sk);
+	icsk->icsk_ca_ops = ca;
+
+	if (sk->sk_state != TCP_CLOSE && icsk->icsk_ca_ops->init)
+		icsk->icsk_ca_ops->init(sk);
+}
+
 /* Manage refcounts on socket close. */
 void tcp_cleanup_congestion_control(struct sock *sk)
 {
@@ -262,21 +274,13 @@ int tcp_set_congestion_control(struct sock *sk, const char *name)
 #endif
 	if (!ca)
 		err = -ENOENT;
-
 	else if (!((ca->flags & TCP_CONG_NON_RESTRICTED) ||
 		   ns_capable(sock_net(sk)->user_ns, CAP_NET_ADMIN)))
 		err = -EPERM;
-
 	else if (!try_module_get(ca->owner))
 		err = -EBUSY;
-
-	else {
-		tcp_cleanup_congestion_control(sk);
-		icsk->icsk_ca_ops = ca;
-
-		if (sk->sk_state != TCP_CLOSE && icsk->icsk_ca_ops->init)
-			icsk->icsk_ca_ops->init(sk);
-	}
+	else
+		tcp_reinit_congestion_control(sk, ca);
  out:
 	rcu_read_unlock();
 	return err;
-- 
1.7.11.7

^ permalink raw reply related	[flat|nested] 9+ messages in thread

* [PATCH net-next 2/4] net: tcp: add key management to congestion control
  2014-12-05 15:24 [PATCH net-next 0/4] net: allow setting congctl via routing table Daniel Borkmann
  2014-12-05 15:24 ` [PATCH net-next 1/4] net: tcp: refactor reinitialization of congestion control Daniel Borkmann
@ 2014-12-05 15:24 ` Daniel Borkmann
  2014-12-05 15:24 ` [PATCH net-next 3/4] net: tcp: add RTAX_CC_ALGO fib handling Daniel Borkmann
                   ` (2 subsequent siblings)
  4 siblings, 0 replies; 9+ messages in thread
From: Daniel Borkmann @ 2014-12-05 15:24 UTC (permalink / raw)
  To: davem; +Cc: hannes, fw, netdev

For a per-route congestion control possibility, our aim is to store
a unique u32 key identifier into dst metrics, which can then be
mapped into a tcp_congestion_ops struct. We argue that having a
RTAX key entry is the most simple, generic and easy way to manage,
and also keeps the memory footprint of dst entries lower on 64 bit
than with storing a pointer directly, for example. Having a unique
key id also allows for decoupling actual TCP congestion control
module management from the FIB layer, i.e. we don't have to care
about expensive module refcounting inside the FIB at this point.

We first thought of using an IDR store for the realization, which
takes over dynamic assignment of unused key space and also performs
the key to pointer mapping in RCU. While doing so, we stumbled upon
the issue that due to the nature of dynamic key distribution, it
just so happens, arguably in very rare occasions, that excessive
module loads and unloads can lead to a possible reuse of previously
used key space. Thus, previously stale keys in the dst metric are
now being reassigned to a different congestion control algorithm,
which might lead to unexpected behaviour. One way to resolve this
would have been to walk FIBs on the actually rare occasion of a
module unload and reset the metric keys for each FIB in each netns,
but that's just very costly.

Therefore, we argue a better solution is to reuse the unique
congestion control algorithm name member and map that into u32 key
space through jhash. For that, we split the flags attribute (as it
currently uses 2 bits only anyway) into two u32 attributes, flags
and key, so that we can keep the cacheline boundary of 2 cachelines
on x86_64 and cache the precalculated key at registration time for
the fast path. On average we might expect 2 - 4 modules being loaded
worst case perhaps 15, so a key collision possibility is extremely
low, and guaranteed collision-free on LE/BE for all in-tree modules.
Overall this results in much simpler code, and all without the
overhead of an IDR. Due to the deterministic nature, modules can
now be unloaded, the congestion control algorithm for a specific
but unloaded key will fall back to the default one, and on module
reload time it will switch back to the expected algorithm
transparently.

Joint work with Florian Westphal.

Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Daniel Borkmann <dborkman@redhat.com>
---
 include/net/inet_connection_sock.h |  3 +-
 include/net/tcp.h                  |  9 +++++-
 net/ipv4/tcp_cong.c                | 63 ++++++++++++++++++++++++++++++++++++--
 3 files changed, 71 insertions(+), 4 deletions(-)

diff --git a/include/net/inet_connection_sock.h b/include/net/inet_connection_sock.h
index 848e85c..5976bde 100644
--- a/include/net/inet_connection_sock.h
+++ b/include/net/inet_connection_sock.h
@@ -98,7 +98,8 @@ struct inet_connection_sock {
 	const struct tcp_congestion_ops *icsk_ca_ops;
 	const struct inet_connection_sock_af_ops *icsk_af_ops;
 	unsigned int		  (*icsk_sync_mss)(struct sock *sk, u32 pmtu);
-	__u8			  icsk_ca_state;
+	__u8			  icsk_ca_state:7,
+				  icsk_ca_dst_locked:1;
 	__u8			  icsk_retransmits;
 	__u8			  icsk_pending;
 	__u8			  icsk_backoff;
diff --git a/include/net/tcp.h b/include/net/tcp.h
index f50f29faf..135b70c 100644
--- a/include/net/tcp.h
+++ b/include/net/tcp.h
@@ -787,6 +787,8 @@ enum tcp_ca_ack_event_flags {
 #define TCP_CA_MAX	128
 #define TCP_CA_BUF_MAX	(TCP_CA_NAME_MAX*TCP_CA_MAX)
 
+#define TCP_CA_UNSPEC	0
+
 /* Algorithm can be set on socket without CAP_NET_ADMIN privileges */
 #define TCP_CONG_NON_RESTRICTED 0x1
 /* Requires ECN/ECT set on all packets */
@@ -794,7 +796,8 @@ enum tcp_ca_ack_event_flags {
 
 struct tcp_congestion_ops {
 	struct list_head	list;
-	unsigned long flags;
+	u32 key;
+	u32 flags;
 
 	/* initialize private data (optional) */
 	void (*init)(struct sock *sk);
@@ -841,6 +844,10 @@ u32 tcp_reno_ssthresh(struct sock *sk);
 void tcp_reno_cong_avoid(struct sock *sk, u32 ack, u32 acked);
 extern struct tcp_congestion_ops tcp_reno;
 
+struct tcp_congestion_ops *tcp_ca_find_key(u32 key);
+u32 tcp_ca_get_key_by_name(const char *name);
+char *tcp_ca_get_name_by_key(u32 key, char *buffer);
+
 static inline bool tcp_ca_needs_ecn(const struct sock *sk)
 {
 	const struct inet_connection_sock *icsk = inet_csk(sk);
diff --git a/net/ipv4/tcp_cong.c b/net/ipv4/tcp_cong.c
index 38f2f8a..8fd348c 100644
--- a/net/ipv4/tcp_cong.c
+++ b/net/ipv4/tcp_cong.c
@@ -13,6 +13,7 @@
 #include <linux/types.h>
 #include <linux/list.h>
 #include <linux/gfp.h>
+#include <linux/jhash.h>
 #include <net/tcp.h>
 
 static DEFINE_SPINLOCK(tcp_cong_list_lock);
@@ -31,6 +32,19 @@ static struct tcp_congestion_ops *tcp_ca_find(const char *name)
 	return NULL;
 }
 
+/* Simple linear search, not much in here. */
+struct tcp_congestion_ops *tcp_ca_find_key(u32 key)
+{
+	struct tcp_congestion_ops *e;
+
+	list_for_each_entry_rcu(e, &tcp_cong_list, list) {
+		if (e->key == key)
+			return e;
+	}
+
+	return NULL;
+}
+
 /*
  * Attach new congestion control algorithm to the list
  * of available options.
@@ -45,9 +59,12 @@ int tcp_register_congestion_control(struct tcp_congestion_ops *ca)
 		return -EINVAL;
 	}
 
+	ca->key = jhash(ca->name, sizeof(ca->name), strlen(ca->name));
+
 	spin_lock(&tcp_cong_list_lock);
-	if (tcp_ca_find(ca->name)) {
-		pr_notice("%s already registered\n", ca->name);
+	if (ca->key == TCP_CA_UNSPEC || tcp_ca_find_key(ca->key)) {
+		pr_notice("%s already registered or non-unique key\n",
+			  ca->name);
 		ret = -EEXIST;
 	} else {
 		list_add_tail_rcu(&ca->list, &tcp_cong_list);
@@ -70,9 +87,48 @@ void tcp_unregister_congestion_control(struct tcp_congestion_ops *ca)
 	spin_lock(&tcp_cong_list_lock);
 	list_del_rcu(&ca->list);
 	spin_unlock(&tcp_cong_list_lock);
+
+	/* Wait for outstanding readers to complete before the
+	 * module gets removed entirely.
+	 *
+	 * A try_module_get() should fail by now as our module is
+	 * in "going" state since no refs are held anymore and
+	 * module_exit() handler being called.
+	 */
+	synchronize_rcu();
 }
 EXPORT_SYMBOL_GPL(tcp_unregister_congestion_control);
 
+u32 tcp_ca_get_key_by_name(const char *name)
+{
+	const struct tcp_congestion_ops *ca;
+	u32 key;
+
+	rcu_read_lock();
+	ca = tcp_ca_find(name);
+	key = ca ? ca->key : TCP_CA_UNSPEC;
+	rcu_read_unlock();
+
+	return key;
+}
+EXPORT_SYMBOL_GPL(tcp_ca_get_key_by_name);
+
+char *tcp_ca_get_name_by_key(u32 key, char *buffer)
+{
+	const struct tcp_congestion_ops *ca;
+	char *ret = NULL;
+
+	rcu_read_lock();
+	ca = tcp_ca_find_key(key);
+	if (ca)
+		ret = strncpy(buffer, ca->name,
+			      TCP_CA_NAME_MAX);
+	rcu_read_unlock();
+
+	return ret;
+}
+EXPORT_SYMBOL_GPL(tcp_ca_get_name_by_key);
+
 /* Assign choice of congestion control. */
 void tcp_assign_congestion_control(struct sock *sk)
 {
@@ -256,6 +312,9 @@ int tcp_set_congestion_control(struct sock *sk, const char *name)
 	struct tcp_congestion_ops *ca;
 	int err = 0;
 
+	if (icsk->icsk_ca_dst_locked)
+		return -EPERM;
+
 	rcu_read_lock();
 	ca = tcp_ca_find(name);
 
-- 
1.7.11.7

^ permalink raw reply related	[flat|nested] 9+ messages in thread

* [PATCH net-next 3/4] net: tcp: add RTAX_CC_ALGO fib handling
  2014-12-05 15:24 [PATCH net-next 0/4] net: allow setting congctl via routing table Daniel Borkmann
  2014-12-05 15:24 ` [PATCH net-next 1/4] net: tcp: refactor reinitialization of congestion control Daniel Borkmann
  2014-12-05 15:24 ` [PATCH net-next 2/4] net: tcp: add key management to " Daniel Borkmann
@ 2014-12-05 15:24 ` Daniel Borkmann
  2014-12-05 15:24 ` [PATCH net-next 4/4] net: tcp: add per route congestion control Daniel Borkmann
  2014-12-05 16:35 ` [PATCH net-next 0/4] net: allow setting congctl via routing table Dave Taht
  4 siblings, 0 replies; 9+ messages in thread
From: Daniel Borkmann @ 2014-12-05 15:24 UTC (permalink / raw)
  To: davem; +Cc: hannes, fw, netdev

This patch adds the minimum necessary for the RTAX_CC_ALGO congestion
control metric to be set up and dumped back to user space.

While the internal representation of RTAX_CC_ALGO is handled as a u32
key, we avoided to expose this implementation detail to user space, thus
instead, we chose the netlink attribute that is being exchanged between
user space to be the actual congestion control algorithm name, similarly
as in the setsockopt(2) API in order to allow for maximum flexibility,
even for 3rd party modules.

It is a bit unfortunate that RTAX_QUICKACK used up a whole RTAX slot as
it should have been stored in RTAX_FEATURES instead, we first thought
about reusing it for the congestion control key, but it brings more
complications and/or confusion than worth it. Trying to load a non
present congestion algorithm name will be rejected.

Joint work with Florian Westphal.

Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Daniel Borkmann <dborkman@redhat.com>
---
 include/net/tcp.h              |  7 +++++++
 include/uapi/linux/rtnetlink.h |  2 ++
 net/core/rtnetlink.c           | 15 +++++++++++++--
 net/decnet/dn_fib.c            |  3 ++-
 net/decnet/dn_table.c          |  3 ++-
 net/ipv4/fib_semantics.c       | 14 ++++++++++++--
 net/ipv6/ip6_fib.c             | 15 ++++++++++++++-
 net/ipv6/route.c               |  3 ++-
 8 files changed, 54 insertions(+), 8 deletions(-)

diff --git a/include/net/tcp.h b/include/net/tcp.h
index 135b70c..95bb237 100644
--- a/include/net/tcp.h
+++ b/include/net/tcp.h
@@ -846,7 +846,14 @@ extern struct tcp_congestion_ops tcp_reno;
 
 struct tcp_congestion_ops *tcp_ca_find_key(u32 key);
 u32 tcp_ca_get_key_by_name(const char *name);
+#ifdef CONFIG_INET
 char *tcp_ca_get_name_by_key(u32 key, char *buffer);
+#else
+static inline char *tcp_ca_get_name_by_key(u32 key, char *buffer)
+{
+	return NULL;
+}
+#endif
 
 static inline bool tcp_ca_needs_ecn(const struct sock *sk)
 {
diff --git a/include/uapi/linux/rtnetlink.h b/include/uapi/linux/rtnetlink.h
index 9c9b8b4..d81f22d 100644
--- a/include/uapi/linux/rtnetlink.h
+++ b/include/uapi/linux/rtnetlink.h
@@ -389,6 +389,8 @@ enum {
 #define RTAX_INITRWND RTAX_INITRWND
 	RTAX_QUICKACK,
 #define RTAX_QUICKACK RTAX_QUICKACK
+	RTAX_CC_ALGO,
+#define RTAX_CC_ALGO RTAX_CC_ALGO
 	__RTAX_MAX
 };
 
diff --git a/net/core/rtnetlink.c b/net/core/rtnetlink.c
index 61cb7e7..3566f41 100644
--- a/net/core/rtnetlink.c
+++ b/net/core/rtnetlink.c
@@ -50,6 +50,7 @@
 #include <net/arp.h>
 #include <net/route.h>
 #include <net/udp.h>
+#include <net/tcp.h>
 #include <net/sock.h>
 #include <net/pkt_sched.h>
 #include <net/fib_rules.h>
@@ -669,9 +670,19 @@ int rtnetlink_put_metrics(struct sk_buff *skb, u32 *metrics)
 
 	for (i = 0; i < RTAX_MAX; i++) {
 		if (metrics[i]) {
+			if (i == RTAX_CC_ALGO - 1) {
+				char tmp[TCP_CA_NAME_MAX], *name;
+
+				name = tcp_ca_get_name_by_key(metrics[i], tmp);
+				if (!name)
+					continue;
+				if (nla_put_string(skb, i + 1, name))
+					goto nla_put_failure;
+			} else {
+				if (nla_put_u32(skb, i + 1, metrics[i]))
+					goto nla_put_failure;
+			}
 			valid++;
-			if (nla_put_u32(skb, i+1, metrics[i]))
-				goto nla_put_failure;
 		}
 	}
 
diff --git a/net/decnet/dn_fib.c b/net/decnet/dn_fib.c
index d332aef..df48034 100644
--- a/net/decnet/dn_fib.c
+++ b/net/decnet/dn_fib.c
@@ -298,7 +298,8 @@ struct dn_fib_info *dn_fib_create_info(const struct rtmsg *r, struct nlattr *att
 			int type = nla_type(attr);
 
 			if (type) {
-				if (type > RTAX_MAX || nla_len(attr) < 4)
+				if (type > RTAX_MAX || type == RTAX_CC_ALGO ||
+				    nla_len(attr) < 4)
 					goto err_inval;
 
 				fi->fib_metrics[type-1] = nla_get_u32(attr);
diff --git a/net/decnet/dn_table.c b/net/decnet/dn_table.c
index 86e3807..75c20a9 100644
--- a/net/decnet/dn_table.c
+++ b/net/decnet/dn_table.c
@@ -273,7 +273,8 @@ static inline size_t dn_fib_nlmsg_size(struct dn_fib_info *fi)
 	size_t payload = NLMSG_ALIGN(sizeof(struct rtmsg))
 			 + nla_total_size(4) /* RTA_TABLE */
 			 + nla_total_size(2) /* RTA_DST */
-			 + nla_total_size(4); /* RTA_PRIORITY */
+			 + nla_total_size(4) /* RTA_PRIORITY */
+			 + nla_total_size(TCP_CA_NAME_MAX); /* RTAX_CC_ALGO */
 
 	/* space for nested metrics */
 	payload += nla_total_size((RTAX_MAX * nla_total_size(4)));
diff --git a/net/ipv4/fib_semantics.c b/net/ipv4/fib_semantics.c
index f99f41b..d2b7b55 100644
--- a/net/ipv4/fib_semantics.c
+++ b/net/ipv4/fib_semantics.c
@@ -360,7 +360,8 @@ static inline size_t fib_nlmsg_size(struct fib_info *fi)
 			 + nla_total_size(4) /* RTA_TABLE */
 			 + nla_total_size(4) /* RTA_DST */
 			 + nla_total_size(4) /* RTA_PRIORITY */
-			 + nla_total_size(4); /* RTA_PREFSRC */
+			 + nla_total_size(4) /* RTA_PREFSRC */
+			 + nla_total_size(TCP_CA_NAME_MAX); /* RTAX_CC_ALGO */
 
 	/* space for nested metrics */
 	payload += nla_total_size((RTAX_MAX * nla_total_size(4)));
@@ -859,7 +860,16 @@ struct fib_info *fib_create_info(struct fib_config *cfg)
 
 				if (type > RTAX_MAX)
 					goto err_inval;
-				val = nla_get_u32(nla);
+				if (type == RTAX_CC_ALGO) {
+					char tmp[TCP_CA_NAME_MAX];
+
+					nla_strlcpy(tmp, nla, sizeof(tmp));
+					val = tcp_ca_get_key_by_name(tmp);
+					if (val == TCP_CA_UNSPEC)
+						goto err_inval;
+				} else {
+					val = nla_get_u32(nla);
+				}
 				if (type == RTAX_ADVMSS && val > 65535 - 40)
 					val = 65535 - 40;
 				if (type == RTAX_MTU && val > 65535 - 15)
diff --git a/net/ipv6/ip6_fib.c b/net/ipv6/ip6_fib.c
index b2d1838..0998ac6 100644
--- a/net/ipv6/ip6_fib.c
+++ b/net/ipv6/ip6_fib.c
@@ -30,6 +30,7 @@
 #include <linux/slab.h>
 
 #include <net/ipv6.h>
+#include <net/tcp.h>
 #include <net/ndisc.h>
 #include <net/addrconf.h>
 
@@ -650,10 +651,22 @@ static int fib6_commit_metrics(struct dst_entry *dst,
 		int type = nla_type(nla);
 
 		if (type) {
+			u32 val;
+
 			if (type > RTAX_MAX)
 				return -EINVAL;
+			if (type == RTAX_CC_ALGO) {
+				char tmp[TCP_CA_NAME_MAX];
+
+				nla_strlcpy(tmp, nla, sizeof(tmp));
+				val = tcp_ca_get_key_by_name(tmp);
+				if (val == TCP_CA_UNSPEC)
+					return -EINVAL;
+			} else {
+				val = nla_get_u32(nla);
+			}
 
-			mp[type - 1] = nla_get_u32(nla);
+			mp[type - 1] = val;
 		}
 	}
 	return 0;
diff --git a/net/ipv6/route.c b/net/ipv6/route.c
index c910831..818c99a 100644
--- a/net/ipv6/route.c
+++ b/net/ipv6/route.c
@@ -2534,7 +2534,8 @@ static inline size_t rt6_nlmsg_size(void)
 	       + nla_total_size(4) /* RTA_OIF */
 	       + nla_total_size(4) /* RTA_PRIORITY */
 	       + RTAX_MAX * nla_total_size(4) /* RTA_METRICS */
-	       + nla_total_size(sizeof(struct rta_cacheinfo));
+	       + nla_total_size(sizeof(struct rta_cacheinfo))
+	       + nla_total_size(TCP_CA_NAME_MAX); /* RTAX_CC_ALGO */
 }
 
 static int rt6_fill_node(struct net *net,
-- 
1.7.11.7

^ permalink raw reply related	[flat|nested] 9+ messages in thread

* [PATCH net-next 4/4] net: tcp: add per route congestion control
  2014-12-05 15:24 [PATCH net-next 0/4] net: allow setting congctl via routing table Daniel Borkmann
                   ` (2 preceding siblings ...)
  2014-12-05 15:24 ` [PATCH net-next 3/4] net: tcp: add RTAX_CC_ALGO fib handling Daniel Borkmann
@ 2014-12-05 15:24 ` Daniel Borkmann
  2014-12-05 16:35 ` [PATCH net-next 0/4] net: allow setting congctl via routing table Dave Taht
  4 siblings, 0 replies; 9+ messages in thread
From: Daniel Borkmann @ 2014-12-05 15:24 UTC (permalink / raw)
  To: davem; +Cc: hannes, fw, netdev

This work adds the possibility to define a per route/destination congestion
control algorithm. Generally, this opens up the possibility for a machine
with different links to enforce specific congestion control algorithms with
optimal strategies for each of them based on their network characteristics,
even transparently for a single application listening on all links.

For our specific use case, this additionally facilitates deployment of DCTCP,
for example, applications can easily serve internal traffic/dsts in DCTCP and
external one with CUBIC. Other scenarios would also allow for utilizing e.g.
long living, low priority background flows for certain destinations/routes
while still being able for normal traffic to utilize the default congestion
control algorithm. We also thought about a per netns setting (where different
defaults are possible), but given its actually a link specific property, we
argue that a per route/destination setting is the most natural and flexible.

The administrator can utilize this through ip-route(8) by appending "congctl
[lock] <name>", where <name> denotes the name of a loaded congestion control
algorithm and the optional lock parameter allows to enforce the given algorithm
so that applications in user space would not be allowed to overwrite that
algorithm for that destination.

The dst metric lookups are being done when a dst entry is already available
in order to avoid a costly lookup and still before the algorithms are being
initialized, thus overhead is very low when the feature is not being used.
While the client side would need to drop the current reference on the module,
on server side this can actually even be avoided as we just got a flat-copied
socket clone.

In case a dst metric contains a key to a congestion control module that has
just been removed from the kernel, it will fallback to the default congestion
control discipline for those and not reassign anything. Should at a later
point in time that module be reinserted into the kernel, then this algorithm
will be used again as keys are deterministic/static.

Joint work with Florian Westphal.

Suggested-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Daniel Borkmann <dborkman@redhat.com>
---
 include/net/tcp.h        |  6 ++++++
 net/ipv4/tcp_ipv4.c      |  2 ++
 net/ipv4/tcp_minisocks.c | 30 ++++++++++++++++++++++++++----
 net/ipv4/tcp_output.c    | 21 +++++++++++++++++++++
 net/ipv6/tcp_ipv6.c      |  2 ++
 5 files changed, 57 insertions(+), 4 deletions(-)

diff --git a/include/net/tcp.h b/include/net/tcp.h
index 95bb237..b8fdc6b 100644
--- a/include/net/tcp.h
+++ b/include/net/tcp.h
@@ -448,6 +448,7 @@ int tcp_v4_conn_request(struct sock *sk, struct sk_buff *skb);
 struct sock *tcp_create_openreq_child(struct sock *sk,
 				      struct request_sock *req,
 				      struct sk_buff *skb);
+void tcp_ca_openreq_child(struct sock *sk, const struct dst_entry *dst);
 struct sock *tcp_v4_syn_recv_sock(struct sock *sk, struct sk_buff *skb,
 				  struct request_sock *req,
 				  struct dst_entry *dst);
@@ -636,6 +637,11 @@ static inline u32 tcp_rto_min_us(struct sock *sk)
 	return jiffies_to_usecs(tcp_rto_min(sk));
 }
 
+static inline bool tcp_ca_dst_locked(const struct dst_entry *dst)
+{
+	return dst_metric_locked(dst, RTAX_CC_ALGO);
+}
+
 /* Compute the actual receive window we are currently advertising.
  * Rcv_nxt can be after the window if our peer push more data
  * than the offered window.
diff --git a/net/ipv4/tcp_ipv4.c b/net/ipv4/tcp_ipv4.c
index 33f5ff0..ed94fdc 100644
--- a/net/ipv4/tcp_ipv4.c
+++ b/net/ipv4/tcp_ipv4.c
@@ -1340,6 +1340,8 @@ struct sock *tcp_v4_syn_recv_sock(struct sock *sk, struct sk_buff *skb,
 	}
 	sk_setup_caps(newsk, dst);
 
+	tcp_ca_openreq_child(newsk, dst);
+
 	tcp_sync_mss(newsk, dst_mtu(dst));
 	newtp->advmss = dst_metric_advmss(dst);
 	if (tcp_sk(sk)->rx_opt.user_mss &&
diff --git a/net/ipv4/tcp_minisocks.c b/net/ipv4/tcp_minisocks.c
index 63d2680..bc9216d 100644
--- a/net/ipv4/tcp_minisocks.c
+++ b/net/ipv4/tcp_minisocks.c
@@ -399,6 +399,32 @@ static void tcp_ecn_openreq_child(struct tcp_sock *tp,
 	tp->ecn_flags = inet_rsk(req)->ecn_ok ? TCP_ECN_OK : 0;
 }
 
+void tcp_ca_openreq_child(struct sock *sk, const struct dst_entry *dst)
+{
+	struct inet_connection_sock *icsk = inet_csk(sk);
+	u32 ca_key = dst_metric(dst, RTAX_CC_ALGO);
+	bool ca_got_dst = false;
+
+	if (ca_key != TCP_CA_UNSPEC) {
+		const struct tcp_congestion_ops *ca;
+
+		rcu_read_lock();
+		ca = tcp_ca_find_key(ca_key);
+		if (likely(ca && try_module_get(ca->owner))) {
+			icsk->icsk_ca_dst_locked = tcp_ca_dst_locked(dst);
+			icsk->icsk_ca_ops = ca;
+			ca_got_dst = true;
+		}
+		rcu_read_unlock();
+	}
+
+	if (!ca_got_dst && !try_module_get(icsk->icsk_ca_ops->owner))
+		tcp_assign_congestion_control(sk);
+
+	tcp_set_ca_state(sk, TCP_CA_Open);
+}
+EXPORT_SYMBOL_GPL(tcp_ca_openreq_child);
+
 /* This is not only more efficient than what we used to do, it eliminates
  * a lot of code duplication between IPv4/IPv6 SYN recv processing. -DaveM
  *
@@ -451,10 +477,6 @@ struct sock *tcp_create_openreq_child(struct sock *sk, struct request_sock *req,
 		newtp->snd_cwnd = TCP_INIT_CWND;
 		newtp->snd_cwnd_cnt = 0;
 
-		if (!try_module_get(newicsk->icsk_ca_ops->owner))
-			tcp_assign_congestion_control(newsk);
-
-		tcp_set_ca_state(newsk, TCP_CA_Open);
 		tcp_init_xmit_timers(newsk);
 		__skb_queue_head_init(&newtp->out_of_order_queue);
 		newtp->write_seq = newtp->pushed_seq = treq->snt_isn + 1;
diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c
index f5bd4bd..5eb5676 100644
--- a/net/ipv4/tcp_output.c
+++ b/net/ipv4/tcp_output.c
@@ -2916,6 +2916,25 @@ struct sk_buff *tcp_make_synack(struct sock *sk, struct dst_entry *dst,
 }
 EXPORT_SYMBOL(tcp_make_synack);
 
+static void tcp_ca_dst_init(struct sock *sk, const struct dst_entry *dst)
+{
+	struct inet_connection_sock *icsk = inet_csk(sk);
+	const struct tcp_congestion_ops *ca;
+	u32 ca_key = dst_metric(dst, RTAX_CC_ALGO);
+
+	if (ca_key == TCP_CA_UNSPEC)
+		return;
+
+	rcu_read_lock();
+	ca = tcp_ca_find_key(ca_key);
+	if (likely(ca && try_module_get(ca->owner))) {
+		module_put(icsk->icsk_ca_ops->owner);
+		icsk->icsk_ca_dst_locked = tcp_ca_dst_locked(dst);
+		icsk->icsk_ca_ops = ca;
+	}
+	rcu_read_unlock();
+}
+
 /* Do all connect socket setups that can be done AF independent. */
 static void tcp_connect_init(struct sock *sk)
 {
@@ -2941,6 +2960,8 @@ static void tcp_connect_init(struct sock *sk)
 	tcp_mtup_init(sk);
 	tcp_sync_mss(sk, dst_mtu(dst));
 
+	tcp_ca_dst_init(sk, dst);
+
 	if (!tp->window_clamp)
 		tp->window_clamp = dst_metric(dst, RTAX_WINDOW);
 	tp->advmss = dst_metric_advmss(dst);
diff --git a/net/ipv6/tcp_ipv6.c b/net/ipv6/tcp_ipv6.c
index d06af89..20f44fe 100644
--- a/net/ipv6/tcp_ipv6.c
+++ b/net/ipv6/tcp_ipv6.c
@@ -1199,6 +1199,8 @@ static struct sock *tcp_v6_syn_recv_sock(struct sock *sk, struct sk_buff *skb,
 		inet_csk(newsk)->icsk_ext_hdr_len = (newnp->opt->opt_nflen +
 						     newnp->opt->opt_flen);
 
+	tcp_ca_openreq_child(newsk, dst);
+
 	tcp_sync_mss(newsk, dst_mtu(dst));
 	newtp->advmss = dst_metric_advmss(dst);
 	if (tcp_sk(sk)->rx_opt.user_mss &&
-- 
1.7.11.7

^ permalink raw reply related	[flat|nested] 9+ messages in thread

* Re: [PATCH net-next 0/4] net: allow setting congctl via routing table
  2014-12-05 15:24 [PATCH net-next 0/4] net: allow setting congctl via routing table Daniel Borkmann
                   ` (3 preceding siblings ...)
  2014-12-05 15:24 ` [PATCH net-next 4/4] net: tcp: add per route congestion control Daniel Borkmann
@ 2014-12-05 16:35 ` Dave Taht
  2014-12-05 18:35   ` Hannes Frederic Sowa
  4 siblings, 1 reply; 9+ messages in thread
From: Dave Taht @ 2014-12-05 16:35 UTC (permalink / raw)
  To: Daniel Borkmann; +Cc: davem, Hannes Frederic Sowa, Florian Westphal, netdev

On Fri, Dec 5, 2014 at 7:24 AM, Daniel Borkmann <dborkman@redhat.com> wrote:
> This is the second part of our work and allows for setting the congestion
> control algorithm via routing table. For details, please see individual
> patches.
>
> Joint work with Florian Westphal, suggested by Hannes Frederic Sowa.
>
> Thanks!
>
> Daniel Borkmann (4):
>   net: tcp: refactor reinitialization of congestion control
>   net: tcp: add key management to congestion control
>   net: tcp: add RTAX_CC_ALGO fib handling
>   net: tcp: add per route congestion control


Very interesting. Have you tried something other than dctcp here
(e.g. westwood or lp?)

Have you considered the case where the route changes underneath
you from one device to another?

Example, here I am routing everything through eth0, where I
would want cubic, probably...

root@ganesha:~/git/tinc# ip route
default via 172.26.16.1 dev eth0  proto babel onlink
69.181.216.0/22 via 172.26.16.1 dev eth0  proto babel onlink
169.254.0.0/16 dev eth0  scope link  metric 1000
172.26.16.0/24 dev eth0  proto kernel  scope link  src 172.26.16.177
172.26.16.1 via 172.26.16.1 dev eth0  proto babel onlink
172.26.16.112 via 172.26.16.112 dev eth0  proto babel onlink
172.26.17.0/24 via 172.26.16.1 dev eth0  proto babel onlink
172.26.17.3 via 172.26.16.1 dev eth0  proto babel onlink
172.26.17.227 via 172.26.16.1 dev eth0  proto babel onlink
192.168.7.0/30 dev eth1  proto kernel  scope link  src 192.168.7.1  metric 1
192.168.7.2 via 172.26.16.112 dev eth0  proto babel onlink

And I pull the plug, and everything flips over to wlan0,
where I might want westwood (or something saner than
that. It might be nice to have a per-device cc default
algorithm...)

root@ganesha:~/git/tinc# ip route
default via 172.26.17.224 dev wlan0  proto babel onlink
69.181.216.0/22 via 172.26.17.224 dev wlan0  proto babel onlink
169.254.0.0/16 dev eth0  scope link  metric 1000
172.26.16.0/24 dev eth0  proto kernel  scope link  src 172.26.16.177
172.26.16.1 via 172.26.17.227 dev wlan0  proto babel onlink
172.26.16.112 via 172.26.17.227 dev wlan0  proto babel onlink
172.26.17.0/24 via 172.26.17.224 dev wlan0  proto babel onlink
172.26.17.3 via 172.26.17.227 dev wlan0  proto babel onlink
172.26.17.227 via 172.26.17.227 dev wlan0  proto babel onlink
192.168.7.0/30 dev eth1  proto kernel  scope link  src 192.168.7.1  metric 1
192.168.7.2 via 172.26.17.227 dev wlan0  proto babel onlink



-- 
Dave Täht
the quest for a fq-friendly vpn with no head of line blocking continues!
https://plus.google.com/u/0/107942175615993706558/posts/QWPWLoGMtrm

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: [PATCH net-next 0/4] net: allow setting congctl via routing table
  2014-12-05 16:35 ` [PATCH net-next 0/4] net: allow setting congctl via routing table Dave Taht
@ 2014-12-05 18:35   ` Hannes Frederic Sowa
  2014-12-05 19:05     ` Dave Taht
  0 siblings, 1 reply; 9+ messages in thread
From: Hannes Frederic Sowa @ 2014-12-05 18:35 UTC (permalink / raw)
  To: Dave Taht; +Cc: Daniel Borkmann, davem, Florian Westphal, netdev

On Fr, 2014-12-05 at 08:35 -0800, Dave Taht wrote:
> On Fri, Dec 5, 2014 at 7:24 AM, Daniel Borkmann <dborkman@redhat.com> wrote:
> > This is the second part of our work and allows for setting the congestion
> > control algorithm via routing table. For details, please see individual
> > patches.
> >
> > Joint work with Florian Westphal, suggested by Hannes Frederic Sowa.
> >
> > Thanks!
> >
> > Daniel Borkmann (4):
> >   net: tcp: refactor reinitialization of congestion control
> >   net: tcp: add key management to congestion control
> >   net: tcp: add RTAX_CC_ALGO fib handling
> >   net: tcp: add per route congestion control
> 
> 
> Very interesting. Have you tried something other than dctcp here
> (e.g. westwood or lp?)
> 
> Have you considered the case where the route changes underneath
> you from one device to another?

Notice, there is no way the state of a tcp congestion control algorithm
can be converted to be used by a different one, so this would only
affect new tcp connections via this interface.

> Example, here I am routing everything through eth0, where I
> would want cubic, probably...
> 
> root@ganesha:~/git/tinc# ip route
> default via 172.26.16.1 dev eth0  proto babel onlink
> 69.181.216.0/22 via 172.26.16.1 dev eth0  proto babel onlink
> 169.254.0.0/16 dev eth0  scope link  metric 1000
> 172.26.16.0/24 dev eth0  proto kernel  scope link  src 172.26.16.177
> 172.26.16.1 via 172.26.16.1 dev eth0  proto babel onlink
> 172.26.16.112 via 172.26.16.112 dev eth0  proto babel onlink
> 172.26.17.0/24 via 172.26.16.1 dev eth0  proto babel onlink
> 172.26.17.3 via 172.26.16.1 dev eth0  proto babel onlink
> 172.26.17.227 via 172.26.16.1 dev eth0  proto babel onlink
> 192.168.7.0/30 dev eth1  proto kernel  scope link  src 192.168.7.1  metric 1
> 192.168.7.2 via 172.26.16.112 dev eth0  proto babel onlink
> 
> And I pull the plug, and everything flips over to wlan0,
> where I might want westwood (or something saner than
> that. It might be nice to have a per-device cc default
> algorithm...)

Something like that might be possible with metrics and "via ... dev if0
metric xxx" routes, which will be cleaned up as soon as the interface
goes down and the fallback will be to a route with a different
congestion algorithm.

> root@ganesha:~/git/tinc# ip route
> default via 172.26.17.224 dev wlan0  proto babel onlink
> 69.181.216.0/22 via 172.26.17.224 dev wlan0  proto babel onlink
> 169.254.0.0/16 dev eth0  scope link  metric 1000
> 172.26.16.0/24 dev eth0  proto kernel  scope link  src 172.26.16.177
> 172.26.16.1 via 172.26.17.227 dev wlan0  proto babel onlink
> 172.26.16.112 via 172.26.17.227 dev wlan0  proto babel onlink
> 172.26.17.0/24 via 172.26.17.224 dev wlan0  proto babel onlink
> 172.26.17.3 via 172.26.17.227 dev wlan0  proto babel onlink
> 172.26.17.227 via 172.26.17.227 dev wlan0  proto babel onlink
> 192.168.7.0/30 dev eth1  proto kernel  scope link  src 192.168.7.1  metric 1
> 192.168.7.2 via 172.26.17.227 dev wlan0  proto babel onlink
> 

Bye,
Hannes

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: [PATCH net-next 0/4] net: allow setting congctl via routing table
  2014-12-05 18:35   ` Hannes Frederic Sowa
@ 2014-12-05 19:05     ` Dave Taht
  2014-12-05 21:03       ` Hannes Frederic Sowa
  0 siblings, 1 reply; 9+ messages in thread
From: Dave Taht @ 2014-12-05 19:05 UTC (permalink / raw)
  To: Hannes Frederic Sowa; +Cc: Daniel Borkmann, davem, Florian Westphal, netdev

On Fri, Dec 5, 2014 at 10:35 AM, Hannes Frederic Sowa
<hannes@stressinduktion.org> wrote:
> On Fr, 2014-12-05 at 08:35 -0800, Dave Taht wrote:
>> On Fri, Dec 5, 2014 at 7:24 AM, Daniel Borkmann <dborkman@redhat.com> wrote:
>> > This is the second part of our work and allows for setting the congestion
>> > control algorithm via routing table. For details, please see individual
>> > patches.
>> >
>> > Joint work with Florian Westphal, suggested by Hannes Frederic Sowa.
>> >
>> > Thanks!
>> >
>> > Daniel Borkmann (4):
>> >   net: tcp: refactor reinitialization of congestion control
>> >   net: tcp: add key management to congestion control
>> >   net: tcp: add RTAX_CC_ALGO fib handling
>> >   net: tcp: add per route congestion control
>>
>>
>> Very interesting. Have you tried something other than dctcp here
>> (e.g. westwood or lp?)
>>
>> Have you considered the case where the route changes underneath
>> you from one device to another?
>
> Notice, there is no way the state of a tcp congestion control algorithm
> can be converted to be used by a different one, so this would only
> affect new tcp connections via this interface.

You are missing the point. If the route changes from a path that
is DCTCP capable to one that is not, (say you fail over to a backup link)

and flows persist, bad things will happen. DCTCP, in particular, depends
upon a very specific AQM configuration on all the hops in the path, without that
it can be very aggressive.

I do think it is feasible to convert from at least some of the
core state from one tcp congestion control algorithm to another.

>> Example, here I am routing everything through eth0, where I
>> would want cubic, probably...
>>
>> root@ganesha:~/git/tinc# ip route
>> default via 172.26.16.1 dev eth0  proto babel onlink
>> 69.181.216.0/22 via 172.26.16.1 dev eth0  proto babel onlink
>> 169.254.0.0/16 dev eth0  scope link  metric 1000
>> 172.26.16.0/24 dev eth0  proto kernel  scope link  src 172.26.16.177
>> 172.26.16.1 via 172.26.16.1 dev eth0  proto babel onlink
>> 172.26.16.112 via 172.26.16.112 dev eth0  proto babel onlink
>> 172.26.17.0/24 via 172.26.16.1 dev eth0  proto babel onlink
>> 172.26.17.3 via 172.26.16.1 dev eth0  proto babel onlink
>> 172.26.17.227 via 172.26.16.1 dev eth0  proto babel onlink
>> 192.168.7.0/30 dev eth1  proto kernel  scope link  src 192.168.7.1  metric 1
>> 192.168.7.2 via 172.26.16.112 dev eth0  proto babel onlink
>>
>> And I pull the plug, and everything flips over to wlan0,
>> where I might want westwood (or something saner than
>> that. It might be nice to have a per-device cc default
>> algorithm...)
>
> Something like that might be possible with metrics and "via ... dev if0
> metric xxx" routes, which will be cleaned up as soon as the interface
> goes down and the fallback will be to a route with a different
> congestion algorithm.

mmm... I do dynamic routing via various routing protocols, which
generally don't bother with inserting more than one metric.

While we are thinking through this, what happens with tunnels?

This route in my network switches between interfaces and routes
depending on which is best.

fde5:dfb9:df90:fff0::/64 dev vpn6  proto kernel  metric 256
fde5:dfb9:df90:fff0::/60 via fde5:dfb9:df90:fff0::1 dev vpn6  metric 1024


>> root@ganesha:~/git/tinc# ip route
>> default via 172.26.17.224 dev wlan0  proto babel onlink
>> 69.181.216.0/22 via 172.26.17.224 dev wlan0  proto babel onlink
>> 169.254.0.0/16 dev eth0  scope link  metric 1000
>> 172.26.16.0/24 dev eth0  proto kernel  scope link  src 172.26.16.177
>> 172.26.16.1 via 172.26.17.227 dev wlan0  proto babel onlink
>> 172.26.16.112 via 172.26.17.227 dev wlan0  proto babel onlink
>> 172.26.17.0/24 via 172.26.17.224 dev wlan0  proto babel onlink
>> 172.26.17.3 via 172.26.17.227 dev wlan0  proto babel onlink
>> 172.26.17.227 via 172.26.17.227 dev wlan0  proto babel onlink
>> 192.168.7.0/30 dev eth1  proto kernel  scope link  src 192.168.7.1  metric 1
>> 192.168.7.2 via 172.26.17.227 dev wlan0  proto babel onlink
>>
>
> Bye,
> Hannes
>
>



-- 
Dave Täht

thttp://www.bufferbloat.net/projects/bloat/wiki/Upcoming_Talks

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: [PATCH net-next 0/4] net: allow setting congctl via routing table
  2014-12-05 19:05     ` Dave Taht
@ 2014-12-05 21:03       ` Hannes Frederic Sowa
  0 siblings, 0 replies; 9+ messages in thread
From: Hannes Frederic Sowa @ 2014-12-05 21:03 UTC (permalink / raw)
  To: Dave Taht; +Cc: Daniel Borkmann, davem, Florian Westphal, netdev

Hi Dave,

On Fr, 2014-12-05 at 11:05 -0800, Dave Taht wrote:
> On Fri, Dec 5, 2014 at 10:35 AM, Hannes Frederic Sowa
> <hannes@stressinduktion.org> wrote:
> > On Fr, 2014-12-05 at 08:35 -0800, Dave Taht wrote:
> >> On Fri, Dec 5, 2014 at 7:24 AM, Daniel Borkmann <dborkman@redhat.com> wrote:
> >> > This is the second part of our work and allows for setting the congestion
> >> > control algorithm via routing table. For details, please see individual
> >> > patches.
> >> >
> >> > Joint work with Florian Westphal, suggested by Hannes Frederic Sowa.
> >> >
> >> > Thanks!
> >> >
> >> > Daniel Borkmann (4):
> >> >   net: tcp: refactor reinitialization of congestion control
> >> >   net: tcp: add key management to congestion control
> >> >   net: tcp: add RTAX_CC_ALGO fib handling
> >> >   net: tcp: add per route congestion control
> >>
> >>
> >> Very interesting. Have you tried something other than dctcp here
> >> (e.g. westwood or lp?)
> >>
> >> Have you considered the case where the route changes underneath
> >> you from one device to another?
> >
> > Notice, there is no way the state of a tcp congestion control algorithm
> > can be converted to be used by a different one, so this would only
> > affect new tcp connections via this interface.
> 
> You are missing the point. If the route changes from a path that
> is DCTCP capable to one that is not, (say you fail over to a backup link)

I don't think that today's datacenter are designed that the backup path
has less performance than the primary link (different AQM settings). It
is much more important e.g. to allow the connections to a e.g. database
server selecting dctcp as CC and having all connections going to the
internet using some "ordinary" tcp congestion algorithm.

> and flows persist, bad things will happen. DCTCP, in particular, depends
> upon a very specific AQM configuration on all the hops in the path, without that
> it can be very aggressive.

That's for sure.

> I do think it is feasible to convert from at least some of the
> core state from one tcp congestion control algorithm to another.

Hmm, I haven't looked if that is possible. It might be.

> >> Example, here I am routing everything through eth0, where I
> >> would want cubic, probably...
> >>
> >> root@ganesha:~/git/tinc# ip route
> >> default via 172.26.16.1 dev eth0  proto babel onlink
> >> 69.181.216.0/22 via 172.26.16.1 dev eth0  proto babel onlink
> >> 169.254.0.0/16 dev eth0  scope link  metric 1000
> >> 172.26.16.0/24 dev eth0  proto kernel  scope link  src 172.26.16.177
> >> 172.26.16.1 via 172.26.16.1 dev eth0  proto babel onlink
> >> 172.26.16.112 via 172.26.16.112 dev eth0  proto babel onlink
> >> 172.26.17.0/24 via 172.26.16.1 dev eth0  proto babel onlink
> >> 172.26.17.3 via 172.26.16.1 dev eth0  proto babel onlink
> >> 172.26.17.227 via 172.26.16.1 dev eth0  proto babel onlink
> >> 192.168.7.0/30 dev eth1  proto kernel  scope link  src 192.168.7.1  metric 1
> >> 192.168.7.2 via 172.26.16.112 dev eth0  proto babel onlink
> >>
> >> And I pull the plug, and everything flips over to wlan0,
> >> where I might want westwood (or something saner than
> >> that. It might be nice to have a per-device cc default
> >> algorithm...)
> >
> > Something like that might be possible with metrics and "via ... dev if0
> > metric xxx" routes, which will be cleaned up as soon as the interface
> > goes down and the fallback will be to a route with a different
> > congestion algorithm.
> 
> mmm... I do dynamic routing via various routing protocols, which
> generally don't bother with inserting more than one metric.

I totally understand, they might even remove the routes and re-add them,
thus losing the tcp cc property.

> While we are thinking through this, what happens with tunnels?

Tunnels should behave just like ordinary interfaces, but depending how
they get routed it might make problems regarding DCTCP.

> This route in my network switches between interfaces and routes
> depending on which is best.
> 
> fde5:dfb9:df90:fff0::/64 dev vpn6  proto kernel  metric 256
> fde5:dfb9:df90:fff0::/60 via fde5:dfb9:df90:fff0::1 dev vpn6  metric 1024
> 
> 
> >> root@ganesha:~/git/tinc# ip route
> >> default via 172.26.17.224 dev wlan0  proto babel onlink
> >> 69.181.216.0/22 via 172.26.17.224 dev wlan0  proto babel onlink
> >> 169.254.0.0/16 dev eth0  scope link  metric 1000
> >> 172.26.16.0/24 dev eth0  proto kernel  scope link  src 172.26.16.177
> >> 172.26.16.1 via 172.26.17.227 dev wlan0  proto babel onlink
> >> 172.26.16.112 via 172.26.17.227 dev wlan0  proto babel onlink
> >> 172.26.17.0/24 via 172.26.17.224 dev wlan0  proto babel onlink
> >> 172.26.17.3 via 172.26.17.227 dev wlan0  proto babel onlink
> >> 172.26.17.227 via 172.26.17.227 dev wlan0  proto babel onlink
> >> 192.168.7.0/30 dev eth1  proto kernel  scope link  src 192.168.7.1  metric 1
> >> 192.168.7.2 via 172.26.17.227 dev wlan0  proto babel onlink

Please note, that is is an end-node only feature. Normally, routers
don't do heavy tcp processing, thus using this feature on a router
wasn't considered by us. That's the same problematic like e.g.
tcp_quick_ack.

As soon as you have control over the application and it allows you to
bind to an interface via SO_BINDTODEVICE, you are able to select the
congestion control algorithm by using ip rule oif matching. But the
application could also chose the CC also by itself by using
'TCP_CONGESTION' setsockopt on a per-socket basis if you have source
access.

Bye,
Hannes

^ permalink raw reply	[flat|nested] 9+ messages in thread

end of thread, other threads:[~2014-12-05 21:03 UTC | newest]

Thread overview: 9+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2014-12-05 15:24 [PATCH net-next 0/4] net: allow setting congctl via routing table Daniel Borkmann
2014-12-05 15:24 ` [PATCH net-next 1/4] net: tcp: refactor reinitialization of congestion control Daniel Borkmann
2014-12-05 15:24 ` [PATCH net-next 2/4] net: tcp: add key management to " Daniel Borkmann
2014-12-05 15:24 ` [PATCH net-next 3/4] net: tcp: add RTAX_CC_ALGO fib handling Daniel Borkmann
2014-12-05 15:24 ` [PATCH net-next 4/4] net: tcp: add per route congestion control Daniel Borkmann
2014-12-05 16:35 ` [PATCH net-next 0/4] net: allow setting congctl via routing table Dave Taht
2014-12-05 18:35   ` Hannes Frederic Sowa
2014-12-05 19:05     ` Dave Taht
2014-12-05 21:03       ` Hannes Frederic Sowa

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.