linux-kernel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* [PATCH net-next v2 0/4] ICMP flow improvements
@ 2019-10-29 13:50 Matteo Croce
  2019-10-29 13:50 ` [PATCH net-next v2 1/4] flow_dissector: add meaningful comments Matteo Croce
                   ` (4 more replies)
  0 siblings, 5 replies; 18+ messages in thread
From: Matteo Croce @ 2019-10-29 13:50 UTC (permalink / raw)
  To: netdev
  Cc: Jay Vosburgh, Veaceslav Falico, Andy Gospodarek,
	David S . Miller ,
	Stanislav Fomichev, Daniel Borkmann, Song Liu,
	Alexei Starovoitov, Paul Blakey, linux-kernel

This series improves the flow inspector handling of ICMP packets:
The first two patches just add some comments in the code which would have saved
me a few minutes of time, and refactor a piece of code.
The third one adds to the flow inspector the capability to extract the
Identifier field, if present, so echo requests and replies are classified
as part of the same flow.
The fourth patch uses the function introduced earlier to the bonding driver,
so echo replies can be balanced across bonding slaves.

v1 -> v2:
 - remove unused struct members
 - add an helper to check for the Id field
 - use a local flow_dissector_key in the bonding to avoid
   changing behaviour of the flow dissector

Matteo Croce (4):
  flow_dissector: add meaningful comments
  flow_dissector: skip the ICMP dissector for non ICMP packets
  flow_dissector: extract more ICMP information
  bonding: balance ICMP echoes in layer3+4 mode

 drivers/net/bonding/bond_main.c |  77 ++++++++++++++++++++---
 include/net/flow_dissector.h    |  20 +++---
 net/core/flow_dissector.c       | 108 +++++++++++++++++++++++---------
 3 files changed, 160 insertions(+), 45 deletions(-)

-- 
2.21.0


^ permalink raw reply	[flat|nested] 18+ messages in thread

* [PATCH net-next v2 1/4] flow_dissector: add meaningful comments
  2019-10-29 13:50 [PATCH net-next v2 0/4] ICMP flow improvements Matteo Croce
@ 2019-10-29 13:50 ` Matteo Croce
  2019-10-29 13:50 ` [PATCH net-next v2 2/4] flow_dissector: skip the ICMP dissector for non ICMP packets Matteo Croce
                   ` (3 subsequent siblings)
  4 siblings, 0 replies; 18+ messages in thread
From: Matteo Croce @ 2019-10-29 13:50 UTC (permalink / raw)
  To: netdev
  Cc: Jay Vosburgh, Veaceslav Falico, Andy Gospodarek,
	David S . Miller ,
	Stanislav Fomichev, Daniel Borkmann, Song Liu,
	Alexei Starovoitov, Paul Blakey, linux-kernel

Documents two piece of code which can't be understood at a glance.

Signed-off-by: Matteo Croce <mcroce@redhat.com>
---
 include/net/flow_dissector.h | 1 +
 net/core/flow_dissector.c    | 6 ++++++
 2 files changed, 7 insertions(+)

diff --git a/include/net/flow_dissector.h b/include/net/flow_dissector.h
index 90bd210be060..7747af3cc500 100644
--- a/include/net/flow_dissector.h
+++ b/include/net/flow_dissector.h
@@ -282,6 +282,7 @@ struct flow_keys {
 	struct flow_dissector_key_vlan cvlan;
 	struct flow_dissector_key_keyid keyid;
 	struct flow_dissector_key_ports ports;
+	/* 'addrs' must be the last member */
 	struct flow_dissector_key_addrs addrs;
 };
 
diff --git a/net/core/flow_dissector.c b/net/core/flow_dissector.c
index 7c09d87d3269..affde70dad47 100644
--- a/net/core/flow_dissector.c
+++ b/net/core/flow_dissector.c
@@ -1374,6 +1374,9 @@ static inline size_t flow_keys_hash_length(const struct flow_keys *flow)
 {
 	size_t diff = FLOW_KEYS_HASH_OFFSET + sizeof(flow->addrs);
 	BUILD_BUG_ON((sizeof(*flow) - FLOW_KEYS_HASH_OFFSET) % sizeof(u32));
+	/* flow.addrs MUST be the last member in struct flow_keys because
+	 * different L3 protocols have different address length
+	 */
 	BUILD_BUG_ON(offsetof(typeof(*flow), addrs) !=
 		     sizeof(*flow) - sizeof(flow->addrs));
 
@@ -1421,6 +1424,9 @@ __be32 flow_get_u32_dst(const struct flow_keys *flow)
 }
 EXPORT_SYMBOL(flow_get_u32_dst);
 
+/* Sort the source and destination IP (and the ports if the IP are the same),
+ * to have consistent hash within the two directions
+ */
 static inline void __flow_hash_consistentify(struct flow_keys *keys)
 {
 	int addr_diff, i;
-- 
2.21.0


^ permalink raw reply related	[flat|nested] 18+ messages in thread

* [PATCH net-next v2 2/4] flow_dissector: skip the ICMP dissector for non ICMP packets
  2019-10-29 13:50 [PATCH net-next v2 0/4] ICMP flow improvements Matteo Croce
  2019-10-29 13:50 ` [PATCH net-next v2 1/4] flow_dissector: add meaningful comments Matteo Croce
@ 2019-10-29 13:50 ` Matteo Croce
  2019-10-29 13:50 ` [PATCH net-next v2 3/4] flow_dissector: extract more ICMP information Matteo Croce
                   ` (2 subsequent siblings)
  4 siblings, 0 replies; 18+ messages in thread
From: Matteo Croce @ 2019-10-29 13:50 UTC (permalink / raw)
  To: netdev
  Cc: Jay Vosburgh, Veaceslav Falico, Andy Gospodarek,
	David S . Miller ,
	Stanislav Fomichev, Daniel Borkmann, Song Liu,
	Alexei Starovoitov, Paul Blakey, linux-kernel

FLOW_DISSECTOR_KEY_ICMP is checked for every packet, not only ICMP ones.
Even if the test overhead is probably negligible, move the
ICMP dissector code under the big 'switch(ip_proto)' so it gets called
only for ICMP packets.

Signed-off-by: Matteo Croce <mcroce@redhat.com>
---
 net/core/flow_dissector.c | 34 +++++++++++++++++++++++++---------
 1 file changed, 25 insertions(+), 9 deletions(-)

diff --git a/net/core/flow_dissector.c b/net/core/flow_dissector.c
index affde70dad47..6443fac65ce8 100644
--- a/net/core/flow_dissector.c
+++ b/net/core/flow_dissector.c
@@ -203,6 +203,25 @@ __be32 __skb_flow_get_ports(const struct sk_buff *skb, int thoff, u8 ip_proto,
 }
 EXPORT_SYMBOL(__skb_flow_get_ports);
 
+/* If FLOW_DISSECTOR_KEY_ICMP is set, get the Type and Code from an ICMP packet
+ * using skb_flow_get_be16().
+ */
+static void __skb_flow_dissect_icmp(const struct sk_buff *skb,
+				    struct flow_dissector *flow_dissector,
+				    void *target_container,
+				    void *data, int thoff, int hlen)
+{
+	struct flow_dissector_key_icmp *key_icmp;
+
+	if (!dissector_uses_key(flow_dissector, FLOW_DISSECTOR_KEY_ICMP))
+		return;
+
+	key_icmp = skb_flow_dissector_target(flow_dissector,
+					     FLOW_DISSECTOR_KEY_ICMP,
+					     target_container);
+	key_icmp->icmp = skb_flow_get_be16(skb, thoff, data, hlen);
+}
+
 void skb_flow_dissect_meta(const struct sk_buff *skb,
 			   struct flow_dissector *flow_dissector,
 			   void *target_container)
@@ -853,7 +872,6 @@ bool __skb_flow_dissect(const struct net *net,
 	struct flow_dissector_key_basic *key_basic;
 	struct flow_dissector_key_addrs *key_addrs;
 	struct flow_dissector_key_ports *key_ports;
-	struct flow_dissector_key_icmp *key_icmp;
 	struct flow_dissector_key_tags *key_tags;
 	struct flow_dissector_key_vlan *key_vlan;
 	struct bpf_prog *attached = NULL;
@@ -1295,6 +1313,12 @@ bool __skb_flow_dissect(const struct net *net,
 				       data, nhoff, hlen);
 		break;
 
+	case IPPROTO_ICMP:
+	case IPPROTO_ICMPV6:
+		__skb_flow_dissect_icmp(skb, flow_dissector, target_container,
+					data, nhoff, hlen);
+		break;
+
 	default:
 		break;
 	}
@@ -1308,14 +1332,6 @@ bool __skb_flow_dissect(const struct net *net,
 							data, hlen);
 	}
 
-	if (dissector_uses_key(flow_dissector,
-			       FLOW_DISSECTOR_KEY_ICMP)) {
-		key_icmp = skb_flow_dissector_target(flow_dissector,
-						     FLOW_DISSECTOR_KEY_ICMP,
-						     target_container);
-		key_icmp->icmp = skb_flow_get_be16(skb, nhoff, data, hlen);
-	}
-
 	/* Process result of IP proto processing */
 	switch (fdret) {
 	case FLOW_DISSECT_RET_PROTO_AGAIN:
-- 
2.21.0


^ permalink raw reply related	[flat|nested] 18+ messages in thread

* [PATCH net-next v2 3/4] flow_dissector: extract more ICMP information
  2019-10-29 13:50 [PATCH net-next v2 0/4] ICMP flow improvements Matteo Croce
  2019-10-29 13:50 ` [PATCH net-next v2 1/4] flow_dissector: add meaningful comments Matteo Croce
  2019-10-29 13:50 ` [PATCH net-next v2 2/4] flow_dissector: skip the ICMP dissector for non ICMP packets Matteo Croce
@ 2019-10-29 13:50 ` Matteo Croce
  2019-10-29 13:50 ` [PATCH net-next v2 4/4] bonding: balance ICMP echoes in layer3+4 mode Matteo Croce
  2019-10-31  0:21 ` [PATCH net-next v2 0/4] ICMP flow improvements David Miller
  4 siblings, 0 replies; 18+ messages in thread
From: Matteo Croce @ 2019-10-29 13:50 UTC (permalink / raw)
  To: netdev
  Cc: Jay Vosburgh, Veaceslav Falico, Andy Gospodarek,
	David S . Miller ,
	Stanislav Fomichev, Daniel Borkmann, Song Liu,
	Alexei Starovoitov, Paul Blakey, linux-kernel

The ICMP flow dissector currently parses only the Type and Code fields.
Some ICMP packets (echo, timestamp) have a 16 bit Identifier field which
is used to correlate packets.
Add such field in flow_dissector_key_icmp and replace skb_flow_get_be16()
with a more complex function which populate this field.

Signed-off-by: Matteo Croce <mcroce@redhat.com>
---
 include/net/flow_dissector.h | 19 +++++----
 net/core/flow_dissector.c    | 74 ++++++++++++++++++++++++------------
 2 files changed, 61 insertions(+), 32 deletions(-)

diff --git a/include/net/flow_dissector.h b/include/net/flow_dissector.h
index 7747af3cc500..f8541d018848 100644
--- a/include/net/flow_dissector.h
+++ b/include/net/flow_dissector.h
@@ -6,6 +6,8 @@
 #include <linux/in6.h>
 #include <uapi/linux/if_ether.h>
 
+struct sk_buff;
+
 /**
  * struct flow_dissector_key_control:
  * @thoff: Transport header offset
@@ -156,19 +158,16 @@ struct flow_dissector_key_ports {
 
 /**
  * flow_dissector_key_icmp:
- *	@ports: type and code of ICMP header
- *		icmp: ICMP type (high) and code (low)
  *		type: ICMP type
  *		code: ICMP code
+ *		id:   session identifier
  */
 struct flow_dissector_key_icmp {
-	union {
-		__be16 icmp;
-		struct {
-			u8 type;
-			u8 code;
-		};
+	struct {
+		u8 type;
+		u8 code;
 	};
+	u16 id;
 };
 
 /**
@@ -282,6 +281,7 @@ struct flow_keys {
 	struct flow_dissector_key_vlan cvlan;
 	struct flow_dissector_key_keyid keyid;
 	struct flow_dissector_key_ports ports;
+	struct flow_dissector_key_icmp icmp;
 	/* 'addrs' must be the last member */
 	struct flow_dissector_key_addrs addrs;
 };
@@ -316,6 +316,9 @@ static inline bool flow_keys_have_l4(const struct flow_keys *keys)
 }
 
 u32 flow_hash_from_keys(struct flow_keys *keys);
+void skb_flow_get_icmp_tci(const struct sk_buff *skb,
+			   struct flow_dissector_key_icmp *key_icmp,
+			   void *data, int thoff, int hlen);
 
 static inline bool dissector_uses_key(const struct flow_dissector *flow_dissector,
 				      enum flow_dissector_key_id key_id)
diff --git a/net/core/flow_dissector.c b/net/core/flow_dissector.c
index 6443fac65ce8..0d014b81b269 100644
--- a/net/core/flow_dissector.c
+++ b/net/core/flow_dissector.c
@@ -147,27 +147,6 @@ int skb_flow_dissector_bpf_prog_detach(const union bpf_attr *attr)
 	mutex_unlock(&flow_dissector_mutex);
 	return 0;
 }
-/**
- * skb_flow_get_be16 - extract be16 entity
- * @skb: sk_buff to extract from
- * @poff: offset to extract at
- * @data: raw buffer pointer to the packet
- * @hlen: packet header length
- *
- * The function will try to retrieve a be32 entity at
- * offset poff
- */
-static __be16 skb_flow_get_be16(const struct sk_buff *skb, int poff,
-				void *data, int hlen)
-{
-	__be16 *u, _u;
-
-	u = __skb_header_pointer(skb, poff, sizeof(_u), data, hlen, &_u);
-	if (u)
-		return *u;
-
-	return 0;
-}
 
 /**
  * __skb_flow_get_ports - extract the upper layer ports and return them
@@ -203,8 +182,54 @@ __be32 __skb_flow_get_ports(const struct sk_buff *skb, int thoff, u8 ip_proto,
 }
 EXPORT_SYMBOL(__skb_flow_get_ports);
 
-/* If FLOW_DISSECTOR_KEY_ICMP is set, get the Type and Code from an ICMP packet
- * using skb_flow_get_be16().
+static bool icmp_has_id(u8 type)
+{
+	switch (type) {
+	case ICMP_ECHO:
+	case ICMP_ECHOREPLY:
+	case ICMP_TIMESTAMP:
+	case ICMP_TIMESTAMPREPLY:
+	case ICMPV6_ECHO_REQUEST:
+	case ICMPV6_ECHO_REPLY:
+		return true;
+	}
+
+	return false;
+}
+
+/**
+ * skb_flow_get_icmp_tci - extract ICMP(6) Type, Code and Identifier fields
+ * @skb: sk_buff to extract from
+ * @key_icmp: struct flow_dissector_key_icmp to fill
+ * @data: raw buffer pointer to the packet
+ * @toff: offset to extract at
+ * @hlen: packet header length
+ */
+void skb_flow_get_icmp_tci(const struct sk_buff *skb,
+			   struct flow_dissector_key_icmp *key_icmp,
+			   void *data, int thoff, int hlen)
+{
+	struct icmphdr *ih, _ih;
+
+	ih = __skb_header_pointer(skb, thoff, sizeof(_ih), data, hlen, &_ih);
+	if (!ih)
+		return;
+
+	key_icmp->type = ih->type;
+	key_icmp->code = ih->code;
+
+	/* As we use 0 to signal that the Id field is not present,
+	 * avoid confusion with packets without such field
+	 */
+	if (icmp_has_id(ih->type))
+		key_icmp->id = ih->un.echo.id ? : 1;
+	else
+		key_icmp->id = 0;
+}
+EXPORT_SYMBOL(skb_flow_get_icmp_tci);
+
+/* If FLOW_DISSECTOR_KEY_ICMP is set, dissect an ICMP packet
+ * using skb_flow_get_icmp_tci().
  */
 static void __skb_flow_dissect_icmp(const struct sk_buff *skb,
 				    struct flow_dissector *flow_dissector,
@@ -219,7 +244,8 @@ static void __skb_flow_dissect_icmp(const struct sk_buff *skb,
 	key_icmp = skb_flow_dissector_target(flow_dissector,
 					     FLOW_DISSECTOR_KEY_ICMP,
 					     target_container);
-	key_icmp->icmp = skb_flow_get_be16(skb, thoff, data, hlen);
+
+	skb_flow_get_icmp_tci(skb, key_icmp, data, thoff, hlen);
 }
 
 void skb_flow_dissect_meta(const struct sk_buff *skb,
-- 
2.21.0


^ permalink raw reply related	[flat|nested] 18+ messages in thread

* [PATCH net-next v2 4/4] bonding: balance ICMP echoes in layer3+4 mode
  2019-10-29 13:50 [PATCH net-next v2 0/4] ICMP flow improvements Matteo Croce
                   ` (2 preceding siblings ...)
  2019-10-29 13:50 ` [PATCH net-next v2 3/4] flow_dissector: extract more ICMP information Matteo Croce
@ 2019-10-29 13:50 ` Matteo Croce
  2019-10-29 18:35   ` Nikolay Aleksandrov
  2019-10-31  0:21 ` [PATCH net-next v2 0/4] ICMP flow improvements David Miller
  4 siblings, 1 reply; 18+ messages in thread
From: Matteo Croce @ 2019-10-29 13:50 UTC (permalink / raw)
  To: netdev
  Cc: Jay Vosburgh, Veaceslav Falico, Andy Gospodarek,
	David S . Miller ,
	Stanislav Fomichev, Daniel Borkmann, Song Liu,
	Alexei Starovoitov, Paul Blakey, linux-kernel

The bonding uses the L4 ports to balance flows between slaves. As the ICMP
protocol has no ports, those packets are sent all to the same device:

    # tcpdump -qltnni veth0 ip |sed 's/^/0: /' &
    # tcpdump -qltnni veth1 ip |sed 's/^/1: /' &
    # ping -qc1 192.168.0.2
    1: IP 192.168.0.1 > 192.168.0.2: ICMP echo request, id 315, seq 1, length 64
    1: IP 192.168.0.2 > 192.168.0.1: ICMP echo reply, id 315, seq 1, length 64
    # ping -qc1 192.168.0.2
    1: IP 192.168.0.1 > 192.168.0.2: ICMP echo request, id 316, seq 1, length 64
    1: IP 192.168.0.2 > 192.168.0.1: ICMP echo reply, id 316, seq 1, length 64
    # ping -qc1 192.168.0.2
    1: IP 192.168.0.1 > 192.168.0.2: ICMP echo request, id 317, seq 1, length 64
    1: IP 192.168.0.2 > 192.168.0.1: ICMP echo reply, id 317, seq 1, length 64

But some ICMP packets have an Identifier field which is
used to match packets within sessions, let's use this value in the hash
function to balance these packets between bond slaves:

    # ping -qc1 192.168.0.2
    0: IP 192.168.0.1 > 192.168.0.2: ICMP echo request, id 303, seq 1, length 64
    0: IP 192.168.0.2 > 192.168.0.1: ICMP echo reply, id 303, seq 1, length 64
    # ping -qc1 192.168.0.2
    1: IP 192.168.0.1 > 192.168.0.2: ICMP echo request, id 304, seq 1, length 64
    1: IP 192.168.0.2 > 192.168.0.1: ICMP echo reply, id 304, seq 1, length 64

Aso, let's use a flow_dissector_key which defines FLOW_DISSECTOR_KEY_ICMP,
so we can balance pings encapsulated in a tunnel when using mode encap3+4:

    # ping -q 192.168.1.2 -c1
    0: IP 192.168.0.1 > 192.168.0.2: GREv0, length 102: IP 192.168.1.1 > 192.168.1.2: ICMP echo request, id 585, seq 1, length 64
    0: IP 192.168.0.2 > 192.168.0.1: GREv0, length 102: IP 192.168.1.2 > 192.168.1.1: ICMP echo reply, id 585, seq 1, length 64
    # ping -q 192.168.1.2 -c1
    1: IP 192.168.0.1 > 192.168.0.2: GREv0, length 102: IP 192.168.1.1 > 192.168.1.2: ICMP echo request, id 586, seq 1, length 64
    1: IP 192.168.0.2 > 192.168.0.1: GREv0, length 102: IP 192.168.1.2 > 192.168.1.1: ICMP echo reply, id 586, seq 1, length 64

Signed-off-by: Matteo Croce <mcroce@redhat.com>
---
 drivers/net/bonding/bond_main.c | 77 ++++++++++++++++++++++++++++++---
 1 file changed, 70 insertions(+), 7 deletions(-)

diff --git a/drivers/net/bonding/bond_main.c b/drivers/net/bonding/bond_main.c
index 21d8fcc83c9c..3e496e746cc6 100644
--- a/drivers/net/bonding/bond_main.c
+++ b/drivers/net/bonding/bond_main.c
@@ -200,6 +200,51 @@ atomic_t netpoll_block_tx = ATOMIC_INIT(0);
 
 unsigned int bond_net_id __read_mostly;
 
+static const struct flow_dissector_key flow_keys_bonding_keys[] = {
+	{
+		.key_id = FLOW_DISSECTOR_KEY_CONTROL,
+		.offset = offsetof(struct flow_keys, control),
+	},
+	{
+		.key_id = FLOW_DISSECTOR_KEY_BASIC,
+		.offset = offsetof(struct flow_keys, basic),
+	},
+	{
+		.key_id = FLOW_DISSECTOR_KEY_IPV4_ADDRS,
+		.offset = offsetof(struct flow_keys, addrs.v4addrs),
+	},
+	{
+		.key_id = FLOW_DISSECTOR_KEY_IPV6_ADDRS,
+		.offset = offsetof(struct flow_keys, addrs.v6addrs),
+	},
+	{
+		.key_id = FLOW_DISSECTOR_KEY_TIPC,
+		.offset = offsetof(struct flow_keys, addrs.tipckey),
+	},
+	{
+		.key_id = FLOW_DISSECTOR_KEY_PORTS,
+		.offset = offsetof(struct flow_keys, ports),
+	},
+	{
+		.key_id = FLOW_DISSECTOR_KEY_ICMP,
+		.offset = offsetof(struct flow_keys, icmp),
+	},
+	{
+		.key_id = FLOW_DISSECTOR_KEY_VLAN,
+		.offset = offsetof(struct flow_keys, vlan),
+	},
+	{
+		.key_id = FLOW_DISSECTOR_KEY_FLOW_LABEL,
+		.offset = offsetof(struct flow_keys, tags),
+	},
+	{
+		.key_id = FLOW_DISSECTOR_KEY_GRE_KEYID,
+		.offset = offsetof(struct flow_keys, keyid),
+	},
+};
+
+static struct flow_dissector flow_keys_bonding __read_mostly;
+
 /*-------------------------- Forward declarations ---------------------------*/
 
 static int bond_init(struct net_device *bond_dev);
@@ -3263,10 +3308,14 @@ static bool bond_flow_dissect(struct bonding *bond, struct sk_buff *skb,
 	const struct iphdr *iph;
 	int noff, proto = -1;
 
-	if (bond->params.xmit_policy > BOND_XMIT_POLICY_LAYER23)
-		return skb_flow_dissect_flow_keys(skb, fk, 0);
+	if (bond->params.xmit_policy > BOND_XMIT_POLICY_LAYER23) {
+		memset(fk, 0, sizeof(*fk));
+		return __skb_flow_dissect(NULL, skb, &flow_keys_bonding,
+					  fk, NULL, 0, 0, 0, 0);
+	}
 
 	fk->ports.ports = 0;
+	memset(&fk->icmp, 0, sizeof(fk->icmp));
 	noff = skb_network_offset(skb);
 	if (skb->protocol == htons(ETH_P_IP)) {
 		if (unlikely(!pskb_may_pull(skb, noff + sizeof(*iph))))
@@ -3286,8 +3335,14 @@ static bool bond_flow_dissect(struct bonding *bond, struct sk_buff *skb,
 	} else {
 		return false;
 	}
-	if (bond->params.xmit_policy == BOND_XMIT_POLICY_LAYER34 && proto >= 0)
-		fk->ports.ports = skb_flow_get_ports(skb, noff, proto);
+	if (bond->params.xmit_policy == BOND_XMIT_POLICY_LAYER34 && proto >= 0) {
+		if (proto == IPPROTO_ICMP || proto == IPPROTO_ICMPV6)
+			skb_flow_get_icmp_tci(skb, &fk->icmp, skb->data,
+					      skb_transport_offset(skb),
+					      skb_headlen(skb));
+		else
+			fk->ports.ports = skb_flow_get_ports(skb, noff, proto);
+	}
 
 	return true;
 }
@@ -3314,10 +3369,14 @@ u32 bond_xmit_hash(struct bonding *bond, struct sk_buff *skb)
 		return bond_eth_hash(skb);
 
 	if (bond->params.xmit_policy == BOND_XMIT_POLICY_LAYER23 ||
-	    bond->params.xmit_policy == BOND_XMIT_POLICY_ENCAP23)
+	    bond->params.xmit_policy == BOND_XMIT_POLICY_ENCAP23) {
 		hash = bond_eth_hash(skb);
-	else
-		hash = (__force u32)flow.ports.ports;
+	} else {
+		if (flow.icmp.id)
+			memcpy(&hash, &flow.icmp, sizeof(hash));
+		else
+			memcpy(&hash, &flow.ports.ports, sizeof(hash));
+	}
 	hash ^= (__force u32)flow_get_u32_dst(&flow) ^
 		(__force u32)flow_get_u32_src(&flow);
 	hash ^= (hash >> 16);
@@ -4901,6 +4960,10 @@ static int __init bonding_init(void)
 			goto err;
 	}
 
+	skb_flow_dissector_init(&flow_keys_bonding,
+				flow_keys_bonding_keys,
+				ARRAY_SIZE(flow_keys_bonding_keys));
+
 	register_netdevice_notifier(&bond_netdev_notifier);
 out:
 	return res;
-- 
2.21.0


^ permalink raw reply related	[flat|nested] 18+ messages in thread

* Re: [PATCH net-next v2 4/4] bonding: balance ICMP echoes in layer3+4 mode
  2019-10-29 13:50 ` [PATCH net-next v2 4/4] bonding: balance ICMP echoes in layer3+4 mode Matteo Croce
@ 2019-10-29 18:35   ` Nikolay Aleksandrov
  2019-10-29 18:41     ` Nikolay Aleksandrov
  2019-10-29 21:03     ` Eric Dumazet
  0 siblings, 2 replies; 18+ messages in thread
From: Nikolay Aleksandrov @ 2019-10-29 18:35 UTC (permalink / raw)
  To: Matteo Croce, netdev
  Cc: Jay Vosburgh, Veaceslav Falico, Andy Gospodarek,
	David S . Miller, Stanislav Fomichev, Daniel Borkmann, Song Liu,
	Alexei Starovoitov, Paul Blakey, linux-kernel

On 29/10/2019 15:50, Matteo Croce wrote:
> The bonding uses the L4 ports to balance flows between slaves. As the ICMP
> protocol has no ports, those packets are sent all to the same device:
> 
>     # tcpdump -qltnni veth0 ip |sed 's/^/0: /' &
>     # tcpdump -qltnni veth1 ip |sed 's/^/1: /' &
>     # ping -qc1 192.168.0.2
>     1: IP 192.168.0.1 > 192.168.0.2: ICMP echo request, id 315, seq 1, length 64
>     1: IP 192.168.0.2 > 192.168.0.1: ICMP echo reply, id 315, seq 1, length 64
>     # ping -qc1 192.168.0.2
>     1: IP 192.168.0.1 > 192.168.0.2: ICMP echo request, id 316, seq 1, length 64
>     1: IP 192.168.0.2 > 192.168.0.1: ICMP echo reply, id 316, seq 1, length 64
>     # ping -qc1 192.168.0.2
>     1: IP 192.168.0.1 > 192.168.0.2: ICMP echo request, id 317, seq 1, length 64
>     1: IP 192.168.0.2 > 192.168.0.1: ICMP echo reply, id 317, seq 1, length 64
> 
> But some ICMP packets have an Identifier field which is
> used to match packets within sessions, let's use this value in the hash
> function to balance these packets between bond slaves:
> 
>     # ping -qc1 192.168.0.2
>     0: IP 192.168.0.1 > 192.168.0.2: ICMP echo request, id 303, seq 1, length 64
>     0: IP 192.168.0.2 > 192.168.0.1: ICMP echo reply, id 303, seq 1, length 64
>     # ping -qc1 192.168.0.2
>     1: IP 192.168.0.1 > 192.168.0.2: ICMP echo request, id 304, seq 1, length 64
>     1: IP 192.168.0.2 > 192.168.0.1: ICMP echo reply, id 304, seq 1, length 64
> 
> Aso, let's use a flow_dissector_key which defines FLOW_DISSECTOR_KEY_ICMP,

Also ?

> so we can balance pings encapsulated in a tunnel when using mode encap3+4:
> 
>     # ping -q 192.168.1.2 -c1
>     0: IP 192.168.0.1 > 192.168.0.2: GREv0, length 102: IP 192.168.1.1 > 192.168.1.2: ICMP echo request, id 585, seq 1, length 64
>     0: IP 192.168.0.2 > 192.168.0.1: GREv0, length 102: IP 192.168.1.2 > 192.168.1.1: ICMP echo reply, id 585, seq 1, length 64
>     # ping -q 192.168.1.2 -c1
>     1: IP 192.168.0.1 > 192.168.0.2: GREv0, length 102: IP 192.168.1.1 > 192.168.1.2: ICMP echo request, id 586, seq 1, length 64
>     1: IP 192.168.0.2 > 192.168.0.1: GREv0, length 102: IP 192.168.1.2 > 192.168.1.1: ICMP echo reply, id 586, seq 1, length 64
> 
> Signed-off-by: Matteo Croce <mcroce@redhat.com>
> ---
>  drivers/net/bonding/bond_main.c | 77 ++++++++++++++++++++++++++++++---
>  1 file changed, 70 insertions(+), 7 deletions(-)
> 

Hi Matteo,
Wouldn't it be more useful and simpler to use some field to choose the slave (override the hash
completely) in a deterministic way from user-space ?
For example the mark can be interpreted as a slave id in the bonding (should be
optional, to avoid breaking existing setups). ping already supports -m and
anything else can set it, this way it can be used to do monitoring for a specific
slave with any protocol and would be a much simpler change.
User-space can then implement any logic for the monitoring case and as a minor bonus
can monitor the slaves in parallel. And the opposite as well - if people don't want
these balanced for some reason, they wouldn't enable it.

Or maybe I've misunderstood why this change is needed. :)
It would actually be nice to include the use-case which brought this on
in the commit message.

Cheers,
 Nik


^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH net-next v2 4/4] bonding: balance ICMP echoes in layer3+4 mode
  2019-10-29 18:35   ` Nikolay Aleksandrov
@ 2019-10-29 18:41     ` Nikolay Aleksandrov
  2019-10-29 19:45       ` Matteo Croce
  2019-10-29 21:03     ` Eric Dumazet
  1 sibling, 1 reply; 18+ messages in thread
From: Nikolay Aleksandrov @ 2019-10-29 18:41 UTC (permalink / raw)
  To: Matteo Croce, netdev
  Cc: Jay Vosburgh, Veaceslav Falico, Andy Gospodarek,
	David S . Miller, Stanislav Fomichev, Daniel Borkmann, Song Liu,
	Alexei Starovoitov, Paul Blakey, linux-kernel

On 29/10/2019 20:35, Nikolay Aleksandrov wrote:
> On 29/10/2019 15:50, Matteo Croce wrote:
>> The bonding uses the L4 ports to balance flows between slaves. As the ICMP
>> protocol has no ports, those packets are sent all to the same device:
>>
>>     # tcpdump -qltnni veth0 ip |sed 's/^/0: /' &
>>     # tcpdump -qltnni veth1 ip |sed 's/^/1: /' &
>>     # ping -qc1 192.168.0.2
>>     1: IP 192.168.0.1 > 192.168.0.2: ICMP echo request, id 315, seq 1, length 64
>>     1: IP 192.168.0.2 > 192.168.0.1: ICMP echo reply, id 315, seq 1, length 64
>>     # ping -qc1 192.168.0.2
>>     1: IP 192.168.0.1 > 192.168.0.2: ICMP echo request, id 316, seq 1, length 64
>>     1: IP 192.168.0.2 > 192.168.0.1: ICMP echo reply, id 316, seq 1, length 64
>>     # ping -qc1 192.168.0.2
>>     1: IP 192.168.0.1 > 192.168.0.2: ICMP echo request, id 317, seq 1, length 64
>>     1: IP 192.168.0.2 > 192.168.0.1: ICMP echo reply, id 317, seq 1, length 64
>>
>> But some ICMP packets have an Identifier field which is
>> used to match packets within sessions, let's use this value in the hash
>> function to balance these packets between bond slaves:
>>
>>     # ping -qc1 192.168.0.2
>>     0: IP 192.168.0.1 > 192.168.0.2: ICMP echo request, id 303, seq 1, length 64
>>     0: IP 192.168.0.2 > 192.168.0.1: ICMP echo reply, id 303, seq 1, length 64
>>     # ping -qc1 192.168.0.2
>>     1: IP 192.168.0.1 > 192.168.0.2: ICMP echo request, id 304, seq 1, length 64
>>     1: IP 192.168.0.2 > 192.168.0.1: ICMP echo reply, id 304, seq 1, length 64
>>
>> Aso, let's use a flow_dissector_key which defines FLOW_DISSECTOR_KEY_ICMP,
> 
> Also ?
> 
>> so we can balance pings encapsulated in a tunnel when using mode encap3+4:
>>
>>     # ping -q 192.168.1.2 -c1
>>     0: IP 192.168.0.1 > 192.168.0.2: GREv0, length 102: IP 192.168.1.1 > 192.168.1.2: ICMP echo request, id 585, seq 1, length 64
>>     0: IP 192.168.0.2 > 192.168.0.1: GREv0, length 102: IP 192.168.1.2 > 192.168.1.1: ICMP echo reply, id 585, seq 1, length 64
>>     # ping -q 192.168.1.2 -c1
>>     1: IP 192.168.0.1 > 192.168.0.2: GREv0, length 102: IP 192.168.1.1 > 192.168.1.2: ICMP echo request, id 586, seq 1, length 64
>>     1: IP 192.168.0.2 > 192.168.0.1: GREv0, length 102: IP 192.168.1.2 > 192.168.1.1: ICMP echo reply, id 586, seq 1, length 64
>>
>> Signed-off-by: Matteo Croce <mcroce@redhat.com>
>> ---
>>  drivers/net/bonding/bond_main.c | 77 ++++++++++++++++++++++++++++++---
>>  1 file changed, 70 insertions(+), 7 deletions(-)
>>
> 
> Hi Matteo,
> Wouldn't it be more useful and simpler to use some field to choose the slave (override the hash
> completely) in a deterministic way from user-space ?
> For example the mark can be interpreted as a slave id in the bonding (should be
> optional, to avoid breaking existing setups). ping already supports -m and
> anything else can set it, this way it can be used to do monitoring for a specific
> slave with any protocol and would be a much simpler change.
> User-space can then implement any logic for the monitoring case and as a minor bonus
> can monitor the slaves in parallel. And the opposite as well - if people don't want
> these balanced for some reason, they wouldn't enable it.
> 

Ooh I just noticed you'd like to balance replies as well. Nevermind

> Or maybe I've misunderstood why this change is needed. :)
> It would actually be nice to include the use-case which brought this on
> in the commit message.
> 
> Cheers,
>  Nik
> 


^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH net-next v2 4/4] bonding: balance ICMP echoes in layer3+4 mode
  2019-10-29 18:41     ` Nikolay Aleksandrov
@ 2019-10-29 19:45       ` Matteo Croce
  2019-10-29 20:07         ` Nikolay Aleksandrov
  0 siblings, 1 reply; 18+ messages in thread
From: Matteo Croce @ 2019-10-29 19:45 UTC (permalink / raw)
  To: Nikolay Aleksandrov
  Cc: netdev, Jay Vosburgh, Veaceslav Falico, Andy Gospodarek,
	David S . Miller, Stanislav Fomichev, Daniel Borkmann, Song Liu,
	Alexei Starovoitov, Paul Blakey, LKML

On Tue, Oct 29, 2019 at 7:41 PM Nikolay Aleksandrov
<nikolay@cumulusnetworks.com> wrote:
>
> On 29/10/2019 20:35, Nikolay Aleksandrov wrote:
> > Hi Matteo,
> > Wouldn't it be more useful and simpler to use some field to choose the slave (override the hash
> > completely) in a deterministic way from user-space ?
> > For example the mark can be interpreted as a slave id in the bonding (should be
> > optional, to avoid breaking existing setups). ping already supports -m and
> > anything else can set it, this way it can be used to do monitoring for a specific
> > slave with any protocol and would be a much simpler change.
> > User-space can then implement any logic for the monitoring case and as a minor bonus
> > can monitor the slaves in parallel. And the opposite as well - if people don't want
> > these balanced for some reason, they wouldn't enable it.
> >
>
> Ooh I just noticed you'd like to balance replies as well. Nevermind
>

Also, the bonding could be in a router in the middle so no way to read the mark.

-- 
Matteo Croce
per aspera ad upstream


^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH net-next v2 4/4] bonding: balance ICMP echoes in layer3+4 mode
  2019-10-29 19:45       ` Matteo Croce
@ 2019-10-29 20:07         ` Nikolay Aleksandrov
  0 siblings, 0 replies; 18+ messages in thread
From: Nikolay Aleksandrov @ 2019-10-29 20:07 UTC (permalink / raw)
  To: Matteo Croce
  Cc: netdev, Jay Vosburgh, Veaceslav Falico, Andy Gospodarek,
	David S . Miller, Stanislav Fomichev, Daniel Borkmann, Song Liu,
	Alexei Starovoitov, Paul Blakey, LKML

On 29/10/2019 21:45, Matteo Croce wrote:
> On Tue, Oct 29, 2019 at 7:41 PM Nikolay Aleksandrov
> <nikolay@cumulusnetworks.com> wrote:
>>
>> On 29/10/2019 20:35, Nikolay Aleksandrov wrote:
>>> Hi Matteo,
>>> Wouldn't it be more useful and simpler to use some field to choose the slave (override the hash
>>> completely) in a deterministic way from user-space ?
>>> For example the mark can be interpreted as a slave id in the bonding (should be
>>> optional, to avoid breaking existing setups). ping already supports -m and
>>> anything else can set it, this way it can be used to do monitoring for a specific
>>> slave with any protocol and would be a much simpler change.
>>> User-space can then implement any logic for the monitoring case and as a minor bonus
>>> can monitor the slaves in parallel. And the opposite as well - if people don't want
>>> these balanced for some reason, they wouldn't enable it.
>>>
>>
>> Ooh I just noticed you'd like to balance replies as well. Nevermind
>>
> 
> Also, the bonding could be in a router in the middle so no way to read the mark.
> 

Yeah, of course. I was just thinking from the host monitoring POV as I thought
that was the initial intent (reading the last set's discussion).

Anyway the patch looks good to me,
Reviewed-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>


^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH net-next v2 4/4] bonding: balance ICMP echoes in layer3+4 mode
  2019-10-29 18:35   ` Nikolay Aleksandrov
  2019-10-29 18:41     ` Nikolay Aleksandrov
@ 2019-10-29 21:03     ` Eric Dumazet
  2019-10-29 21:50       ` Nikolay Aleksandrov
  2019-10-29 23:03       ` Matteo Croce
  1 sibling, 2 replies; 18+ messages in thread
From: Eric Dumazet @ 2019-10-29 21:03 UTC (permalink / raw)
  To: Nikolay Aleksandrov, Matteo Croce, netdev
  Cc: Jay Vosburgh, Veaceslav Falico, Andy Gospodarek,
	David S . Miller, Stanislav Fomichev, Daniel Borkmann, Song Liu,
	Alexei Starovoitov, Paul Blakey, linux-kernel



On 10/29/19 11:35 AM, Nikolay Aleksandrov wrote:

> Hi Matteo,
> Wouldn't it be more useful and simpler to use some field to choose the slave (override the hash
> completely) in a deterministic way from user-space ?
> For example the mark can be interpreted as a slave id in the bonding (should be
> optional, to avoid breaking existing setups). ping already supports -m and
> anything else can set it, this way it can be used to do monitoring for a specific
> slave with any protocol and would be a much simpler change.
> User-space can then implement any logic for the monitoring case and as a minor bonus
> can monitor the slaves in parallel. And the opposite as well - if people don't want
> these balanced for some reason, they wouldn't enable it.
> 

I kind of agree giving user more control. But I do not believe we need to use the mark
(this might be already used by other layers)

TCP uses sk->sk_hash to feed skb->hash.

Anything using skb_set_owner_w() is also using sk->sk_hash if set.

So presumably we could add a generic SO_TXHASH socket option to let user space
read/set this field.


^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH net-next v2 4/4] bonding: balance ICMP echoes in layer3+4 mode
  2019-10-29 21:03     ` Eric Dumazet
@ 2019-10-29 21:50       ` Nikolay Aleksandrov
  2019-10-29 23:04         ` Eric Dumazet
  2019-10-29 23:03       ` Matteo Croce
  1 sibling, 1 reply; 18+ messages in thread
From: Nikolay Aleksandrov @ 2019-10-29 21:50 UTC (permalink / raw)
  To: Eric Dumazet, Matteo Croce, netdev
  Cc: Jay Vosburgh, Veaceslav Falico, Andy Gospodarek,
	David S . Miller, Stanislav Fomichev, Daniel Borkmann, Song Liu,
	Alexei Starovoitov, Paul Blakey, linux-kernel

On 29/10/2019 23:03, Eric Dumazet wrote:
> 
> 
> On 10/29/19 11:35 AM, Nikolay Aleksandrov wrote:
> 
>> Hi Matteo,
>> Wouldn't it be more useful and simpler to use some field to choose the slave (override the hash
>> completely) in a deterministic way from user-space ?
>> For example the mark can be interpreted as a slave id in the bonding (should be
>> optional, to avoid breaking existing setups). ping already supports -m and
>> anything else can set it, this way it can be used to do monitoring for a specific
>> slave with any protocol and would be a much simpler change.
>> User-space can then implement any logic for the monitoring case and as a minor bonus
>> can monitor the slaves in parallel. And the opposite as well - if people don't want
>> these balanced for some reason, they wouldn't enable it.
>>
> 
> I kind of agree giving user more control. But I do not believe we need to use the mark
> (this might be already used by other layers)
> 
> TCP uses sk->sk_hash to feed skb->hash.
> 
> Anything using skb_set_owner_w() is also using sk->sk_hash if set.
> 
> So presumably we could add a generic SO_TXHASH socket option to let user space
> read/set this field.
> 

Right, I was just giving it as an example. Your suggestion sounds much better and
wouldn't interfere with other layers, plus we already use skb->hash in bond_xmit_hash()
and skb_set_owner_w() sets l4_hash if txhash is present which is perfect.

One thing - how do we deal with sk_rethink_txhash() ? I guess we'll need some way to
signal that the user specified the txhash and it is not to be recomputed ?
That can also be used to avoid the connect txhash set as well if SO_TXHASH was set prior
to the connect. It's quite late here, I'll look into it more tomorrow. :)

Thanks,
 Nik






^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH net-next v2 4/4] bonding: balance ICMP echoes in layer3+4 mode
  2019-10-29 21:03     ` Eric Dumazet
  2019-10-29 21:50       ` Nikolay Aleksandrov
@ 2019-10-29 23:03       ` Matteo Croce
  2019-10-29 23:14         ` Eric Dumazet
  1 sibling, 1 reply; 18+ messages in thread
From: Matteo Croce @ 2019-10-29 23:03 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Nikolay Aleksandrov, netdev, Jay Vosburgh, Veaceslav Falico,
	Andy Gospodarek, David S . Miller, Stanislav Fomichev,
	Daniel Borkmann, Song Liu, Alexei Starovoitov, Paul Blakey, LKML

On Tue, Oct 29, 2019 at 10:03 PM Eric Dumazet <eric.dumazet@gmail.com> wrote:
>
>
>
> On 10/29/19 11:35 AM, Nikolay Aleksandrov wrote:
>
> > Hi Matteo,
> > Wouldn't it be more useful and simpler to use some field to choose the slave (override the hash
> > completely) in a deterministic way from user-space ?
> > For example the mark can be interpreted as a slave id in the bonding (should be
> > optional, to avoid breaking existing setups). ping already supports -m and
> > anything else can set it, this way it can be used to do monitoring for a specific
> > slave with any protocol and would be a much simpler change.
> > User-space can then implement any logic for the monitoring case and as a minor bonus
> > can monitor the slaves in parallel. And the opposite as well - if people don't want
> > these balanced for some reason, they wouldn't enable it.
> >
>
> I kind of agree giving user more control. But I do not believe we need to use the mark
> (this might be already used by other layers)
>
> TCP uses sk->sk_hash to feed skb->hash.
>
> Anything using skb_set_owner_w() is also using sk->sk_hash if set.
>
> So presumably we could add a generic SO_TXHASH socket option to let user space
> read/set this field.
>

Hi Eric,

this would work for locally generated echoes, but what about forwarded packets?
The point behind my changeset is to provide consistent results within
a session by using the same path for request and response,
but avoid all sessions flowing to the same path.
This should resemble what happens with TCP and UDP: different
connections, different port, probably a different path. And by doing
this in the flow dissector, other applications could benefit it.

Also, this should somewhat balance the traffic of a router forwarding
those packets. Maybe it's not so much in percentage, but in some
gateways be a considerable volume.

Regards,
-- 
Matteo Croce
per aspera ad upstream


^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH net-next v2 4/4] bonding: balance ICMP echoes in layer3+4 mode
  2019-10-29 21:50       ` Nikolay Aleksandrov
@ 2019-10-29 23:04         ` Eric Dumazet
  2019-10-30 12:48           ` Nikolay Aleksandrov
  0 siblings, 1 reply; 18+ messages in thread
From: Eric Dumazet @ 2019-10-29 23:04 UTC (permalink / raw)
  To: Nikolay Aleksandrov, Eric Dumazet, Matteo Croce, netdev
  Cc: Jay Vosburgh, Veaceslav Falico, Andy Gospodarek,
	David S . Miller, Stanislav Fomichev, Daniel Borkmann, Song Liu,
	Alexei Starovoitov, Paul Blakey, linux-kernel



On 10/29/19 2:50 PM, Nikolay Aleksandrov wrote:

> Right, I was just giving it as an example. Your suggestion sounds much better and
> wouldn't interfere with other layers, plus we already use skb->hash in bond_xmit_hash()
> and skb_set_owner_w() sets l4_hash if txhash is present which is perfect.
> 
> One thing - how do we deal with sk_rethink_txhash() ? I guess we'll need some way to
> signal that the user specified the txhash and it is not to be recomputed ?
> That can also be used to avoid the connect txhash set as well if SO_TXHASH was set prior
> to the connect. It's quite late here, I'll look into it more tomorrow. :)

I guess that we have something similar with SO_RCVBUF/SO_SNDBUF

autotuning is disabled when/if they are used :

 SOCK_RCVBUF_LOCK & SOCK_SNDBUF_LOCK

We could add a SOCK_TXHASH_LOCK so that sk_rethink_txhash() does nothing if
user forced a TXHASH value.

Something like the following (probably not complete) patch.

diff --git a/include/net/sock.h b/include/net/sock.h
index 380312cc67a9d9ee8720eb2db82b1f7f8a5615ab..a8882738710eaa9d9d629e1207837a798401a594 100644
--- a/include/net/sock.h
+++ b/include/net/sock.h
@@ -1354,6 +1354,7 @@ static inline int __sk_prot_rehash(struct sock *sk)
 #define SOCK_RCVBUF_LOCK       2
 #define SOCK_BINDADDR_LOCK     4
 #define SOCK_BINDPORT_LOCK     8
+#define SOCK_TXHASH_LOCK       16
 
 struct socket_alloc {
        struct socket socket;
@@ -1852,7 +1853,8 @@ static inline u32 net_tx_rndhash(void)
 
 static inline void sk_set_txhash(struct sock *sk)
 {
-       sk->sk_txhash = net_tx_rndhash();
+       if (!(sk->sk_userlocks & SOCK_TXHASH_LOCK))
+               sk->sk_txhash = net_tx_rndhash();
 }
 
 static inline void sk_rethink_txhash(struct sock *sk)
diff --git a/include/uapi/asm-generic/socket.h b/include/uapi/asm-generic/socket.h
index 77f7c1638eb1ce7d3e143bbffd60056e472b1122..998be6ee7991de3a76d4ad33df3a38dbe791eae8 100644
--- a/include/uapi/asm-generic/socket.h
+++ b/include/uapi/asm-generic/socket.h
@@ -118,6 +118,7 @@
 #define SO_SNDTIMEO_NEW         67
 
 #define SO_DETACH_REUSEPORT_BPF 68
+#define SO_TXHASH              69
 
 #if !defined(__KERNEL__)
 
diff --git a/net/core/sock.c b/net/core/sock.c
index 997b352c2a72ee39f00b102a553ac1191202b74f..85b85dffd462bc3b497e0432100ff24b759832e0 100644
--- a/net/core/sock.c
+++ b/net/core/sock.c
@@ -770,6 +770,10 @@ int sock_setsockopt(struct socket *sock, int level, int optname,
        case SO_BROADCAST:
                sock_valbool_flag(sk, SOCK_BROADCAST, valbool);
                break;
+       case SO_TXHASH:
+               sk->sk_txhash = val;
+               sk->sk_userlocks |= SOCK_TXHASH_LOCK;
+               break;
        case SO_SNDBUF:
                /* Don't error on this BSD doesn't and if you think
                 * about it this is right. Otherwise apps have to
@@ -1249,6 +1253,10 @@ int sock_getsockopt(struct socket *sock, int level, int optname,
                v.val = sock_flag(sk, SOCK_BROADCAST);
                break;
 
+       case SO_TXHASH:
+               v.val = sk->sk_txhash;
+               break;
+
        case SO_SNDBUF:
                v.val = sk->sk_sndbuf;
                break;

^ permalink raw reply related	[flat|nested] 18+ messages in thread

* Re: [PATCH net-next v2 4/4] bonding: balance ICMP echoes in layer3+4 mode
  2019-10-29 23:03       ` Matteo Croce
@ 2019-10-29 23:14         ` Eric Dumazet
  2019-10-29 23:19           ` Matteo Croce
  0 siblings, 1 reply; 18+ messages in thread
From: Eric Dumazet @ 2019-10-29 23:14 UTC (permalink / raw)
  To: Matteo Croce, Eric Dumazet
  Cc: Nikolay Aleksandrov, netdev, Jay Vosburgh, Veaceslav Falico,
	Andy Gospodarek, David S . Miller, Stanislav Fomichev,
	Daniel Borkmann, Song Liu, Alexei Starovoitov, Paul Blakey, LKML



On 10/29/19 4:03 PM, Matteo Croce wrote:

> Hi Eric,
> 
> this would work for locally generated echoes, but what about forwarded packets?
> The point behind my changeset is to provide consistent results within
> a session by using the same path for request and response,
> but avoid all sessions flowing to the same path.
> This should resemble what happens with TCP and UDP: different
> connections, different port, probably a different path. And by doing
> this in the flow dissector, other applications could benefit it.

In principle it is fine, but I was not sure of overall impact of your change
on performance for 99.9% of packets that are not ICMP :)


^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH net-next v2 4/4] bonding: balance ICMP echoes in layer3+4 mode
  2019-10-29 23:14         ` Eric Dumazet
@ 2019-10-29 23:19           ` Matteo Croce
  2019-10-31 16:22             ` Matteo Croce
  0 siblings, 1 reply; 18+ messages in thread
From: Matteo Croce @ 2019-10-29 23:19 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Nikolay Aleksandrov, netdev, Jay Vosburgh, Veaceslav Falico,
	Andy Gospodarek, David S . Miller, Stanislav Fomichev,
	Daniel Borkmann, Song Liu, Alexei Starovoitov, Paul Blakey, LKML

On Wed, Oct 30, 2019 at 12:14 AM Eric Dumazet <eric.dumazet@gmail.com> wrote:
>
>
>
> On 10/29/19 4:03 PM, Matteo Croce wrote:
>
> > Hi Eric,
> >
> > this would work for locally generated echoes, but what about forwarded packets?
> > The point behind my changeset is to provide consistent results within
> > a session by using the same path for request and response,
> > but avoid all sessions flowing to the same path.
> > This should resemble what happens with TCP and UDP: different
> > connections, different port, probably a different path. And by doing
> > this in the flow dissector, other applications could benefit it.
>
> In principle it is fine, but I was not sure of overall impact of your change
> on performance for 99.9% of packets that are not ICMP :)
>

Good point. I didn't measure it (I will) but all the code additions
are under some if (proto == ICMP) or similar.
My guess is that performance shouldn't change for non ICMP traffic,
but I'm curious to test it.

-- 
Matteo Croce
per aspera ad upstream


^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH net-next v2 4/4] bonding: balance ICMP echoes in layer3+4 mode
  2019-10-29 23:04         ` Eric Dumazet
@ 2019-10-30 12:48           ` Nikolay Aleksandrov
  0 siblings, 0 replies; 18+ messages in thread
From: Nikolay Aleksandrov @ 2019-10-30 12:48 UTC (permalink / raw)
  To: Eric Dumazet, Matteo Croce, netdev
  Cc: Jay Vosburgh, Veaceslav Falico, Andy Gospodarek,
	David S . Miller, Stanislav Fomichev, Daniel Borkmann, Song Liu,
	Alexei Starovoitov, Paul Blakey, linux-kernel

On 30/10/2019 01:04, Eric Dumazet wrote:
> 
> 
> On 10/29/19 2:50 PM, Nikolay Aleksandrov wrote:
> 
>> Right, I was just giving it as an example. Your suggestion sounds much better and
>> wouldn't interfere with other layers, plus we already use skb->hash in bond_xmit_hash()
>> and skb_set_owner_w() sets l4_hash if txhash is present which is perfect.
>>
>> One thing - how do we deal with sk_rethink_txhash() ? I guess we'll need some way to
>> signal that the user specified the txhash and it is not to be recomputed ?
>> That can also be used to avoid the connect txhash set as well if SO_TXHASH was set prior
>> to the connect. It's quite late here, I'll look into it more tomorrow. :)
> 
> I guess that we have something similar with SO_RCVBUF/SO_SNDBUF
> 
> autotuning is disabled when/if they are used :
> 
>  SOCK_RCVBUF_LOCK & SOCK_SNDBUF_LOCK
> 
> We could add a SOCK_TXHASH_LOCK so that sk_rethink_txhash() does nothing if
> user forced a TXHASH value.
> 
> Something like the following (probably not complete) patch.
> 

Actually I think it's ok. I had a similar change last night sans the userlocks.
I just built and tested a kernel with it successfully using the bonding.
The only case that doesn't seem to work is a raw socket without hdrincl, IIUC due to
the direct alloc_skb() (transhdrlen == 0) in ip_append_data().
Unless you have other concerns could you please submit it formally ?

> diff --git a/include/net/sock.h b/include/net/sock.h
> index 380312cc67a9d9ee8720eb2db82b1f7f8a5615ab..a8882738710eaa9d9d629e1207837a798401a594 100644
> --- a/include/net/sock.h
> +++ b/include/net/sock.h
> @@ -1354,6 +1354,7 @@ static inline int __sk_prot_rehash(struct sock *sk)
>  #define SOCK_RCVBUF_LOCK       2
>  #define SOCK_BINDADDR_LOCK     4
>  #define SOCK_BINDPORT_LOCK     8
> +#define SOCK_TXHASH_LOCK       16
>  
>  struct socket_alloc {
>         struct socket socket;
> @@ -1852,7 +1853,8 @@ static inline u32 net_tx_rndhash(void)
>  
>  static inline void sk_set_txhash(struct sock *sk)
>  {
> -       sk->sk_txhash = net_tx_rndhash();
> +       if (!(sk->sk_userlocks & SOCK_TXHASH_LOCK))
> +               sk->sk_txhash = net_tx_rndhash();
>  }
>  
>  static inline void sk_rethink_txhash(struct sock *sk)
> diff --git a/include/uapi/asm-generic/socket.h b/include/uapi/asm-generic/socket.h
> index 77f7c1638eb1ce7d3e143bbffd60056e472b1122..998be6ee7991de3a76d4ad33df3a38dbe791eae8 100644
> --- a/include/uapi/asm-generic/socket.h
> +++ b/include/uapi/asm-generic/socket.h
> @@ -118,6 +118,7 @@
>  #define SO_SNDTIMEO_NEW         67
>  
>  #define SO_DETACH_REUSEPORT_BPF 68
> +#define SO_TXHASH              69
>  
>  #if !defined(__KERNEL__)
>  
> diff --git a/net/core/sock.c b/net/core/sock.c
> index 997b352c2a72ee39f00b102a553ac1191202b74f..85b85dffd462bc3b497e0432100ff24b759832e0 100644
> --- a/net/core/sock.c
> +++ b/net/core/sock.c
> @@ -770,6 +770,10 @@ int sock_setsockopt(struct socket *sock, int level, int optname,
>         case SO_BROADCAST:
>                 sock_valbool_flag(sk, SOCK_BROADCAST, valbool);
>                 break;
> +       case SO_TXHASH:
> +               sk->sk_txhash = val;
> +               sk->sk_userlocks |= SOCK_TXHASH_LOCK;
> +               break;
>         case SO_SNDBUF:
>                 /* Don't error on this BSD doesn't and if you think
>                  * about it this is right. Otherwise apps have to
> @@ -1249,6 +1253,10 @@ int sock_getsockopt(struct socket *sock, int level, int optname,
>                 v.val = sock_flag(sk, SOCK_BROADCAST);
>                 break;
>  
> +       case SO_TXHASH:
> +               v.val = sk->sk_txhash;
> +               break;
> +
>         case SO_SNDBUF:
>                 v.val = sk->sk_sndbuf;
>                 break;
> 


^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH net-next v2 0/4] ICMP flow improvements
  2019-10-29 13:50 [PATCH net-next v2 0/4] ICMP flow improvements Matteo Croce
                   ` (3 preceding siblings ...)
  2019-10-29 13:50 ` [PATCH net-next v2 4/4] bonding: balance ICMP echoes in layer3+4 mode Matteo Croce
@ 2019-10-31  0:21 ` David Miller
  4 siblings, 0 replies; 18+ messages in thread
From: David Miller @ 2019-10-31  0:21 UTC (permalink / raw)
  To: mcroce
  Cc: netdev, j.vosburgh, vfalico, andy, sdf, daniel, songliubraving,
	ast, paulb, linux-kernel

From: Matteo Croce <mcroce@redhat.com>
Date: Tue, 29 Oct 2019 14:50:49 +0100

> This series improves the flow inspector handling of ICMP packets:
> The first two patches just add some comments in the code which would have saved
> me a few minutes of time, and refactor a piece of code.
> The third one adds to the flow inspector the capability to extract the
> Identifier field, if present, so echo requests and replies are classified
> as part of the same flow.
> The fourth patch uses the function introduced earlier to the bonding driver,
> so echo replies can be balanced across bonding slaves.
> 
> v1 -> v2:
>  - remove unused struct members
>  - add an helper to check for the Id field
>  - use a local flow_dissector_key in the bonding to avoid
>    changing behaviour of the flow dissector

Series applied to net-next, thanks.

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH net-next v2 4/4] bonding: balance ICMP echoes in layer3+4 mode
  2019-10-29 23:19           ` Matteo Croce
@ 2019-10-31 16:22             ` Matteo Croce
  0 siblings, 0 replies; 18+ messages in thread
From: Matteo Croce @ 2019-10-31 16:22 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Nikolay Aleksandrov, netdev, Jay Vosburgh, Veaceslav Falico,
	Andy Gospodarek, David S . Miller, Stanislav Fomichev,
	Daniel Borkmann, Song Liu, Alexei Starovoitov, Paul Blakey, LKML

On Wed, Oct 30, 2019 at 12:19 AM Matteo Croce <mcroce@redhat.com> wrote:
>
> On Wed, Oct 30, 2019 at 12:14 AM Eric Dumazet <eric.dumazet@gmail.com> wrote:
> >
> >
> >
> > On 10/29/19 4:03 PM, Matteo Croce wrote:
> >
> > > Hi Eric,
> > >
> > > this would work for locally generated echoes, but what about forwarded packets?
> > > The point behind my changeset is to provide consistent results within
> > > a session by using the same path for request and response,
> > > but avoid all sessions flowing to the same path.
> > > This should resemble what happens with TCP and UDP: different
> > > connections, different port, probably a different path. And by doing
> > > this in the flow dissector, other applications could benefit it.
> >
> > In principle it is fine, but I was not sure of overall impact of your change
> > on performance for 99.9% of packets that are not ICMP :)
> >
>
> Good point. I didn't measure it (I will) but all the code additions
> are under some if (proto == ICMP) or similar.
> My guess is that performance shouldn't change for non ICMP traffic,
> but I'm curious to test it.
>

Indeed if there is some impact it's way below the measurement uncertainty.
I've bonded two veth pairs and added a tc drop to the peers, then
started mausezahn to generate UDP traffic.
Traffic is measured on the veth peers:

Stock 5.4-rc5:

rx: 261.5 Mbps 605.4 Kpps
rx: 261.2 Mbps 604.6 Kpps
rx: 261.6 Mbps 605.5 Kpps

patched:

rx: 261.4 Mbps 605.1 Kpps
rx: 261.1 Mbps 604.4 Kpps
rx: 260.3 Mbps 602.5 Kpps

perf top shows no significatn change in bond* and skb_flow* functions

Regards,
-- 
Matteo Croce
per aspera ad upstream


^ permalink raw reply	[flat|nested] 18+ messages in thread

end of thread, other threads:[~2019-10-31 16:23 UTC | newest]

Thread overview: 18+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2019-10-29 13:50 [PATCH net-next v2 0/4] ICMP flow improvements Matteo Croce
2019-10-29 13:50 ` [PATCH net-next v2 1/4] flow_dissector: add meaningful comments Matteo Croce
2019-10-29 13:50 ` [PATCH net-next v2 2/4] flow_dissector: skip the ICMP dissector for non ICMP packets Matteo Croce
2019-10-29 13:50 ` [PATCH net-next v2 3/4] flow_dissector: extract more ICMP information Matteo Croce
2019-10-29 13:50 ` [PATCH net-next v2 4/4] bonding: balance ICMP echoes in layer3+4 mode Matteo Croce
2019-10-29 18:35   ` Nikolay Aleksandrov
2019-10-29 18:41     ` Nikolay Aleksandrov
2019-10-29 19:45       ` Matteo Croce
2019-10-29 20:07         ` Nikolay Aleksandrov
2019-10-29 21:03     ` Eric Dumazet
2019-10-29 21:50       ` Nikolay Aleksandrov
2019-10-29 23:04         ` Eric Dumazet
2019-10-30 12:48           ` Nikolay Aleksandrov
2019-10-29 23:03       ` Matteo Croce
2019-10-29 23:14         ` Eric Dumazet
2019-10-29 23:19           ` Matteo Croce
2019-10-31 16:22             ` Matteo Croce
2019-10-31  0:21 ` [PATCH net-next v2 0/4] ICMP flow improvements David Miller

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).