netdev.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* [RFC PATCH net-next 0/3] sock: Fix sock queue mapping to include device
@ 2020-10-21 19:47 Harshitha Ramamurthy
  2020-10-21 19:47 ` [RFC PATCH net-next 1/3] sock: Definition and general functions for dev_and_queue structure Harshitha Ramamurthy
                   ` (3 more replies)
  0 siblings, 4 replies; 5+ messages in thread
From: Harshitha Ramamurthy @ 2020-10-21 19:47 UTC (permalink / raw)
  To: netdev, davem, kuba
  Cc: tom, carolyn.wyborny, jacob.e.keller, amritha.nambiar,
	Harshitha Ramamurthy

In XPS, the transmit queue selected for a packet is saved in the associated
sock for the packet and is then used to avoid recalculating the queue
on subsequent sends. The problem is that the corresponding device is not
also recorded so that when the queue mapping is referenced it may
correspond to a different device than the sending one, resulting in an
incorrect queue being used for transmit. Particularly with xps_rxqs, this
can lead to non-deterministic behaviour as illustrated below.

Consider a case where xps_rxqs is configured and there is a difference
in number of Tx and Rx queues. Suppose we have 2 devices A and B. Device A
has 0-7 queues and device B has 0-15 queues. Packets are transmitted from
Device A but packets are received on B. For packets received on queue 0-7
of Device B, xps_rxqs will be applied for reply packets to transmit on
Device A's queues 0-7. However, when packets are received on queues
8-15 of Device B, normal XPS is used to reply packets when transmitting
from Device A. This leads to non-deterministic behaviour. The case where
there are fewer receive queues is even more insidious. Consider Device
A, the trasmitting device has queues 0-15 and Device B, the receiver
has queues 0-7. With xps_rxqs enabled, the packets will be received only
on queues 0-7 of Device B, but sent only on 0-7 queues of Device A
thereby causing a load imbalance.

This patch set fixes the issue by recording both the device (via
ifindex) and the queue in the sock mapping. The pair is set and
retrieved atomically. While retrieving the queue using the get
functions, we check if the ifindex held is the same as the ifindex
stored before returning the queue held. For instance during transmit,
we return a valid queue number only after checking if the ifindex stored
matches the device currently held.

This patch set contains:
	- Definition of dev_and_queue structure to hold the ifindex
	  and queue number
	- Generic functions to get, set, and clear dev_and_queue
	  structure
	- Change sk_tx_queue_{get,set,clear} to
	  sk_tx_dev_and_queue_{get,set,clear}
	- Modify callers of above to use new interface
	- Change sk_rx_queue_{get,set,clear} to 
          sk_rx_dev_and_queue_{get,set,clear}
        - Modify callers of above to use new interface

This patch set was tested as follows:
	- XPS with both xps_cpus and xps_rxqs works as expected
	- the Q index is calculated only once when picking a tx queue
	  per connection. For ex: in netdev_pick_tx

Tom Herbert (3):
  sock: Definition and general functions for dev_and_queue structure
  sock: Use dev_and_queue structure for RX queue mapping in sock
  sock: Use dev_and_queue structure for TX queue mapping in sock

 .../mellanox/mlx5/core/en_accel/ktls_rx.c     |   6 +-
 drivers/net/hyperv/netvsc_drv.c               |   4 +-
 include/net/busy_poll.h                       |   2 +-
 include/net/request_sock.h                    |   2 +-
 include/net/sock.h                            | 107 ++++++++++++------
 net/core/dev.c                                |   6 +-
 net/core/filter.c                             |   7 +-
 net/core/sock.c                               |  10 +-
 net/ipv4/tcp_input.c                          |   2 +-
 9 files changed, 93 insertions(+), 53 deletions(-)

-- 
2.26.2


^ permalink raw reply	[flat|nested] 5+ messages in thread

* [RFC PATCH net-next 1/3] sock: Definition and general functions for dev_and_queue structure
  2020-10-21 19:47 [RFC PATCH net-next 0/3] sock: Fix sock queue mapping to include device Harshitha Ramamurthy
@ 2020-10-21 19:47 ` Harshitha Ramamurthy
  2020-10-21 19:47 ` [RFC PATCH net-next 2/3] sock: Use dev_and_queue structure for RX queue mapping in sock Harshitha Ramamurthy
                   ` (2 subsequent siblings)
  3 siblings, 0 replies; 5+ messages in thread
From: Harshitha Ramamurthy @ 2020-10-21 19:47 UTC (permalink / raw)
  To: netdev, davem, kuba
  Cc: tom, carolyn.wyborny, jacob.e.keller, amritha.nambiar,
	Harshitha Ramamurthy

From: Tom Herbert <tom@herbertland.com>

Add struct dev_and_queue which holds and ifindex and queue pair. Add
generic functions to get the queue for the ifindex held as well as
functions to set, clear the pair in a structure.

Signed-off-by: Tom Herbert <tom@herbertland.com>
Signed-off-by: Harshitha Ramamurthy <harshitha.ramamurthy@intel.com>
---
 include/net/sock.h | 52 ++++++++++++++++++++++++++++++++++++++++++++--
 1 file changed, 50 insertions(+), 2 deletions(-)

diff --git a/include/net/sock.h b/include/net/sock.h
index a5c6ae78df77..9755a6cab1a1 100644
--- a/include/net/sock.h
+++ b/include/net/sock.h
@@ -107,6 +107,16 @@ typedef struct {
 #endif
 } socket_lock_t;
 
+struct dev_and_queue {
+	union {
+		struct {
+			int ifindex;
+			u16 queue;
+		};
+		u64 val64;
+	};
+};
+
 struct sock;
 struct proto;
 struct net;
@@ -1791,6 +1801,46 @@ static inline int sk_receive_skb(struct sock *sk, struct sk_buff *skb,
 	return __sk_receive_skb(sk, skb, nested, 1, true);
 }
 
+#define NO_QUEUE_MAPPING	USHRT_MAX
+
+static inline int __dev_and_queue_get(const struct dev_and_queue *idandq,
+				       int ifindex)
+{
+	struct dev_and_queue dandq;
+
+	dandq.val64 = idandq->val64;
+
+	if (dandq.ifindex == ifindex && dandq.queue != NO_QUEUE_MAPPING)
+		return dandq.queue;
+
+	return -1;
+}
+
+static inline void __dev_and_queue_set(struct dev_and_queue *odandq,
+				       int ifindex, int queue)
+{
+	struct dev_and_queue dandq;
+
+	/* queue_mapping accept only upto a 16-bit value */
+	if (WARN_ON_ONCE((unsigned short)queue >= USHRT_MAX))
+		return;
+
+	dandq.ifindex = ifindex;
+	dandq.queue = queue;
+
+	odandq->val64 = dandq.val64;
+}
+
+static inline void __dev_and_queue_clear(struct dev_and_queue *odandq)
+{
+	struct dev_and_queue dandq;
+
+	dandq.ifindex = -1;
+	dandq.queue = NO_QUEUE_MAPPING;
+
+	odandq->val64 = dandq.val64;
+}
+
 static inline void sk_tx_queue_set(struct sock *sk, int tx_queue)
 {
 	/* sk_tx_queue_mapping accept only upto a 16-bit value */
@@ -1799,8 +1849,6 @@ static inline void sk_tx_queue_set(struct sock *sk, int tx_queue)
 	sk->sk_tx_queue_mapping = tx_queue;
 }
 
-#define NO_QUEUE_MAPPING	USHRT_MAX
-
 static inline void sk_tx_queue_clear(struct sock *sk)
 {
 	sk->sk_tx_queue_mapping = NO_QUEUE_MAPPING;
-- 
2.26.2


^ permalink raw reply related	[flat|nested] 5+ messages in thread

* [RFC PATCH net-next 2/3] sock: Use dev_and_queue structure for RX queue mapping in sock
  2020-10-21 19:47 [RFC PATCH net-next 0/3] sock: Fix sock queue mapping to include device Harshitha Ramamurthy
  2020-10-21 19:47 ` [RFC PATCH net-next 1/3] sock: Definition and general functions for dev_and_queue structure Harshitha Ramamurthy
@ 2020-10-21 19:47 ` Harshitha Ramamurthy
  2020-10-21 19:47 ` [RFC PATCH net-next 3/3] sock: Use dev_and_queue structure for TX " Harshitha Ramamurthy
  2020-10-22 17:38 ` [RFC PATCH net-next 0/3] sock: Fix sock queue mapping to include device Willem de Bruijn
  3 siblings, 0 replies; 5+ messages in thread
From: Harshitha Ramamurthy @ 2020-10-21 19:47 UTC (permalink / raw)
  To: netdev, davem, kuba
  Cc: tom, carolyn.wyborny, jacob.e.keller, amritha.nambiar,
	Harshitha Ramamurthy

From: Tom Herbert <tom@herbertland.com>

Replace sk_rx_queue_mapping with sk_rx_dev_and_queue_mapping and
change associated function to set, get, and clear mapping. This
patch ensures that the queue picked for transmit is correct by
setting the queue and ifindex and then retrieving the queue number
only if the ifindex matches the one stored.

Fixes: c6345ce7d361d ("net: Record receive queue number for a connection")
Signed-off-by: Tom Herbert <tom@herbertland.com>
Signed-off-by: Harshitha Ramamurthy <harshitha.ramamurthy@intel.com>
---
 .../mellanox/mlx5/core/en_accel/ktls_rx.c     |  6 ++--
 include/net/busy_poll.h                       |  2 +-
 include/net/sock.h                            | 30 ++++++++-----------
 net/core/dev.c                                |  2 +-
 net/core/filter.c                             |  7 +++--
 net/core/sock.c                               |  2 +-
 net/ipv4/tcp_input.c                          |  2 +-
 7 files changed, 24 insertions(+), 27 deletions(-)

diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_accel/ktls_rx.c b/drivers/net/ethernet/mellanox/mlx5/core/en_accel/ktls_rx.c
index ccaccb9fc2f7..fe0756a6ada4 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en_accel/ktls_rx.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en_accel/ktls_rx.c
@@ -548,9 +548,9 @@ void mlx5e_ktls_handle_ctx_completion(struct mlx5e_icosq_wqe_info *wi)
 	queue_work(rule->priv->tls->rx_wq, &rule->work);
 }
 
-static int mlx5e_ktls_sk_get_rxq(struct sock *sk)
+static int mlx5e_ktls_sk_get_rxq(struct sock *sk, struct net_device *dev)
 {
-	int rxq = sk_rx_queue_get(sk);
+	int rxq = sk_rx_dev_and_queue_get(sk, dev->ifindex);
 
 	if (unlikely(rxq == -1))
 		rxq = 0;
@@ -584,7 +584,7 @@ int mlx5e_ktls_add_rx(struct net_device *netdev, struct sock *sk,
 	priv_rx->crypto_info  =
 		*(struct tls12_crypto_info_aes_gcm_128 *)crypto_info;
 
-	rxq = mlx5e_ktls_sk_get_rxq(sk);
+	rxq = mlx5e_ktls_sk_get_rxq(sk, netdev);
 	priv_rx->rxq = rxq;
 	priv_rx->sk = sk;
 
diff --git a/include/net/busy_poll.h b/include/net/busy_poll.h
index b001fa91c14e..f04283c54bcd 100644
--- a/include/net/busy_poll.h
+++ b/include/net/busy_poll.h
@@ -128,7 +128,7 @@ static inline void sk_mark_napi_id(struct sock *sk, const struct sk_buff *skb)
 #ifdef CONFIG_NET_RX_BUSY_POLL
 	WRITE_ONCE(sk->sk_napi_id, skb->napi_id);
 #endif
-	sk_rx_queue_set(sk, skb);
+	sk_rx_dev_and_queue_set(sk, skb);
 }
 
 /* variant used for unconnected sockets */
diff --git a/include/net/sock.h b/include/net/sock.h
index 9755a6cab1a1..d47b310cf132 100644
--- a/include/net/sock.h
+++ b/include/net/sock.h
@@ -151,7 +151,7 @@ typedef __u64 __bitwise __addrpair;
  *	@skc_node: main hash linkage for various protocol lookup tables
  *	@skc_nulls_node: main hash linkage for TCP/UDP/UDP-Lite protocol
  *	@skc_tx_queue_mapping: tx queue number for this connection
- *	@skc_rx_queue_mapping: rx queue number for this connection
+ *	@skc_rx_dev_and_queue_mapping: rx ifindex/queue number for this connection
  *	@skc_flags: place holder for sk_flags
  *		%SO_LINGER (l_onoff), %SO_BROADCAST, %SO_KEEPALIVE,
  *		%SO_OOBINLINE settings, %SO_TIMESTAMPING settings
@@ -237,7 +237,7 @@ struct sock_common {
 	};
 	unsigned short		skc_tx_queue_mapping;
 #ifdef CONFIG_XPS
-	unsigned short		skc_rx_queue_mapping;
+	struct dev_and_queue	skc_rx_dev_and_queue_mapping;
 #endif
 	union {
 		int		skc_incoming_cpu;
@@ -365,7 +365,7 @@ struct sock {
 #define sk_refcnt		__sk_common.skc_refcnt
 #define sk_tx_queue_mapping	__sk_common.skc_tx_queue_mapping
 #ifdef CONFIG_XPS
-#define sk_rx_queue_mapping	__sk_common.skc_rx_queue_mapping
+#define sk_rx_dev_and_queue_mapping	__sk_common.skc_rx_dev_and_queue_mapping
 #endif
 
 #define sk_dontcopy_begin	__sk_common.skc_dontcopy_begin
@@ -1862,34 +1862,30 @@ static inline int sk_tx_queue_get(const struct sock *sk)
 	return -1;
 }
 
-static inline void sk_rx_queue_set(struct sock *sk, const struct sk_buff *skb)
+static inline void sk_rx_dev_and_queue_set(struct sock *sk,
+					   const struct sk_buff *skb)
 {
 #ifdef CONFIG_XPS
-	if (skb_rx_queue_recorded(skb)) {
-		u16 rx_queue = skb_get_rx_queue(skb);
+	if (skb->dev && skb_rx_queue_recorded(skb)) {
+		int ifindex = skb->dev->ifindex;
 
-		if (WARN_ON_ONCE(rx_queue == NO_QUEUE_MAPPING))
-			return;
-
-		sk->sk_rx_queue_mapping = rx_queue;
+		__dev_and_queue_set(&sk->sk_rx_dev_and_queue_mapping, ifindex,
+				    skb_get_rx_queue(skb));
 	}
 #endif
 }
 
-static inline void sk_rx_queue_clear(struct sock *sk)
+static inline void sk_rx_dev_and_queue_clear(struct sock *sk)
 {
 #ifdef CONFIG_XPS
-	sk->sk_rx_queue_mapping = NO_QUEUE_MAPPING;
+	__dev_and_queue_clear(&sk->sk_rx_dev_and_queue_mapping);
 #endif
 }
 
 #ifdef CONFIG_XPS
-static inline int sk_rx_queue_get(const struct sock *sk)
+static inline int sk_rx_dev_and_queue_get(const struct sock *sk, int ifindex)
 {
-	if (sk && sk->sk_rx_queue_mapping != NO_QUEUE_MAPPING)
-		return sk->sk_rx_queue_mapping;
-
-	return -1;
+	return sk ? __dev_and_queue_get(&sk->sk_tx_dev_and_queue_mapping, ifindex) : -1;
 }
 #endif
 
diff --git a/net/core/dev.c b/net/core/dev.c
index 751e5264fd49..83cd3ee801e8 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -3939,7 +3939,7 @@ static int get_xps_queue(struct net_device *dev, struct net_device *sb_dev,
 
 	dev_maps = rcu_dereference(sb_dev->xps_rxqs_map);
 	if (dev_maps) {
-		int tci = sk_rx_queue_get(sk);
+		int tci = sk_rx_dev_and_queue_get(sk, dev->ifindex);
 
 		if (tci >= 0 && tci < dev->num_rx_queues)
 			queue_index = __get_xps_queue_idx(dev, skb, dev_maps,
diff --git a/net/core/filter.c b/net/core/filter.c
index c5e2a1c5fd8d..13f2aa2f3e04 100644
--- a/net/core/filter.c
+++ b/net/core/filter.c
@@ -8749,11 +8749,12 @@ u32 bpf_sock_convert_ctx_access(enum bpf_access_type type,
 	case offsetof(struct bpf_sock, rx_queue_mapping):
 #ifdef CONFIG_XPS
 		*insn++ = BPF_LDX_MEM(
-			BPF_FIELD_SIZEOF(struct sock, sk_rx_queue_mapping),
+			BPF_FIELD_SIZEOF(struct sock,
+					 sk_rx_dev_and_queue_mapping),
 			si->dst_reg, si->src_reg,
-			bpf_target_off(struct sock, sk_rx_queue_mapping,
+			bpf_target_off(struct sock, sk_rx_dev_and_queue_mapping.queue,
 				       sizeof_field(struct sock,
-						    sk_rx_queue_mapping),
+						    sk_rx_dev_and_queue_mapping.queue),
 				       target_size));
 		*insn++ = BPF_JMP_IMM(BPF_JNE, si->dst_reg, NO_QUEUE_MAPPING,
 				      1);
diff --git a/net/core/sock.c b/net/core/sock.c
index 4e8729357122..1bb846672a34 100644
--- a/net/core/sock.c
+++ b/net/core/sock.c
@@ -3019,7 +3019,7 @@ void sock_init_data(struct socket *sock, struct sock *sk)
 	WRITE_ONCE(sk->sk_pacing_shift, 10);
 	sk->sk_incoming_cpu = -1;
 
-	sk_rx_queue_clear(sk);
+	sk_rx_dev_and_queue_clear(sk);
 	/*
 	 * Before updating sk_refcnt, we must commit prior changes to memory
 	 * (Documentation/RCU/rculist_nulls.rst for details)
diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
index 67f10d3ec240..c13c35ba55a7 100644
--- a/net/ipv4/tcp_input.c
+++ b/net/ipv4/tcp_input.c
@@ -6842,7 +6842,7 @@ int tcp_conn_request(struct request_sock_ops *rsk_ops,
 	tcp_rsk(req)->txhash = net_tx_rndhash();
 	tcp_rsk(req)->syn_tos = TCP_SKB_CB(skb)->ip_dsfield;
 	tcp_openreq_init_rwin(req, sk, dst);
-	sk_rx_queue_set(req_to_sk(req), skb);
+	sk_rx_dev_and_queue_set(req_to_sk(req), skb);
 	if (!want_cookie) {
 		tcp_reqsk_record_syn(sk, req, skb);
 		fastopen_sk = tcp_try_fastopen(sk, skb, req, &foc, dst);
-- 
2.26.2


^ permalink raw reply related	[flat|nested] 5+ messages in thread

* [RFC PATCH net-next 3/3] sock: Use dev_and_queue structure for TX queue mapping in sock
  2020-10-21 19:47 [RFC PATCH net-next 0/3] sock: Fix sock queue mapping to include device Harshitha Ramamurthy
  2020-10-21 19:47 ` [RFC PATCH net-next 1/3] sock: Definition and general functions for dev_and_queue structure Harshitha Ramamurthy
  2020-10-21 19:47 ` [RFC PATCH net-next 2/3] sock: Use dev_and_queue structure for RX queue mapping in sock Harshitha Ramamurthy
@ 2020-10-21 19:47 ` Harshitha Ramamurthy
  2020-10-22 17:38 ` [RFC PATCH net-next 0/3] sock: Fix sock queue mapping to include device Willem de Bruijn
  3 siblings, 0 replies; 5+ messages in thread
From: Harshitha Ramamurthy @ 2020-10-21 19:47 UTC (permalink / raw)
  To: netdev, davem, kuba
  Cc: tom, carolyn.wyborny, jacob.e.keller, amritha.nambiar,
	Harshitha Ramamurthy

From: Tom Herbert <tom@herbertland.com>

Replace sk_tx_queue_mapping with sk_tx_dev_and_queue_mapping and
change associated functions to set, get, and clear mapping. This
patch ensures that the queue picked for transmit is correct by
setting the queue and ifindex and then retriveing the queue number
only if the ifindex matches the one stored.

Signed-off-by: Tom Herbert <tom@herbertland.com>
Signed-off-by: Harshitha Ramamurthy <harshitha.ramamurthy@intel.com>
---
 drivers/net/hyperv/netvsc_drv.c |  4 ++--
 include/net/request_sock.h      |  2 +-
 include/net/sock.h              | 31 +++++++++++++------------------
 net/core/dev.c                  |  4 ++--
 net/core/sock.c                 |  8 ++++----
 5 files changed, 22 insertions(+), 27 deletions(-)

diff --git a/drivers/net/hyperv/netvsc_drv.c b/drivers/net/hyperv/netvsc_drv.c
index 261e6e55a907..9f8a4efa39c0 100644
--- a/drivers/net/hyperv/netvsc_drv.c
+++ b/drivers/net/hyperv/netvsc_drv.c
@@ -307,7 +307,7 @@ static inline int netvsc_get_tx_queue(struct net_device *ndev,
 	/* If queue index changed record the new value */
 	if (q_idx != old_idx &&
 	    sk && sk_fullsock(sk) && rcu_access_pointer(sk->sk_dst_cache))
-		sk_tx_queue_set(sk, q_idx);
+		sk_tx_dev_and_queue_set(sk, ndev->ifindex, q_idx);
 
 	return q_idx;
 }
@@ -325,7 +325,7 @@ static inline int netvsc_get_tx_queue(struct net_device *ndev,
  */
 static u16 netvsc_pick_tx(struct net_device *ndev, struct sk_buff *skb)
 {
-	int q_idx = sk_tx_queue_get(skb->sk);
+	int q_idx = sk_tx_dev_and_queue_get(skb->sk, ndev->ifindex);
 
 	if (q_idx < 0 || skb->ooo_okay || q_idx >= ndev->real_num_tx_queues) {
 		/* If forwarding a packet, we use the recorded queue when
diff --git a/include/net/request_sock.h b/include/net/request_sock.h
index 29e41ff3ec93..a663cc7c91b7 100644
--- a/include/net/request_sock.h
+++ b/include/net/request_sock.h
@@ -102,7 +102,7 @@ reqsk_alloc(const struct request_sock_ops *ops, struct sock *sk_listener,
 	req->rsk_ops = ops;
 	req_to_sk(req)->sk_prot = sk_listener->sk_prot;
 	sk_node_init(&req_to_sk(req)->sk_node);
-	sk_tx_queue_clear(req_to_sk(req));
+	sk_tx_dev_and_queue_clear(req_to_sk(req));
 	req->saved_syn = NULL;
 	req->num_timeout = 0;
 	req->num_retrans = 0;
diff --git a/include/net/sock.h b/include/net/sock.h
index d47b310cf132..4c469bbb47e4 100644
--- a/include/net/sock.h
+++ b/include/net/sock.h
@@ -150,7 +150,7 @@ typedef __u64 __bitwise __addrpair;
  *	@skc_cookie: socket's cookie value
  *	@skc_node: main hash linkage for various protocol lookup tables
  *	@skc_nulls_node: main hash linkage for TCP/UDP/UDP-Lite protocol
- *	@skc_tx_queue_mapping: tx queue number for this connection
+ *	@skc_tx_dev_and_queue_mapping: tx ifindex/queue number for this connection
  *	@skc_rx_dev_and_queue_mapping: rx ifindex/queue number for this connection
  *	@skc_flags: place holder for sk_flags
  *		%SO_LINGER (l_onoff), %SO_BROADCAST, %SO_KEEPALIVE,
@@ -235,7 +235,7 @@ struct sock_common {
 		struct hlist_node	skc_node;
 		struct hlist_nulls_node skc_nulls_node;
 	};
-	unsigned short		skc_tx_queue_mapping;
+	struct dev_and_queue	skc_tx_dev_and_queue_mapping;
 #ifdef CONFIG_XPS
 	struct dev_and_queue	skc_rx_dev_and_queue_mapping;
 #endif
@@ -363,7 +363,7 @@ struct sock {
 #define sk_node			__sk_common.skc_node
 #define sk_nulls_node		__sk_common.skc_nulls_node
 #define sk_refcnt		__sk_common.skc_refcnt
-#define sk_tx_queue_mapping	__sk_common.skc_tx_queue_mapping
+#define sk_tx_dev_and_queue_mapping	__sk_common.skc_tx_dev_and_queue_mapping
 #ifdef CONFIG_XPS
 #define sk_rx_dev_and_queue_mapping	__sk_common.skc_rx_dev_and_queue_mapping
 #endif
@@ -1841,25 +1841,20 @@ static inline void __dev_and_queue_clear(struct dev_and_queue *odandq)
 	odandq->val64 = dandq.val64;
 }
 
-static inline void sk_tx_queue_set(struct sock *sk, int tx_queue)
+static inline void sk_tx_dev_and_queue_set(struct sock *sk, int ifindex,
+					   int tx_queue)
 {
-	/* sk_tx_queue_mapping accept only upto a 16-bit value */
-	if (WARN_ON_ONCE((unsigned short)tx_queue >= USHRT_MAX))
-		return;
-	sk->sk_tx_queue_mapping = tx_queue;
+	__dev_and_queue_set(&sk->sk_tx_dev_and_queue_mapping, ifindex, tx_queue);
 }
 
-static inline void sk_tx_queue_clear(struct sock *sk)
+static inline void sk_tx_dev_and_queue_clear(struct sock *sk)
 {
-	sk->sk_tx_queue_mapping = NO_QUEUE_MAPPING;
+	__dev_and_queue_clear(&sk->sk_tx_dev_and_queue_mapping);
 }
 
-static inline int sk_tx_queue_get(const struct sock *sk)
+static inline int sk_tx_dev_and_queue_get(const struct sock *sk, int ifindex)
 {
-	if (sk && sk->sk_tx_queue_mapping != NO_QUEUE_MAPPING)
-		return sk->sk_tx_queue_mapping;
-
-	return -1;
+	return sk ? __dev_and_queue_get(&sk->sk_tx_dev_and_queue_mapping, ifindex) : -1;
 }
 
 static inline void sk_rx_dev_and_queue_set(struct sock *sk,
@@ -1984,7 +1979,7 @@ static inline void dst_negative_advice(struct sock *sk)
 
 		if (ndst != dst) {
 			rcu_assign_pointer(sk->sk_dst_cache, ndst);
-			sk_tx_queue_clear(sk);
+			sk_tx_dev_and_queue_clear(sk);
 			sk->sk_dst_pending_confirm = 0;
 		}
 	}
@@ -1995,7 +1990,7 @@ __sk_dst_set(struct sock *sk, struct dst_entry *dst)
 {
 	struct dst_entry *old_dst;
 
-	sk_tx_queue_clear(sk);
+	sk_tx_dev_and_queue_clear(sk);
 	sk->sk_dst_pending_confirm = 0;
 	old_dst = rcu_dereference_protected(sk->sk_dst_cache,
 					    lockdep_sock_is_held(sk));
@@ -2008,7 +2003,7 @@ sk_dst_set(struct sock *sk, struct dst_entry *dst)
 {
 	struct dst_entry *old_dst;
 
-	sk_tx_queue_clear(sk);
+	sk_tx_dev_and_queue_clear(sk);
 	sk->sk_dst_pending_confirm = 0;
 	old_dst = xchg((__force struct dst_entry **)&sk->sk_dst_cache, dst);
 	dst_release(old_dst);
diff --git a/net/core/dev.c b/net/core/dev.c
index 83cd3ee801e8..6146e04eb965 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -3982,7 +3982,7 @@ u16 netdev_pick_tx(struct net_device *dev, struct sk_buff *skb,
 		     struct net_device *sb_dev)
 {
 	struct sock *sk = skb->sk;
-	int queue_index = sk_tx_queue_get(sk);
+	int queue_index = sk_tx_dev_and_queue_get(sk, dev->ifindex);
 
 	sb_dev = sb_dev ? : dev;
 
@@ -3996,7 +3996,7 @@ u16 netdev_pick_tx(struct net_device *dev, struct sk_buff *skb,
 		if (queue_index != new_index && sk &&
 		    sk_fullsock(sk) &&
 		    rcu_access_pointer(sk->sk_dst_cache))
-			sk_tx_queue_set(sk, new_index);
+			sk_tx_dev_and_queue_set(sk, dev->ifindex, new_index);
 
 		queue_index = new_index;
 	}
diff --git a/net/core/sock.c b/net/core/sock.c
index 1bb846672a34..93489966c749 100644
--- a/net/core/sock.c
+++ b/net/core/sock.c
@@ -531,7 +531,7 @@ struct dst_entry *__sk_dst_check(struct sock *sk, u32 cookie)
 	struct dst_entry *dst = __sk_dst_get(sk);
 
 	if (dst && dst->obsolete && dst->ops->check(dst, cookie) == NULL) {
-		sk_tx_queue_clear(sk);
+		sk_tx_dev_and_queue_clear(sk);
 		sk->sk_dst_pending_confirm = 0;
 		RCU_INIT_POINTER(sk->sk_dst_cache, NULL);
 		dst_release(dst);
@@ -1671,7 +1671,7 @@ static struct sock *sk_prot_alloc(struct proto *prot, gfp_t priority,
 
 		if (!try_module_get(prot->owner))
 			goto out_free_sec;
-		sk_tx_queue_clear(sk);
+		sk_tx_dev_and_queue_clear(sk);
 	}
 
 	return sk;
@@ -1740,7 +1740,7 @@ struct sock *sk_alloc(struct net *net, int family, gfp_t priority,
 		cgroup_sk_alloc(&sk->sk_cgrp_data);
 		sock_update_classid(&sk->sk_cgrp_data);
 		sock_update_netprioidx(&sk->sk_cgrp_data);
-		sk_tx_queue_clear(sk);
+		sk_tx_dev_and_queue_clear(sk);
 	}
 
 	return sk;
@@ -1964,7 +1964,7 @@ struct sock *sk_clone_lock(const struct sock *sk, const gfp_t priority)
 		 */
 		sk_refcnt_debug_inc(newsk);
 		sk_set_socket(newsk, NULL);
-		sk_tx_queue_clear(newsk);
+		sk_tx_dev_and_queue_clear(newsk);
 		RCU_INIT_POINTER(newsk->sk_wq, NULL);
 
 		if (newsk->sk_prot->sockets_allocated)
-- 
2.26.2


^ permalink raw reply related	[flat|nested] 5+ messages in thread

* Re: [RFC PATCH net-next 0/3] sock: Fix sock queue mapping to include device
  2020-10-21 19:47 [RFC PATCH net-next 0/3] sock: Fix sock queue mapping to include device Harshitha Ramamurthy
                   ` (2 preceding siblings ...)
  2020-10-21 19:47 ` [RFC PATCH net-next 3/3] sock: Use dev_and_queue structure for TX " Harshitha Ramamurthy
@ 2020-10-22 17:38 ` Willem de Bruijn
  3 siblings, 0 replies; 5+ messages in thread
From: Willem de Bruijn @ 2020-10-22 17:38 UTC (permalink / raw)
  To: Harshitha Ramamurthy
  Cc: Network Development, David Miller, Jakub Kicinski, Tom Herbert,
	carolyn.wyborny, Keller, Jacob E, amritha.nambiar

On Wed, Oct 21, 2020 at 3:51 PM Harshitha Ramamurthy
<harshitha.ramamurthy@intel.com> wrote:
>
> In XPS, the transmit queue selected for a packet is saved in the associated
> sock for the packet and is then used to avoid recalculating the queue
> on subsequent sends. The problem is that the corresponding device is not
> also recorded so that when the queue mapping is referenced it may
> correspond to a different device than the sending one, resulting in an
> incorrect queue being used for transmit. Particularly with xps_rxqs, this
> can lead to non-deterministic behaviour as illustrated below.
>
> Consider a case where xps_rxqs is configured and there is a difference
> in number of Tx and Rx queues. Suppose we have 2 devices A and B. Device A
> has 0-7 queues and device B has 0-15 queues. Packets are transmitted from
> Device A but packets are received on B. For packets received on queue 0-7
> of Device B, xps_rxqs will be applied for reply packets to transmit on
> Device A's queues 0-7. However, when packets are received on queues
> 8-15 of Device B, normal XPS is used to reply packets when transmitting
> from Device A. This leads to non-deterministic behaviour. The case where
> there are fewer receive queues is even more insidious. Consider Device
> A, the trasmitting device has queues 0-15 and Device B, the receiver
> has queues 0-7. With xps_rxqs enabled, the packets will be received only
> on queues 0-7 of Device B, but sent only on 0-7 queues of Device A
> thereby causing a load imbalance.

So the issue is limited to xps_rxqs with multiple nics.

When do we need sk_tx_dev_and_queue_mapping (patch 3/3)? It is used in
netdev_pick_tx, but associations are reset on route change and
recomputed if queue_index would exceed the current device queue count.

> This patch set fixes the issue by recording both the device (via
> ifindex) and the queue in the sock mapping. The pair is set and
> retrieved atomically.

I guess this is the reason for the somewhat convoluted cast to u64
logic in patch 1/3. Is the assumption that 64-bit loads and stores are
atomic on all platforms? That is not correct.

Is atomicity even needed? For the purpose of load balancing it isn't.
Just adding a sk->rx_ifindex would be a lot simpler.

sk->sk_napi_id already uniquely identifies the device. Unfortunately,
dev_get_by_napi_id is not cheap (traverses a hashtable bucket). Though
purely for the purpose of load balancing this validation could be
sample based.

The rx ifindex is also already recorded for inet sockets in
rx_dst_ifindex, and the sk_rx_queue_get functions are limited to
those, so could conceivably use that. But it is derived from skb_iif,
which is overwritten with every reentry of __netif_receive_skb_core.

^ permalink raw reply	[flat|nested] 5+ messages in thread

end of thread, other threads:[~2020-10-22 17:38 UTC | newest]

Thread overview: 5+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2020-10-21 19:47 [RFC PATCH net-next 0/3] sock: Fix sock queue mapping to include device Harshitha Ramamurthy
2020-10-21 19:47 ` [RFC PATCH net-next 1/3] sock: Definition and general functions for dev_and_queue structure Harshitha Ramamurthy
2020-10-21 19:47 ` [RFC PATCH net-next 2/3] sock: Use dev_and_queue structure for RX queue mapping in sock Harshitha Ramamurthy
2020-10-21 19:47 ` [RFC PATCH net-next 3/3] sock: Use dev_and_queue structure for TX " Harshitha Ramamurthy
2020-10-22 17:38 ` [RFC PATCH net-next 0/3] sock: Fix sock queue mapping to include device Willem de Bruijn

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).