* [PATCH net 0/2] Prevent extreme TSO parameters from stalling TX queues
@ 2012-07-30 17:14 Ben Hutchings
  2012-07-30 17:16 ` [PATCH net 1/2] tcp: Limit number of segments generated by GSO per skb Ben Hutchings
  2012-07-30 17:17 ` [PATCH net 2/2] sfc: Correct the minimum TX queue size Ben Hutchings
  0 siblings, 2 replies; 13+ messages in thread
From: Ben Hutchings @ 2012-07-30 17:14 UTC (permalink / raw)
  To: David Miller; +Cc: netdev, linux-net-drivers

The following changes fix a potential DoS by peers or local users on
network interfaces using the sfc driver (and possibly others) with TSO
enabled (as it is by default).  Please apply them to the net tree and
your stable update queue.

Ben.

Ben Hutchings (2):
  tcp: Limit number of segments generated by GSO per skb
  sfc: Correct the minimum TX queue size

 drivers/net/ethernet/sfc/efx.c     |    5 +++++
 drivers/net/ethernet/sfc/efx.h     |   11 +++++++----
 drivers/net/ethernet/sfc/ethtool.c |   16 +++++++++++-----
 drivers/net/ethernet/sfc/tx.c      |   20 ++++++++++++++++++++
 include/net/tcp.h                  |    3 +++
 net/ipv4/tcp.c                     |    4 +++-
 net/ipv4/tcp_output.c              |   17 +++++++++--------
 7 files changed, 58 insertions(+), 18 deletions(-)

-- 
1.7.7.6


-- 
Ben Hutchings, Staff Engineer, Solarflare
Not speaking for my employer; that's the marketing department's job.
They asked us to note that Solarflare product names are trademarked.


* [PATCH net 1/2] tcp: Limit number of segments generated by GSO per skb
  2012-07-30 17:14 [PATCH net 0/2] Prevent extreme TSO parameters from stalling TX queues Ben Hutchings
@ 2012-07-30 17:16 ` Ben Hutchings
  2012-07-30 17:23   ` Ben Greear
  2012-07-30 17:31   ` Eric Dumazet
  2012-07-30 17:17 ` [PATCH net 2/2] sfc: Correct the minimum TX queue size Ben Hutchings
  1 sibling, 2 replies; 13+ messages in thread
From: Ben Hutchings @ 2012-07-30 17:16 UTC (permalink / raw)
  To: David Miller; +Cc: netdev, linux-net-drivers

A peer (or local user) may cause TCP to use a nominal MSS of as little
as 88 (actual MSS of 76 with timestamps).  Given that we have a
sufficiently prodigious local sender and the peer ACKs quickly enough,
it is nevertheless possible to grow the window for such a connection
to the point that we will try to send just under 64K at once.  This
results in a single skb that expands to 861 segments.

In some drivers with TSO support, such an skb will require hundreds of
DMA descriptors; a substantial fraction of a TX ring or even more than
a full ring.  The TX queue selected for the skb may stall and trigger
the TX watchdog repeatedly (since the problem skb will be retried
after the TX reset).  This particularly affects sfc, for which the
issue is designated as CVE-2012-3412.  However it may be that some
hardware or firmware also fails to handle such an extreme TSO request
correctly.

Therefore, limit the number of segments per skb to 100.  This should
make no difference to behaviour unless the actual MSS is less than
about 700.
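
As a rough sanity check of those figures (a sketch only, assuming the
usual 64K ceiling on GSO payload per skb):

#include <stdio.h>

int main(void)
{
	const unsigned int gso_max_payload = 65535;	/* "just under 64K" */
	const unsigned int min_mss = 76;	/* 88 minus 12 bytes of timestamp options */
	const unsigned int max_segs = 100;	/* TCP_MAX_GSO_SEGS */

	/* roughly the 861 segments quoted above */
	printf("segments at minimum MSS: %u\n", gso_max_payload / min_mss);

	/* the cap only changes behaviour below about this MSS */
	printf("MSS below which the cap matters: %u\n", gso_max_payload / max_segs);
	return 0;
}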

Signed-off-by: Ben Hutchings <bhutchings@solarflare.com>
---
 include/net/tcp.h     |    3 +++
 net/ipv4/tcp.c        |    4 +++-
 net/ipv4/tcp_output.c |   17 +++++++++--------
 3 files changed, 15 insertions(+), 9 deletions(-)

diff --git a/include/net/tcp.h b/include/net/tcp.h
index e19124b..098a2d0 100644
--- a/include/net/tcp.h
+++ b/include/net/tcp.h
@@ -70,6 +70,9 @@ extern void tcp_time_wait(struct sock *sk, int state, int timeo);
 /* The least MTU to use for probing */
 #define TCP_BASE_MSS		512
 
+/* Maximum number of segments we may require GSO to generate from an skb. */
+#define TCP_MAX_GSO_SEGS	100
+
 /* After receiving this amount of duplicate ACKs fast retransmit starts. */
 #define TCP_FASTRETRANS_THRESH 3
 
diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
index e7e6eea..51d8daf 100644
--- a/net/ipv4/tcp.c
+++ b/net/ipv4/tcp.c
@@ -811,7 +811,9 @@ static unsigned int tcp_xmit_size_goal(struct sock *sk, u32 mss_now,
 			   old_size_goal + mss_now > xmit_size_goal)) {
 			xmit_size_goal = old_size_goal;
 		} else {
-			tp->xmit_size_goal_segs = xmit_size_goal / mss_now;
+			tp->xmit_size_goal_segs =
+				min_t(u32, xmit_size_goal / mss_now,
+				      TCP_MAX_GSO_SEGS);
 			xmit_size_goal = tp->xmit_size_goal_segs * mss_now;
 		}
 	}
diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c
index 33cd065..c86c288 100644
--- a/net/ipv4/tcp_output.c
+++ b/net/ipv4/tcp_output.c
@@ -1522,21 +1522,21 @@ static void tcp_cwnd_validate(struct sock *sk)
  * when we would be allowed to send the split-due-to-Nagle skb fully.
  */
 static unsigned int tcp_mss_split_point(const struct sock *sk, const struct sk_buff *skb,
-					unsigned int mss_now, unsigned int cwnd)
+					unsigned int mss_now, unsigned int max_segs)
 {
 	const struct tcp_sock *tp = tcp_sk(sk);
-	u32 needed, window, cwnd_len;
+	u32 needed, window, max_len;
 
 	window = tcp_wnd_end(tp) - TCP_SKB_CB(skb)->seq;
-	cwnd_len = mss_now * cwnd;
+	max_len = mss_now * max_segs;
 
-	if (likely(cwnd_len <= window && skb != tcp_write_queue_tail(sk)))
-		return cwnd_len;
+	if (likely(max_len <= window && skb != tcp_write_queue_tail(sk)))
+		return max_len;
 
 	needed = min(skb->len, window);
 
-	if (cwnd_len <= needed)
-		return cwnd_len;
+	if (max_len <= needed)
+		return max_len;
 
 	return needed - needed % mss_now;
 }
@@ -1999,7 +1999,8 @@ static bool tcp_write_xmit(struct sock *sk, unsigned int mss_now, int nonagle,
 		limit = mss_now;
 		if (tso_segs > 1 && !tcp_urg_mode(tp))
 			limit = tcp_mss_split_point(sk, skb, mss_now,
-						    cwnd_quota);
+						    min(cwnd_quota,
+							TCP_MAX_GSO_SEGS));
 
 		if (skb->len > limit &&
 		    unlikely(tso_fragment(sk, skb, limit, mss_now, gfp)))
-- 
1.7.7.6



-- 
Ben Hutchings, Staff Engineer, Solarflare
Not speaking for my employer; that's the marketing department's job.
They asked us to note that Solarflare product names are trademarked.


* [PATCH net 2/2] sfc: Correct the minimum TX queue size
  2012-07-30 17:14 [PATCH net 0/2] Prevent extreme TSO parameters from stalling TX queues Ben Hutchings
  2012-07-30 17:16 ` [PATCH net 1/2] tcp: Limit number of segments generated by GSO per skb Ben Hutchings
@ 2012-07-30 17:17 ` Ben Hutchings
  1 sibling, 0 replies; 13+ messages in thread
From: Ben Hutchings @ 2012-07-30 17:17 UTC (permalink / raw)
  To: David Miller; +Cc: netdev, linux-net-drivers

Currently an skb requiring TSO may not fit within a minimum-size TX
queue.  The TX queue selected for the skb may stall and trigger the TX
watchdog repeatedly (since the problem skb will be retried after the
TX reset).  This issue is designated as CVE-2012-3412.

The preceding change 'tcp: Limit number of segments generated by GSO
per skb' fixed this for the default queue size.  Increase the minimum
to allow for 2 worst-case skbs, such that there will definitely be
space to add an skb after we wake a queue.

To avoid invalidating existing configurations, change
efx_ethtool_set_ringparam() to fix up values that are too small rather
than returning -EINVAL.
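
For a rough idea of where this puts the minimum (a sketch only: it
assumes 4K pages, so MAX_SKB_FRAGS is 17, and ignores the PCIe
page-boundary term, which is zero when PAGE_SIZE == EFX_PAGE_SIZE):

#include <stdio.h>

int main(void)
{
	const unsigned int tcp_max_gso_segs = 100;	/* from patch 1/2 */
	const unsigned int max_skb_frags = 17;		/* typical with 4K pages */

	/* header + payload descriptor per segment, plus one per input
	 * fragment boundary, as in efx_tx_max_skb_descs() below */
	unsigned int max_descs = tcp_max_gso_segs * 2 + max_skb_frags;

	printf("worst-case descriptors per TSO skb: %u\n", max_descs);
	printf("minimum TX queue entries (2 skbs):  %u\n", 2 * max_descs);
	/* prints 217 and 434; with the alignment workaround the figures
	 * become 317 and 634.  The new check in efx_probe_all() verifies
	 * that the default ring size still covers the minimum. */
	return 0;
}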

Signed-off-by: Ben Hutchings <bhutchings@solarflare.com>
---
 drivers/net/ethernet/sfc/efx.c     |    5 +++++
 drivers/net/ethernet/sfc/efx.h     |   11 +++++++----
 drivers/net/ethernet/sfc/ethtool.c |   16 +++++++++++-----
 drivers/net/ethernet/sfc/tx.c      |   20 ++++++++++++++++++++
 4 files changed, 43 insertions(+), 9 deletions(-)

diff --git a/drivers/net/ethernet/sfc/efx.c b/drivers/net/ethernet/sfc/efx.c
index 70554a1..a5b5a75 100644
--- a/drivers/net/ethernet/sfc/efx.c
+++ b/drivers/net/ethernet/sfc/efx.c
@@ -1503,6 +1503,11 @@ static int efx_probe_all(struct efx_nic *efx)
 		goto fail2;
 	}
 
+	BUILD_BUG_ON(EFX_DEFAULT_DMAQ_SIZE < EFX_RXQ_MIN_ENT);
+	if (WARN_ON(EFX_DEFAULT_DMAQ_SIZE < EFX_TXQ_MIN_ENT(efx))) {
+		rc = -EINVAL;
+		goto fail3;
+	}
 	efx->rxq_entries = efx->txq_entries = EFX_DEFAULT_DMAQ_SIZE;
 
 	rc = efx_probe_filters(efx);
diff --git a/drivers/net/ethernet/sfc/efx.h b/drivers/net/ethernet/sfc/efx.h
index be8f915..d95309c 100644
--- a/drivers/net/ethernet/sfc/efx.h
+++ b/drivers/net/ethernet/sfc/efx.h
@@ -30,6 +30,7 @@ extern netdev_tx_t
 efx_enqueue_skb(struct efx_tx_queue *tx_queue, struct sk_buff *skb);
 extern void efx_xmit_done(struct efx_tx_queue *tx_queue, unsigned int index);
 extern int efx_setup_tc(struct net_device *net_dev, u8 num_tc);
+extern unsigned int efx_tx_max_skb_descs(struct efx_nic *efx);
 
 /* RX */
 extern int efx_probe_rx_queue(struct efx_rx_queue *rx_queue);
@@ -52,10 +53,12 @@ extern void efx_schedule_slow_fill(struct efx_rx_queue *rx_queue);
 #define EFX_MAX_EVQ_SIZE 16384UL
 #define EFX_MIN_EVQ_SIZE 512UL
 
-/* The smallest [rt]xq_entries that the driver supports. Callers of
- * efx_wake_queue() assume that they can subsequently send at least one
- * skb. Falcon/A1 may require up to three descriptors per skb_frag. */
-#define EFX_MIN_RING_SIZE (roundup_pow_of_two(2 * 3 * MAX_SKB_FRAGS))
+/* The smallest [rt]xq_entries that the driver supports.  RX minimum
+ * is a bit arbitrary.  For TX, we must have space for at least 2
+ * TSO skbs.
+ */
+#define EFX_RXQ_MIN_ENT		128U
+#define EFX_TXQ_MIN_ENT(efx)	(2 * efx_tx_max_skb_descs(efx))
 
 /* Filters */
 extern int efx_probe_filters(struct efx_nic *efx);
diff --git a/drivers/net/ethernet/sfc/ethtool.c b/drivers/net/ethernet/sfc/ethtool.c
index 10536f9..8cba2df 100644
--- a/drivers/net/ethernet/sfc/ethtool.c
+++ b/drivers/net/ethernet/sfc/ethtool.c
@@ -680,21 +680,27 @@ static int efx_ethtool_set_ringparam(struct net_device *net_dev,
 				     struct ethtool_ringparam *ring)
 {
 	struct efx_nic *efx = netdev_priv(net_dev);
+	u32 txq_entries;
 
 	if (ring->rx_mini_pending || ring->rx_jumbo_pending ||
 	    ring->rx_pending > EFX_MAX_DMAQ_SIZE ||
 	    ring->tx_pending > EFX_MAX_DMAQ_SIZE)
 		return -EINVAL;
 
-	if (ring->rx_pending < EFX_MIN_RING_SIZE ||
-	    ring->tx_pending < EFX_MIN_RING_SIZE) {
+	if (ring->rx_pending < EFX_RXQ_MIN_ENT) {
 		netif_err(efx, drv, efx->net_dev,
-			  "TX and RX queues cannot be smaller than %ld\n",
-			  EFX_MIN_RING_SIZE);
+			  "RX queues cannot be smaller than %u\n",
+			  EFX_RXQ_MIN_ENT);
 		return -EINVAL;
 	}
 
-	return efx_realloc_channels(efx, ring->rx_pending, ring->tx_pending);
+	txq_entries = max(ring->tx_pending, EFX_TXQ_MIN_ENT(efx));
+	if (txq_entries != ring->tx_pending)
+		netif_warn(efx, drv, efx->net_dev,
+			   "increasing TX queue size to minimum of %u\n",
+			   txq_entries);
+
+	return efx_realloc_channels(efx, ring->rx_pending, txq_entries);
 }
 
 static int efx_ethtool_set_pauseparam(struct net_device *net_dev,
diff --git a/drivers/net/ethernet/sfc/tx.c b/drivers/net/ethernet/sfc/tx.c
index 9b225a7..814bd17 100644
--- a/drivers/net/ethernet/sfc/tx.c
+++ b/drivers/net/ethernet/sfc/tx.c
@@ -15,6 +15,7 @@
 #include <linux/ipv6.h>
 #include <linux/slab.h>
 #include <net/ipv6.h>
+#include <net/tcp.h>
 #include <linux/if_ether.h>
 #include <linux/highmem.h>
 #include "net_driver.h"
@@ -119,6 +120,25 @@ efx_max_tx_len(struct efx_nic *efx, dma_addr_t dma_addr)
 	return len;
 }
 
+unsigned int efx_tx_max_skb_descs(struct efx_nic *efx)
+{
+	/* Header and payload descriptor for each output segment, plus
+	 * one for every input fragment boundary within a segment
+	 */
+	unsigned int max_descs = TCP_MAX_GSO_SEGS * 2 + MAX_SKB_FRAGS;
+
+	/* Possibly one more per segment for the alignment workaround */
+	if (EFX_WORKAROUND_5391(efx))
+		max_descs += TCP_MAX_GSO_SEGS;
+
+	/* Possibly more for PCIe page boundaries within input fragments */
+	if (PAGE_SIZE > EFX_PAGE_SIZE)
+		max_descs += max_t(unsigned int, MAX_SKB_FRAGS,
+				   DIV_ROUND_UP(GSO_MAX_SIZE, EFX_PAGE_SIZE));
+
+	return max_descs;
+}
+
 /*
  * Add a socket buffer to a TX queue
  *
-- 
1.7.7.6


-- 
Ben Hutchings, Staff Engineer, Solarflare
Not speaking for my employer; that's the marketing department's job.
They asked us to note that Solarflare product names are trademarked.


* Re: [PATCH net 1/2] tcp: Limit number of segments generated by GSO per skb
  2012-07-30 17:16 ` [PATCH net 1/2] tcp: Limit number of segments generated by GSO per skb Ben Hutchings
@ 2012-07-30 17:23   ` Ben Greear
  2012-07-30 19:41     ` Ben Hutchings
  2012-07-30 17:31   ` Eric Dumazet
  1 sibling, 1 reply; 13+ messages in thread
From: Ben Greear @ 2012-07-30 17:23 UTC (permalink / raw)
  To: Ben Hutchings; +Cc: David Miller, netdev, linux-net-drivers

On 07/30/2012 10:16 AM, Ben Hutchings wrote:
> A peer (or local user) may cause TCP to use a nominal MSS of as little
> as 88 (actual MSS of 76 with timestamps).  Given that we have a
> sufficiently prodigious local sender and the peer ACKs quickly enough,
> it is nevertheless possible to grow the window for such a connection
> to the point that we will try to send just under 64K at once.  This
> results in a single skb that expands to 861 segments.
>
> In some drivers with TSO support, such an skb will require hundreds of
> DMA descriptors; a substantial fraction of a TX ring or even more than
> a full ring.  The TX queue selected for the skb may stall and trigger
> the TX watchdog repeatedly (since the problem skb will be retried
> after the TX reset).  This particularly affects sfc, for which the
> issue is designated as CVE-2012-3412.  However it may be that some
> hardware or firmware also fails to handle such an extreme TSO request
> correctly.
>
> Therefore, limit the number of segments per skb to 100.  This should
> make no difference to behaviour unless the actual MSS is less than
> about 700.

Please do not do this... or at least allow overrides.  We love
the trick of setting a very small MSS and making the NICs generate
huge numbers of small TCP frames with efficient user-space
logic.  We use this for stateful TCP load testing when high
numbers of TCP packets per second are desired.
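
(For anyone unfamiliar with the trick: the usual way to do it from user
space is the TCP_MAXSEG socket option on a not-yet-connected socket,
roughly as below; the stack may still clamp or round the value.)

#include <netinet/in.h>
#include <netinet/tcp.h>
#include <sys/socket.h>

/* Ask for a deliberately tiny MSS before connect() */
static int set_tiny_mss(int fd)
{
	int mss = 100;	/* the stack may adjust this further */

	return setsockopt(fd, IPPROTO_TCP, TCP_MAXSEG, &mss, sizeof(mss));
}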

Intel NICs, including 10G, work just fine with minimal MSS
in this scenario.

Thanks,
Ben



-- 
Ben Greear <greearb@candelatech.com>
Candela Technologies Inc  http://www.candelatech.com


* Re: [PATCH net 1/2] tcp: Limit number of segments generated by GSO per skb
  2012-07-30 17:16 ` [PATCH net 1/2] tcp: Limit number of segments generated by GSO per skb Ben Hutchings
  2012-07-30 17:23   ` Ben Greear
@ 2012-07-30 17:31   ` Eric Dumazet
  2012-07-30 19:35     ` Ben Hutchings
  1 sibling, 1 reply; 13+ messages in thread
From: Eric Dumazet @ 2012-07-30 17:31 UTC (permalink / raw)
  To: Ben Hutchings; +Cc: David Miller, netdev, linux-net-drivers

On Mon, 2012-07-30 at 18:16 +0100, Ben Hutchings wrote:
> A peer (or local user) may cause TCP to use a nominal MSS of as little
> as 88 (actual MSS of 76 with timestamps).  Given that we have a
> sufficiently prodigious local sender and the peer ACKs quickly enough,
> it is nevertheless possible to grow the window for such a connection
> to the point that we will try to send just under 64K at once.  This
> results in a single skb that expands to 861 segments.
> 
> In some drivers with TSO support, such an skb will require hundreds of
> DMA descriptors; a substantial fraction of a TX ring or even more than
> a full ring.  The TX queue selected for the skb may stall and trigger
> the TX watchdog repeatedly (since the problem skb will be retried
> after the TX reset).  This particularly affects sfc, for which the
> issue is designated as CVE-2012-3412.  However it may be that some
> hardware or firmware also fails to handle such an extreme TSO request
> correctly.
> 
> Therefore, limit the number of segments per skb to 100.  This should
> make no difference to behaviour unless the actual MSS is less than
> about 700.
> 
> Signed-off-by: Ben Hutchings <bhutchings@solarflare.com>
> ---


Hmm, isn't the GRO path also vulnerable?

An alternative would be to drop such frames in the ndo_start_xmit(), and
cap sk->sk_gso_max_size (since skbs are no longer orphaned...)

Or you could introduce a new sk->sk_gso_max_segments, that your sfc
driver sets to whatever limit?
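
Something along these lines, completely untested (the names below are
only placeholders, nothing like this exists yet):

/* driver advertises its per-skb segment limit */
efx->net_dev->gso_max_segs = EFX_TSO_MAX_SEGS;

/* sk_setup_caps() copies the device limit into the socket */
sk->sk_gso_max_segs = dst->dev->gso_max_segs;

/* and tcp_xmit_size_goal() clamps to the socket value instead of a
 * hard-coded constant */
tp->xmit_size_goal_segs = min_t(u32, xmit_size_goal / mss_now,
				sk->sk_gso_max_segs);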


* Re: [PATCH net 1/2] tcp: Limit number of segments generated by GSO per skb
  2012-07-30 17:31   ` Eric Dumazet
@ 2012-07-30 19:35     ` Ben Hutchings
  2012-07-30 19:56       ` Ben Hutchings
  2012-07-30 21:46       ` David Miller
  0 siblings, 2 replies; 13+ messages in thread
From: Ben Hutchings @ 2012-07-30 19:35 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: David Miller, netdev, linux-net-drivers

On Mon, 2012-07-30 at 19:31 +0200, Eric Dumazet wrote:
> On Mon, 2012-07-30 at 18:16 +0100, Ben Hutchings wrote:
> > A peer (or local user) may cause TCP to use a nominal MSS of as little
> > as 88 (actual MSS of 76 with timestamps).  Given that we have a
> > sufficiently prodigious local sender and the peer ACKs quickly enough,
> > it is nevertheless possible to grow the window for such a connection
> > to the point that we will try to send just under 64K at once.  This
> > results in a single skb that expands to 861 segments.
> > 
> > In some drivers with TSO support, such an skb will require hundreds of
> > DMA descriptors; a substantial fraction of a TX ring or even more than
> > a full ring.  The TX queue selected for the skb may stall and trigger
> > the TX watchdog repeatedly (since the problem skb will be retried
> > after the TX reset).  This particularly affects sfc, for which the
> > issue is designated as CVE-2012-3412.  However it may be that some
> > hardware or firmware also fails to handle such an extreme TSO request
> > correctly.
> > 
> > Therefore, limit the number of segments per skb to 100.  This should
> > make no difference to behaviour unless the actual MSS is less than
> > about 700.
> > 
> > Signed-off-by: Ben Hutchings <bhutchings@solarflare.com>
> > ---
> 
> 
> Hmm, isn't the GRO path also vulnerable?

You mean, for forwarding?  If page fragments are used, the number of
segments is limited to MAX_SKB_FRAGS < 100.  But if skbs are aggregated
and build_skb() is not used (e.g. due to jumbo MTU) it appears we would
need an explicit limit.  Something like this:

---
From: Ben Hutchings <bhutchings@solarflare.com>
Subject: [PATCH net] tcp: Limit number of segments merged by GRO

In the case where GRO aggregates skbs that cannot be converted to
page-fragments, there is currently no limit to the number of
segments that may be merged and subsequently re-segmented by GSO.

Apply the same limit as was introduced for locally-generated GSO skbs
in 'tcp: Limit number of segments generated by GSO per skb'.

Signed-off-by: Ben Hutchings <bhutchings@solarflare.com>
---
 net/ipv4/tcp.c |    3 ++-
 1 files changed, 2 insertions(+), 1 deletions(-)

diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
index 51d8daf..a052d07 100644
--- a/net/ipv4/tcp.c
+++ b/net/ipv4/tcp.c
@@ -3144,7 +3144,8 @@ out_check_final:
 					TCP_FLAG_RST | TCP_FLAG_SYN |
 					TCP_FLAG_FIN));
 
-	if (p && (!NAPI_GRO_CB(skb)->same_flow || flush))
+	if (p && (!NAPI_GRO_CB(skb)->same_flow || flush ||
+		  NAPI_GRO_CB(p)->count == TCP_MAX_GSO_SEGS))
 		pp = head;
 
 out:
---

> An alternative would be to drop such frames in the ndo_start_xmit(), and
> cap sk->sk_gso_max_size (since skbs are no longer orphaned...)

I have implemented that workaround for the out-of-tree version of sfc.
For the in-tree driver, I thought it would be better to limit the number
of segments at source, which will avoid penalising any cases where the
window can grow so much larger than MSS.
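
(The shape of that workaround is roughly the following; this is an
illustration only, not the actual out-of-tree code, and MAX_TSO_SEGS is
a made-up driver-local limit:)

/* in the driver's ndo_start_xmit handler: drop TSO skbs that would
 * expand into more segments than the ring could ever hold */
if (skb_is_gso(skb) && skb_shinfo(skb)->gso_segs > MAX_TSO_SEGS) {
	dev_kfree_skb_any(skb);
	net_dev->stats.tx_dropped++;
	return NETDEV_TX_OK;
}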

> Or you could introduce a new sk->sk_gso_max_segments, that your sfc
> driver sets to whatever limit?

Yes, that's another option.

Ben.

-- 
Ben Hutchings, Staff Engineer, Solarflare
Not speaking for my employer; that's the marketing department's job.
They asked us to note that Solarflare product names are trademarked.


* Re: [PATCH net 1/2] tcp: Limit number of segments generated by GSO per skb
  2012-07-30 17:23   ` Ben Greear
@ 2012-07-30 19:41     ` Ben Hutchings
  2012-07-30 21:00       ` Ben Greear
  0 siblings, 1 reply; 13+ messages in thread
From: Ben Hutchings @ 2012-07-30 19:41 UTC (permalink / raw)
  To: Ben Greear; +Cc: David Miller, netdev, linux-net-drivers

On Mon, 2012-07-30 at 10:23 -0700, Ben Greear wrote:
> On 07/30/2012 10:16 AM, Ben Hutchings wrote:
> > A peer (or local user) may cause TCP to use a nominal MSS of as little
> > as 88 (actual MSS of 76 with timestamps).  Given that we have a
> > sufficiently prodigious local sender and the peer ACKs quickly enough,
> > it is nevertheless possible to grow the window for such a connection
> > to the point that we will try to send just under 64K at once.  This
> > results in a single skb that expands to 861 segments.
> >
> > In some drivers with TSO support, such an skb will require hundreds of
> > DMA descriptors; a substantial fraction of a TX ring or even more than
> > a full ring.  The TX queue selected for the skb may stall and trigger
> > the TX watchdog repeatedly (since the problem skb will be retried
> > after the TX reset).  This particularly affects sfc, for which the
> > issue is designated as CVE-2012-3412.  However it may be that some
> > hardware or firmware also fails to handle such an extreme TSO request
> > correctly.
> >
> > Therefore, limit the number of segments per skb to 100.  This should
> > make no difference to behaviour unless the actual MSS is less than
> > about 700.
> 
> Please do not do this... or at least allow overrides.  We love
> the trick of setting a very small MSS and making the NICs generate
> huge numbers of small TCP frames with efficient user-space
> logic.  We use this for stateful TCP load testing when high
> numbers of TCP packets per second are desired.

Please test whether this actually makes a difference - my suspicion is
that 100 segments per skb is easily enough to prevent the host being a
bottleneck.

> Intel NICs, including 10G, work just fine with minimal MSS
> in this scenario.

I'll leave this to the Intel maintainers to answer.

Ben.

-- 
Ben Hutchings, Staff Engineer, Solarflare
Not speaking for my employer; that's the marketing department's job.
They asked us to note that Solarflare product names are trademarked.


* Re: [PATCH net 1/2] tcp: Limit number of segments generated by GSO per skb
  2012-07-30 19:35     ` Ben Hutchings
@ 2012-07-30 19:56       ` Ben Hutchings
  2012-07-30 21:46       ` David Miller
  1 sibling, 0 replies; 13+ messages in thread
From: Ben Hutchings @ 2012-07-30 19:56 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: David Miller, netdev, linux-net-drivers

On Mon, 2012-07-30 at 20:35 +0100, Ben Hutchings wrote:
> On Mon, 2012-07-30 at 19:31 +0200, Eric Dumazet wrote:
> > On Mon, 2012-07-30 at 18:16 +0100, Ben Hutchings wrote:
> > > A peer (or local user) may cause TCP to use a nominal MSS of as little
> > > as 88 (actual MSS of 76 with timestamps).  Given that we have a
> > > sufficiently prodigious local sender and the peer ACKs quickly enough,
> > > it is nevertheless possible to grow the window for such a connection
> > > to the point that we will try to send just under 64K at once.  This
> > > results in a single skb that expands to 861 segments.
> > > 
> > > In some drivers with TSO support, such an skb will require hundreds of
> > > DMA descriptors; a substantial fraction of a TX ring or even more than
> > > a full ring.  The TX queue selected for the skb may stall and trigger
> > > the TX watchdog repeatedly (since the problem skb will be retried
> > > after the TX reset).  This particularly affects sfc, for which the
> > > issue is designated as CVE-2012-3412.  However it may be that some
> > > hardware or firmware also fails to handle such an extreme TSO request
> > > correctly.
> > > 
> > > Therefore, limit the number of segments per skb to 100.  This should
> > > make no difference to behaviour unless the actual MSS is less than
> > > about 700.
[...]
> > An alternative would be to drop such frames in the ndo_start_xmit(), and
> > cap sk->sk_gso_max_size (since skbs are no longer orphaned...)
> 
> I have implemented that workaround for the out-of-tree version of sfc.
> For the in-tree driver, I thought it would be better to limit the number
> of segments at source, which will avoid penalising any cases where the
> window can grow so much larger than MSS.
[...]

I mean any *legitimate* cases where this can happen.

Ben.

-- 
Ben Hutchings, Staff Engineer, Solarflare
Not speaking for my employer; that's the marketing department's job.
They asked us to note that Solarflare product names are trademarked.


* Re: [PATCH net 1/2] tcp: Limit number of segments generated by GSO per skb
  2012-07-30 19:41     ` Ben Hutchings
@ 2012-07-30 21:00       ` Ben Greear
  0 siblings, 0 replies; 13+ messages in thread
From: Ben Greear @ 2012-07-30 21:00 UTC (permalink / raw)
  To: Ben Hutchings; +Cc: David Miller, netdev, linux-net-drivers

On 07/30/2012 12:41 PM, Ben Hutchings wrote:
> On Mon, 2012-07-30 at 10:23 -0700, Ben Greear wrote:
>> On 07/30/2012 10:16 AM, Ben Hutchings wrote:
>>> A peer (or local user) may cause TCP to use a nominal MSS of as little
>>> as 88 (actual MSS of 76 with timestamps).  Given that we have a
>>> sufficiently prodigious local sender and the peer ACKs quickly enough,
>>> it is nevertheless possible to grow the window for such a connection
>>> to the point that we will try to send just under 64K at once.  This
>>> results in a single skb that expands to 861 segments.
>>>
>>> In some drivers with TSO support, such an skb will require hundreds of
>>> DMA descriptors; a substantial fraction of a TX ring or even more than
>>> a full ring.  The TX queue selected for the skb may stall and trigger
>>> the TX watchdog repeatedly (since the problem skb will be retried
>>> after the TX reset).  This particularly affects sfc, for which the
>>> issue is designated as CVE-2012-3412.  However it may be that some
>>> hardware or firmware also fails to handle such an extreme TSO request
>>> correctly.
>>>
>>> Therefore, limit the number of segments per skb to 100.  This should
>>> make no difference to behaviour unless the actual MSS is less than
>>> about 700.
>>
>> Please do not do this... or at least allow overrides.  We love
>> the trick of setting a very small MSS and making the NICs generate
>> huge numbers of small TCP frames with efficient user-space
>> logic.  We use this for stateful TCP load testing when high
>> numbers of TCP packets per second are desired.
>
> Please test whether this actually makes a difference - my suspicion is
> that 100 segments per skb is easily enough to prevent the host being a
> bottleneck.

Any CPU I can save I can use for other tasks.  If we can use the
NIC's offload features to segment packets, then we get a near-linear
increase in packets per second by adding NICs... at least up to whatever
the total bandwidth of the system is.

If you want to have the OS default to a safe value, that is
fine by me... but please give us a tunable so that we can get
the old behaviour back.

It's always possible I'm not the only one using this,
and I think it would be considered bad form to break
existing features and provide no work-around.

Thanks,
Ben

>
>> Intel NICs, including 10G, work just fine with minimal MSS
>> in this scenario.
>
> I'll leave this to the Intel maintainers to answer.
>
> Ben.
>


-- 
Ben Greear <greearb@candelatech.com>
Candela Technologies Inc  http://www.candelatech.com


* Re: [PATCH net 1/2] tcp: Limit number of segments generated by GSO per skb
  2012-07-30 19:35     ` Ben Hutchings
  2012-07-30 19:56       ` Ben Hutchings
@ 2012-07-30 21:46       ` David Miller
  2012-07-30 22:20         ` Ben Hutchings
  1 sibling, 1 reply; 13+ messages in thread
From: David Miller @ 2012-07-30 21:46 UTC (permalink / raw)
  To: bhutchings; +Cc: eric.dumazet, netdev, linux-net-drivers

From: Ben Hutchings <bhutchings@solarflare.com>
Date: Mon, 30 Jul 2012 20:35:52 +0100

> On Mon, 2012-07-30 at 19:31 +0200, Eric Dumazet wrote:
>> Or you could introduce a new sk->sk_gso_max_segments, that your sfc
>> driver sets to whatever limit?
> 
> Yes, that's another option.

This is how I want this handled.


* Re: [PATCH net 1/2] tcp: Limit number of segments generated by GSO per skb
  2012-07-30 21:46       ` David Miller
@ 2012-07-30 22:20         ` Ben Hutchings
  2012-07-30 22:50           ` Stephen Hemminger
  0 siblings, 1 reply; 13+ messages in thread
From: Ben Hutchings @ 2012-07-30 22:20 UTC (permalink / raw)
  To: David Miller; +Cc: eric.dumazet, netdev, linux-net-drivers

On Mon, 2012-07-30 at 14:46 -0700, David Miller wrote:
> From: Ben Hutchings <bhutchings@solarflare.com>
> Date: Mon, 30 Jul 2012 20:35:52 +0100
> 
> > On Mon, 2012-07-30 at 19:31 +0200, Eric Dumazet wrote:
> >> Or you could introduce a new sk->sk_gso_max_segments, that your sfc
> >> driver sets to whatever limit?
> > 
> > Yes, that's another option.
> 
> This is how I want this handled.

How should that be applied in the GRO-forwarding case?

Ben.

-- 
Ben Hutchings, Staff Engineer, Solarflare
Not speaking for my employer; that's the marketing department's job.
They asked us to note that Solarflare product names are trademarked.


* Re: [PATCH net 1/2] tcp: Limit number of segments generated by GSO per skb
  2012-07-30 22:20         ` Ben Hutchings
@ 2012-07-30 22:50           ` Stephen Hemminger
  2012-07-30 23:07             ` Ben Hutchings
  0 siblings, 1 reply; 13+ messages in thread
From: Stephen Hemminger @ 2012-07-30 22:50 UTC (permalink / raw)
  To: Ben Hutchings; +Cc: David Miller, eric.dumazet, netdev, linux-net-drivers

On Mon, 30 Jul 2012 23:20:57 +0100
Ben Hutchings <bhutchings@solarflare.com> wrote:

> On Mon, 2012-07-30 at 14:46 -0700, David Miller wrote:
> > From: Ben Hutchings <bhutchings@solarflare.com>
> > Date: Mon, 30 Jul 2012 20:35:52 +0100
> > 
> > > On Mon, 2012-07-30 at 19:31 +0200, Eric Dumazet wrote:
> > >> Or you could introduce a new sk->sk_gso_max_segments, that your sfc
> > >> driver sets to whatever limit?
> > > 
> > > Yes, that's another option.
> > 
> > This is how I want this handled.
> 
> How should that be applied in the GRO-forwarding case?
> 
> Ben.
> 
Why not make max_frags a property of the device?

Something like the following untested idea:
diff --git a/net/core/dev.c b/net/core/dev.c
index 0ebaea1..bfb005b 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -2159,14 +2159,16 @@ EXPORT_SYMBOL(netif_skb_features);
  *	   at least one of fragments is in highmem and device does not
  *	   support DMA from it.
  */
-static inline int skb_needs_linearize(struct sk_buff *skb,
-				      int features)
+static inline bool skb_needs_linearize(struct sk_buff *skb,
+				       int features, unsigned int maxfrags)
 {
-	return skb_is_nonlinear(skb) &&
-			((skb_has_frag_list(skb) &&
-				!(features & NETIF_F_FRAGLIST)) ||
-			(skb_shinfo(skb)->nr_frags &&
-				!(features & NETIF_F_SG)));
+	if (!skb_is_nonlinear(skb))
+		return false;
+
+	if (skb_has_frag_list(skb))
+		return !(features & NETIF_F_FRAGLIST);
+	else
+		return skb_shinfo(skb)->nr_frags > maxfrags;
 }
 
 int dev_hard_start_xmit(struct sk_buff *skb, struct net_device *dev,
@@ -2206,7 +2208,7 @@ int dev_hard_start_xmit(struct sk_buff *skb, struct net_device *dev,
 			if (skb->next)
 				goto gso;
 		} else {
-			if (skb_needs_linearize(skb, features) &&
+			if (skb_needs_linearize(skb, features, dev->max_frags) &&
 			    __skb_linearize(skb))
 				goto out_kfree_skb;
 
@@ -5544,6 +5546,20 @@ int register_netdevice(struct net_device *dev)
 	dev->features |= NETIF_F_SOFT_FEATURES;
 	dev->wanted_features = dev->features & dev->hw_features;
 
+	if (dev->max_frags > 0) {
+		if (!(features & NETIF_F_SG)) {
+			netdev_dbg(dev,
+				   "Resetting max fragments since no NETIF_F_SG\n");
+			dev->max_frags = 0;
+		}
+	} else {
+		/* If device has not set maximum number of fragments
+		 * then assume it can take any number of them
+		 */
+		if (features & NETIF_F_SG)
+			dev->max_frags = MAX_SKB_FRAGS;
+	}
+
 	/* Turn on no cache copy if HW is doing checksum */
 	if (!(dev->flags & IFF_LOOPBACK)) {
 		dev->hw_features |= NETIF_F_NOCACHE_COPY;


* Re: [PATCH net 1/2] tcp: Limit number of segments generated by GSO per skb
  2012-07-30 22:50           ` Stephen Hemminger
@ 2012-07-30 23:07             ` Ben Hutchings
  0 siblings, 0 replies; 13+ messages in thread
From: Ben Hutchings @ 2012-07-30 23:07 UTC (permalink / raw)
  To: Stephen Hemminger; +Cc: David Miller, eric.dumazet, netdev, linux-net-drivers

On Mon, 2012-07-30 at 15:50 -0700, Stephen Hemminger wrote:
> On Mon, 30 Jul 2012 23:20:57 +0100
> Ben Hutchings <bhutchings@solarflare.com> wrote:
> 
> > On Mon, 2012-07-30 at 14:46 -0700, David Miller wrote:
> > > From: Ben Hutchings <bhutchings@solarflare.com>
> > > Date: Mon, 30 Jul 2012 20:35:52 +0100
> > > 
> > > > On Mon, 2012-07-30 at 19:31 +0200, Eric Dumazet wrote:
> > > >> Or you could introduce a new sk->sk_gso_max_segments, that your sfc
> > > >> driver sets to whatever limit?
> > > > 
> > > > Yes, that's another option.
> > > 
> > > This is how I want this handled.
> > 
> > How should that be applied in the GRO-forwarding case?
> > 
> > Ben.
> > 
> Why not make max_frags a property of the device?
[...]

This has nothing to do with the number of input fragments.  But I think
you're on the right track - this can be checked in netif_skb_features()
or something like that.
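
Something like this, perhaps, where gso_max_segs would be a new
per-device field that drivers such as sfc set (it does not exist yet):

/* in netif_skb_features(), roughly: */
if (skb_is_gso(skb) &&
    skb_shinfo(skb)->gso_segs > skb->dev->gso_max_segs)
	features &= ~NETIF_F_GSO_MASK;

The software GSO fallback in dev_hard_start_xmit() would then segment
such an skb before it reaches the driver.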

Ben.

-- 
Ben Hutchings, Staff Engineer, Solarflare
Not speaking for my employer; that's the marketing department's job.
They asked us to note that Solarflare product names are trademarked.

