All of lore.kernel.org
 help / color / mirror / Atom feed
* [PATCH 1/4] net: skb_orphan on dev_hard_start_xmit
@ 2009-05-29 14:14 Rusty Russell
  2009-05-29 15:11 ` Eric Dumazet
                   ` (3 more replies)
  0 siblings, 4 replies; 46+ messages in thread
From: Rusty Russell @ 2009-05-29 14:14 UTC (permalink / raw)
  To: netdev
  Cc: virtualization, Divy Le Ray, Roland Dreier, Pavel Emelianov,
	Dan Williams, libertas-dev, Divy Le Ray, Roland Dreier,
	Pavel Emelianov, Dan Williams, libertas-dev

Various drivers do skb_orphan() because they do not free transmitted
skbs in a timely manner (causing apps which hit their socket limits to
stall, socket close to hang, etc.).

DaveM points out that there are advantages to doing it generally (it's
more likely to be on same CPU than after xmit), and I couldn't find
any new starvation issues in simple benchmarking here.

This patch adds skb_orphan to the start of dev_hard_start_xmit(): it
can be premature in the NETDEV_TX_BUSY case, but that's uncommon.

I removed the drivers' skb_orphan(), though it's harmless.

Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
Cc: Divy Le Ray <divy@chelsio.com>
Cc: Roland Dreier <rolandd@cisco.com>
Cc: Pavel Emelianov <xemul@openvz.org>
Cc: Dan Williams <dcbw@redhat.com>
Cc: libertas-dev@lists.infradead.org
---
 drivers/net/cxgb3/sge.c            |   27 ---------------------------
 drivers/net/loopback.c             |    2 --
 drivers/net/mlx4/en_tx.c           |    4 ----
 drivers/net/niu.c                  |    3 +--
 drivers/net/veth.c                 |    2 --
 drivers/net/wireless/libertas/tx.c |    3 ---
 net/core/dev.c                     |   21 +++++----------------
 7 files changed, 6 insertions(+), 56 deletions(-)

diff --git a/drivers/net/cxgb3/sge.c b/drivers/net/cxgb3/sge.c
--- a/drivers/net/cxgb3/sge.c
+++ b/drivers/net/cxgb3/sge.c
@@ -1288,33 +1288,6 @@ int t3_eth_xmit(struct sk_buff *skb, str
 	dev->trans_start = jiffies;
 	spin_unlock(&q->lock);
 
-	/*
-	 * We do not use Tx completion interrupts to free DMAd Tx packets.
-	 * This is good for performamce but means that we rely on new Tx
-	 * packets arriving to run the destructors of completed packets,
-	 * which open up space in their sockets' send queues.  Sometimes
-	 * we do not get such new packets causing Tx to stall.  A single
-	 * UDP transmitter is a good example of this situation.  We have
-	 * a clean up timer that periodically reclaims completed packets
-	 * but it doesn't run often enough (nor do we want it to) to prevent
-	 * lengthy stalls.  A solution to this problem is to run the
-	 * destructor early, after the packet is queued but before it's DMAd.
-	 * A cons is that we lie to socket memory accounting, but the amount
-	 * of extra memory is reasonable (limited by the number of Tx
-	 * descriptors), the packets do actually get freed quickly by new
-	 * packets almost always, and for protocols like TCP that wait for
-	 * acks to really free up the data the extra memory is even less.
-	 * On the positive side we run the destructors on the sending CPU
-	 * rather than on a potentially different completing CPU, usually a
-	 * good thing.  We also run them without holding our Tx queue lock,
-	 * unlike what reclaim_completed_tx() would otherwise do.
-	 *
-	 * Run the destructor before telling the DMA engine about the packet
-	 * to make sure it doesn't complete and get freed prematurely.
-	 */
-	if (likely(!skb_shared(skb)))
-		skb_orphan(skb);
-
 	write_tx_pkt_wr(adap, skb, pi, pidx, gen, q, ndesc, compl);
 	check_ring_tx_db(adap, q);
 	return NETDEV_TX_OK;
diff --git a/drivers/net/loopback.c b/drivers/net/loopback.c
--- a/drivers/net/loopback.c
+++ b/drivers/net/loopback.c
@@ -72,8 +72,6 @@ static int loopback_xmit(struct sk_buff 
 {
 	struct pcpu_lstats *pcpu_lstats, *lb_stats;
 
-	skb_orphan(skb);
-
 	skb->protocol = eth_type_trans(skb,dev);
 
 	/* it's OK to use per_cpu_ptr() because BHs are off */
diff --git a/drivers/net/mlx4/en_tx.c b/drivers/net/mlx4/en_tx.c
--- a/drivers/net/mlx4/en_tx.c
+++ b/drivers/net/mlx4/en_tx.c
@@ -807,10 +807,6 @@ int mlx4_en_xmit(struct sk_buff *skb, st
 	if (tx_desc == (struct mlx4_en_tx_desc *) ring->bounce_buf)
 		tx_desc = mlx4_en_bounce_to_desc(priv, ring, index, desc_size);
 
-	/* Run destructor before passing skb to HW */
-	if (likely(!skb_shared(skb)))
-		skb_orphan(skb);
-
 	/* Ensure new descirptor hits memory
 	 * before setting ownership of this descriptor to HW */
 	wmb();
diff --git a/drivers/net/niu.c b/drivers/net/niu.c
--- a/drivers/net/niu.c
+++ b/drivers/net/niu.c
@@ -6667,8 +6667,7 @@ static int niu_start_xmit(struct sk_buff
 		}
 		kfree_skb(skb);
 		skb = skb_new;
-	} else
-		skb_orphan(skb);
+	}
 
 	align = ((unsigned long) skb->data & (16 - 1));
 	headroom = align + sizeof(struct tx_pkt_hdr);
diff --git a/drivers/net/veth.c b/drivers/net/veth.c
--- a/drivers/net/veth.c
+++ b/drivers/net/veth.c
@@ -155,8 +155,6 @@ static int veth_xmit(struct sk_buff *skb
 	struct veth_net_stats *stats, *rcv_stats;
 	int length, cpu;
 
-	skb_orphan(skb);
-
 	priv = netdev_priv(dev);
 	rcv = priv->peer;
 	rcv_priv = netdev_priv(rcv);
diff --git a/drivers/net/wireless/libertas/tx.c 
b/drivers/net/wireless/libertas/tx.c
--- a/drivers/net/wireless/libertas/tx.c
+++ b/drivers/net/wireless/libertas/tx.c
@@ -154,9 +154,6 @@ int lbs_hard_start_xmit(struct sk_buff *
 	if (priv->monitormode) {
 		/* Keep the skb to echo it back once Tx feedback is
 		   received from FW */
-		skb_orphan(skb);
-
-		/* Keep the skb around for when we get feedback */
 		priv->currenttxskb = skb;
 	} else {
  free:
diff --git a/net/core/dev.c b/net/core/dev.c
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -1677,6 +1677,10 @@ int dev_hard_start_xmit(struct sk_buff *
 	const struct net_device_ops *ops = dev->netdev_ops;
 	int rc;
 
+	/* Call destructor here.  It's SMP-cache-friendly and avoids issues
+	 * with drivers which can hold transmitted skbs for long times */
+	skb_orphan(skb);
+
 	if (likely(!skb->next)) {
 		if (!list_empty(&ptype_all))
 			dev_queue_xmit_nit(skb, dev);
@@ -1688,22 +1692,7 @@ int dev_hard_start_xmit(struct sk_buff *
 				goto gso;
 		}
 
-		rc = ops->ndo_start_xmit(skb, dev);
-		/*
-		 * TODO: if skb_orphan() was called by
-		 * dev->hard_start_xmit() (for example, the unmodified
-		 * igb driver does that; bnx2 doesn't), then
-		 * skb_tx_software_timestamp() will be unable to send
-		 * back the time stamp.
-		 *
-		 * How can this be prevented? Always create another
-		 * reference to the socket before calling
-		 * dev->hard_start_xmit()? Prevent that skb_orphan()
-		 * does anything in dev->hard_start_xmit() by clearing
-		 * the skb destructor before the call and restoring it
-		 * afterwards, then doing the skb_orphan() ourselves?
-		 */
-		return rc;
+		return ops->ndo_start_xmit(skb, dev);
 	}
 
 gso:



^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [PATCH 1/4] net: skb_orphan on dev_hard_start_xmit
  2009-05-29 14:14 [PATCH 1/4] net: skb_orphan on dev_hard_start_xmit Rusty Russell
@ 2009-05-29 15:11 ` Eric Dumazet
  2009-06-01 12:27   ` Rusty Russell
  2009-05-29 15:11 ` Eric Dumazet
                   ` (2 subsequent siblings)
  3 siblings, 1 reply; 46+ messages in thread
From: Eric Dumazet @ 2009-05-29 15:11 UTC (permalink / raw)
  To: Rusty Russell
  Cc: netdev, virtualization, Divy Le Ray, Roland Dreier,
	Pavel Emelianov, Dan Williams, libertas-dev

Rusty Russell a écrit :
> Various drivers do skb_orphan() because they do not free transmitted
> skbs in a timely manner (causing apps which hit their socket limits to
> stall, socket close to hang, etc.).
> 
> DaveM points out that there are advantages to doing it generally (it's
> more likely to be on same CPU than after xmit), and I couldn't find
> any new starvation issues in simple benchmarking here.
> 
>

If really no starvations are possible at all, I really wonder why some
guys added memory accounting to UDP flows. Maybe they dont run "simple
benchmarks" but real apps ? :)

For TCP, I agree your patch is a huge benefit, since its paced by remote ACKS
and window control, but an UDP sender will likely be able to saturate a link.

Maybe we can try to selectively call skb_orphan() only for tcp packets ?

I understand your motivations are the driver simplifications, so you
want all packets being orphaned... hmm...

 This patch adds skb_orphan to the start of dev_hard_start_xmit(): it
> can be premature in the NETDEV_TX_BUSY case, but that's uncommon.
> 
> I removed the drivers' skb_orphan(), though it's harmless.
> 
> Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
> Cc: Divy Le Ray <divy@chelsio.com>
> Cc: Roland Dreier <rolandd@cisco.com>
> Cc: Pavel Emelianov <xemul@openvz.org>
> Cc: Dan Williams <dcbw@redhat.com>
> Cc: libertas-dev@lists.infradead.org
> ---
>  drivers/net/cxgb3/sge.c            |   27 ---------------------------
>  drivers/net/loopback.c             |    2 --
>  drivers/net/mlx4/en_tx.c           |    4 ----
>  drivers/net/niu.c                  |    3 +--
>  drivers/net/veth.c                 |    2 --
>  drivers/net/wireless/libertas/tx.c |    3 ---
>  net/core/dev.c                     |   21 +++++----------------
>  7 files changed, 6 insertions(+), 56 deletions(-)
> 
> diff --git a/drivers/net/cxgb3/sge.c b/drivers/net/cxgb3/sge.c
> --- a/drivers/net/cxgb3/sge.c
> +++ b/drivers/net/cxgb3/sge.c
> @@ -1288,33 +1288,6 @@ int t3_eth_xmit(struct sk_buff *skb, str
>  	dev->trans_start = jiffies;
>  	spin_unlock(&q->lock);
>  
> -	/*
> -	 * We do not use Tx completion interrupts to free DMAd Tx packets.
> -	 * This is good for performamce but means that we rely on new Tx
> -	 * packets arriving to run the destructors of completed packets,
> -	 * which open up space in their sockets' send queues.  Sometimes
> -	 * we do not get such new packets causing Tx to stall.  A single
> -	 * UDP transmitter is a good example of this situation.  We have
> -	 * a clean up timer that periodically reclaims completed packets
> -	 * but it doesn't run often enough (nor do we want it to) to prevent
> -	 * lengthy stalls.  A solution to this problem is to run the
> -	 * destructor early, after the packet is queued but before it's DMAd.
> -	 * A cons is that we lie to socket memory accounting, but the amount
> -	 * of extra memory is reasonable (limited by the number of Tx
> -	 * descriptors), the packets do actually get freed quickly by new
> -	 * packets almost always, and for protocols like TCP that wait for
> -	 * acks to really free up the data the extra memory is even less.
> -	 * On the positive side we run the destructors on the sending CPU
> -	 * rather than on a potentially different completing CPU, usually a
> -	 * good thing.  We also run them without holding our Tx queue lock,
> -	 * unlike what reclaim_completed_tx() would otherwise do.
> -	 *
> -	 * Run the destructor before telling the DMA engine about the packet
> -	 * to make sure it doesn't complete and get freed prematurely.
> -	 */
> -	if (likely(!skb_shared(skb)))
> -		skb_orphan(skb);
> -
>  	write_tx_pkt_wr(adap, skb, pi, pidx, gen, q, ndesc, compl);
>  	check_ring_tx_db(adap, q);
>  	return NETDEV_TX_OK;
> diff --git a/drivers/net/loopback.c b/drivers/net/loopback.c
> --- a/drivers/net/loopback.c
> +++ b/drivers/net/loopback.c
> @@ -72,8 +72,6 @@ static int loopback_xmit(struct sk_buff 
>  {
>  	struct pcpu_lstats *pcpu_lstats, *lb_stats;
>  
> -	skb_orphan(skb);
> -
>  	skb->protocol = eth_type_trans(skb,dev);
>  
>  	/* it's OK to use per_cpu_ptr() because BHs are off */
> diff --git a/drivers/net/mlx4/en_tx.c b/drivers/net/mlx4/en_tx.c
> --- a/drivers/net/mlx4/en_tx.c
> +++ b/drivers/net/mlx4/en_tx.c
> @@ -807,10 +807,6 @@ int mlx4_en_xmit(struct sk_buff *skb, st
>  	if (tx_desc == (struct mlx4_en_tx_desc *) ring->bounce_buf)
>  		tx_desc = mlx4_en_bounce_to_desc(priv, ring, index, desc_size);
>  
> -	/* Run destructor before passing skb to HW */
> -	if (likely(!skb_shared(skb)))
> -		skb_orphan(skb);
> -
>  	/* Ensure new descirptor hits memory
>  	 * before setting ownership of this descriptor to HW */
>  	wmb();
> diff --git a/drivers/net/niu.c b/drivers/net/niu.c
> --- a/drivers/net/niu.c
> +++ b/drivers/net/niu.c
> @@ -6667,8 +6667,7 @@ static int niu_start_xmit(struct sk_buff
>  		}
>  		kfree_skb(skb);
>  		skb = skb_new;
> -	} else
> -		skb_orphan(skb);
> +	}
>  
>  	align = ((unsigned long) skb->data & (16 - 1));
>  	headroom = align + sizeof(struct tx_pkt_hdr);
> diff --git a/drivers/net/veth.c b/drivers/net/veth.c
> --- a/drivers/net/veth.c
> +++ b/drivers/net/veth.c
> @@ -155,8 +155,6 @@ static int veth_xmit(struct sk_buff *skb
>  	struct veth_net_stats *stats, *rcv_stats;
>  	int length, cpu;
>  
> -	skb_orphan(skb);
> -
>  	priv = netdev_priv(dev);
>  	rcv = priv->peer;
>  	rcv_priv = netdev_priv(rcv);
> diff --git a/drivers/net/wireless/libertas/tx.c 
> b/drivers/net/wireless/libertas/tx.c
> --- a/drivers/net/wireless/libertas/tx.c
> +++ b/drivers/net/wireless/libertas/tx.c
> @@ -154,9 +154,6 @@ int lbs_hard_start_xmit(struct sk_buff *
>  	if (priv->monitormode) {
>  		/* Keep the skb to echo it back once Tx feedback is
>  		   received from FW */
> -		skb_orphan(skb);
> -
> -		/* Keep the skb around for when we get feedback */
>  		priv->currenttxskb = skb;
>  	} else {
>   free:
> diff --git a/net/core/dev.c b/net/core/dev.c
> --- a/net/core/dev.c
> +++ b/net/core/dev.c
> @@ -1677,6 +1677,10 @@ int dev_hard_start_xmit(struct sk_buff *
>  	const struct net_device_ops *ops = dev->netdev_ops;
>  	int rc;
>  
> +	/* Call destructor here.  It's SMP-cache-friendly and avoids issues
> +	 * with drivers which can hold transmitted skbs for long times */
> +	skb_orphan(skb);
> +
>  	if (likely(!skb->next)) {
>  		if (!list_empty(&ptype_all))
>  			dev_queue_xmit_nit(skb, dev);
> @@ -1688,22 +1692,7 @@ int dev_hard_start_xmit(struct sk_buff *
>  				goto gso;
>  		}
>  
> -		rc = ops->ndo_start_xmit(skb, dev);
> -		/*
> -		 * TODO: if skb_orphan() was called by
> -		 * dev->hard_start_xmit() (for example, the unmodified
> -		 * igb driver does that; bnx2 doesn't), then
> -		 * skb_tx_software_timestamp() will be unable to send
> -		 * back the time stamp.
> -		 *
> -		 * How can this be prevented? Always create another
> -		 * reference to the socket before calling
> -		 * dev->hard_start_xmit()? Prevent that skb_orphan()
> -		 * does anything in dev->hard_start_xmit() by clearing
> -		 * the skb destructor before the call and restoring it
> -		 * afterwards, then doing the skb_orphan() ourselves?
> -		 */
> -		return rc;
> +		return ops->ndo_start_xmit(skb, dev);
>  	}
>  
>  gso:
> 
> 


^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [PATCH 1/4] net: skb_orphan on dev_hard_start_xmit
  2009-05-29 14:14 [PATCH 1/4] net: skb_orphan on dev_hard_start_xmit Rusty Russell
  2009-05-29 15:11 ` Eric Dumazet
@ 2009-05-29 15:11 ` Eric Dumazet
  2009-06-01 19:47 ` Patrick Ohly
  2009-06-01 19:47 ` Patrick Ohly
  3 siblings, 0 replies; 46+ messages in thread
From: Eric Dumazet @ 2009-05-29 15:11 UTC (permalink / raw)
  To: Rusty Russell
  Cc: libertas-dev, Dan Williams, netdev, virtualization,
	Roland Dreier, Divy Le Ray, Pavel Emelianov

Rusty Russell a écrit :
> Various drivers do skb_orphan() because they do not free transmitted
> skbs in a timely manner (causing apps which hit their socket limits to
> stall, socket close to hang, etc.).
> 
> DaveM points out that there are advantages to doing it generally (it's
> more likely to be on same CPU than after xmit), and I couldn't find
> any new starvation issues in simple benchmarking here.
> 
>

If really no starvations are possible at all, I really wonder why some
guys added memory accounting to UDP flows. Maybe they dont run "simple
benchmarks" but real apps ? :)

For TCP, I agree your patch is a huge benefit, since its paced by remote ACKS
and window control, but an UDP sender will likely be able to saturate a link.

Maybe we can try to selectively call skb_orphan() only for tcp packets ?

I understand your motivations are the driver simplifications, so you
want all packets being orphaned... hmm...

 This patch adds skb_orphan to the start of dev_hard_start_xmit(): it
> can be premature in the NETDEV_TX_BUSY case, but that's uncommon.
> 
> I removed the drivers' skb_orphan(), though it's harmless.
> 
> Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
> Cc: Divy Le Ray <divy@chelsio.com>
> Cc: Roland Dreier <rolandd@cisco.com>
> Cc: Pavel Emelianov <xemul@openvz.org>
> Cc: Dan Williams <dcbw@redhat.com>
> Cc: libertas-dev@lists.infradead.org
> ---
>  drivers/net/cxgb3/sge.c            |   27 ---------------------------
>  drivers/net/loopback.c             |    2 --
>  drivers/net/mlx4/en_tx.c           |    4 ----
>  drivers/net/niu.c                  |    3 +--
>  drivers/net/veth.c                 |    2 --
>  drivers/net/wireless/libertas/tx.c |    3 ---
>  net/core/dev.c                     |   21 +++++----------------
>  7 files changed, 6 insertions(+), 56 deletions(-)
> 
> diff --git a/drivers/net/cxgb3/sge.c b/drivers/net/cxgb3/sge.c
> --- a/drivers/net/cxgb3/sge.c
> +++ b/drivers/net/cxgb3/sge.c
> @@ -1288,33 +1288,6 @@ int t3_eth_xmit(struct sk_buff *skb, str
>  	dev->trans_start = jiffies;
>  	spin_unlock(&q->lock);
>  
> -	/*
> -	 * We do not use Tx completion interrupts to free DMAd Tx packets.
> -	 * This is good for performamce but means that we rely on new Tx
> -	 * packets arriving to run the destructors of completed packets,
> -	 * which open up space in their sockets' send queues.  Sometimes
> -	 * we do not get such new packets causing Tx to stall.  A single
> -	 * UDP transmitter is a good example of this situation.  We have
> -	 * a clean up timer that periodically reclaims completed packets
> -	 * but it doesn't run often enough (nor do we want it to) to prevent
> -	 * lengthy stalls.  A solution to this problem is to run the
> -	 * destructor early, after the packet is queued but before it's DMAd.
> -	 * A cons is that we lie to socket memory accounting, but the amount
> -	 * of extra memory is reasonable (limited by the number of Tx
> -	 * descriptors), the packets do actually get freed quickly by new
> -	 * packets almost always, and for protocols like TCP that wait for
> -	 * acks to really free up the data the extra memory is even less.
> -	 * On the positive side we run the destructors on the sending CPU
> -	 * rather than on a potentially different completing CPU, usually a
> -	 * good thing.  We also run them without holding our Tx queue lock,
> -	 * unlike what reclaim_completed_tx() would otherwise do.
> -	 *
> -	 * Run the destructor before telling the DMA engine about the packet
> -	 * to make sure it doesn't complete and get freed prematurely.
> -	 */
> -	if (likely(!skb_shared(skb)))
> -		skb_orphan(skb);
> -
>  	write_tx_pkt_wr(adap, skb, pi, pidx, gen, q, ndesc, compl);
>  	check_ring_tx_db(adap, q);
>  	return NETDEV_TX_OK;
> diff --git a/drivers/net/loopback.c b/drivers/net/loopback.c
> --- a/drivers/net/loopback.c
> +++ b/drivers/net/loopback.c
> @@ -72,8 +72,6 @@ static int loopback_xmit(struct sk_buff 
>  {
>  	struct pcpu_lstats *pcpu_lstats, *lb_stats;
>  
> -	skb_orphan(skb);
> -
>  	skb->protocol = eth_type_trans(skb,dev);
>  
>  	/* it's OK to use per_cpu_ptr() because BHs are off */
> diff --git a/drivers/net/mlx4/en_tx.c b/drivers/net/mlx4/en_tx.c
> --- a/drivers/net/mlx4/en_tx.c
> +++ b/drivers/net/mlx4/en_tx.c
> @@ -807,10 +807,6 @@ int mlx4_en_xmit(struct sk_buff *skb, st
>  	if (tx_desc == (struct mlx4_en_tx_desc *) ring->bounce_buf)
>  		tx_desc = mlx4_en_bounce_to_desc(priv, ring, index, desc_size);
>  
> -	/* Run destructor before passing skb to HW */
> -	if (likely(!skb_shared(skb)))
> -		skb_orphan(skb);
> -
>  	/* Ensure new descirptor hits memory
>  	 * before setting ownership of this descriptor to HW */
>  	wmb();
> diff --git a/drivers/net/niu.c b/drivers/net/niu.c
> --- a/drivers/net/niu.c
> +++ b/drivers/net/niu.c
> @@ -6667,8 +6667,7 @@ static int niu_start_xmit(struct sk_buff
>  		}
>  		kfree_skb(skb);
>  		skb = skb_new;
> -	} else
> -		skb_orphan(skb);
> +	}
>  
>  	align = ((unsigned long) skb->data & (16 - 1));
>  	headroom = align + sizeof(struct tx_pkt_hdr);
> diff --git a/drivers/net/veth.c b/drivers/net/veth.c
> --- a/drivers/net/veth.c
> +++ b/drivers/net/veth.c
> @@ -155,8 +155,6 @@ static int veth_xmit(struct sk_buff *skb
>  	struct veth_net_stats *stats, *rcv_stats;
>  	int length, cpu;
>  
> -	skb_orphan(skb);
> -
>  	priv = netdev_priv(dev);
>  	rcv = priv->peer;
>  	rcv_priv = netdev_priv(rcv);
> diff --git a/drivers/net/wireless/libertas/tx.c 
> b/drivers/net/wireless/libertas/tx.c
> --- a/drivers/net/wireless/libertas/tx.c
> +++ b/drivers/net/wireless/libertas/tx.c
> @@ -154,9 +154,6 @@ int lbs_hard_start_xmit(struct sk_buff *
>  	if (priv->monitormode) {
>  		/* Keep the skb to echo it back once Tx feedback is
>  		   received from FW */
> -		skb_orphan(skb);
> -
> -		/* Keep the skb around for when we get feedback */
>  		priv->currenttxskb = skb;
>  	} else {
>   free:
> diff --git a/net/core/dev.c b/net/core/dev.c
> --- a/net/core/dev.c
> +++ b/net/core/dev.c
> @@ -1677,6 +1677,10 @@ int dev_hard_start_xmit(struct sk_buff *
>  	const struct net_device_ops *ops = dev->netdev_ops;
>  	int rc;
>  
> +	/* Call destructor here.  It's SMP-cache-friendly and avoids issues
> +	 * with drivers which can hold transmitted skbs for long times */
> +	skb_orphan(skb);
> +
>  	if (likely(!skb->next)) {
>  		if (!list_empty(&ptype_all))
>  			dev_queue_xmit_nit(skb, dev);
> @@ -1688,22 +1692,7 @@ int dev_hard_start_xmit(struct sk_buff *
>  				goto gso;
>  		}
>  
> -		rc = ops->ndo_start_xmit(skb, dev);
> -		/*
> -		 * TODO: if skb_orphan() was called by
> -		 * dev->hard_start_xmit() (for example, the unmodified
> -		 * igb driver does that; bnx2 doesn't), then
> -		 * skb_tx_software_timestamp() will be unable to send
> -		 * back the time stamp.
> -		 *
> -		 * How can this be prevented? Always create another
> -		 * reference to the socket before calling
> -		 * dev->hard_start_xmit()? Prevent that skb_orphan()
> -		 * does anything in dev->hard_start_xmit() by clearing
> -		 * the skb destructor before the call and restoring it
> -		 * afterwards, then doing the skb_orphan() ourselves?
> -		 */
> -		return rc;
> +		return ops->ndo_start_xmit(skb, dev);
>  	}
>  
>  gso:
> 
> 

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [PATCH 1/4] net: skb_orphan on dev_hard_start_xmit
  2009-05-29 15:11 ` Eric Dumazet
@ 2009-06-01 12:27   ` Rusty Russell
  2009-06-03 21:02     ` Eric Dumazet
  2009-06-03 21:02     ` Eric Dumazet
  0 siblings, 2 replies; 46+ messages in thread
From: Rusty Russell @ 2009-06-01 12:27 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: libertas-dev, Dan Williams, netdev, virtualization,
	Roland Dreier, Divy Le Ray, Pavel Emelianov

On Sat, 30 May 2009 12:41:00 am Eric Dumazet wrote:
> Rusty Russell a écrit :
> > DaveM points out that there are advantages to doing it generally (it's
> > more likely to be on same CPU than after xmit), and I couldn't find
> > any new starvation issues in simple benchmarking here.
>
> If really no starvations are possible at all, I really wonder why some
> guys added memory accounting to UDP flows. Maybe they dont run "simple
> benchmarks" but real apps ? :)

Well, without any accounting at all you could use quite a lot of memory as 
there are many places packets can be queued.

> For TCP, I agree your patch is a huge benefit, since its paced by remote
> ACKS and window control

I doubt that.  There'll be some cache friendliness, but I'm not sure it'll be 
measurable, let alone "huge".  It's the win to drivers which don't have a 
timely and batching tx free mechanism which I aim for.

> , but an UDP sender will likely be able to saturate
> a link.

I couldn't see any difference in saturation here (with default scheduler and an 
100MBit e1000e).  Two reasons come to mind: firstly, only the hardware queue is 
unregulated: the tx queue is still accounted.  And when you add scheduling to 
the mix, I can't in practice cause starvation of other senders.

Hope that clarifies,
Rusty.

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [PATCH 1/4] net: skb_orphan on dev_hard_start_xmit
  2009-05-29 14:14 [PATCH 1/4] net: skb_orphan on dev_hard_start_xmit Rusty Russell
                   ` (2 preceding siblings ...)
  2009-06-01 19:47 ` Patrick Ohly
@ 2009-06-01 19:47 ` Patrick Ohly
  2009-06-02  7:25   ` David Miller
  2009-06-02  7:25   ` David Miller
  3 siblings, 2 replies; 46+ messages in thread
From: Patrick Ohly @ 2009-06-01 19:47 UTC (permalink / raw)
  To: Rusty Russell
  Cc: netdev, virtualization, Divy Le Ray, Roland Dreier,
	Pavel Emelianov, Dan Williams, libertas-dev

On Fri, 2009-05-29 at 23:44 +0930, Rusty Russell wrote:
> This patch adds skb_orphan to the start of dev_hard_start_xmit(): it
> can be premature in the NETDEV_TX_BUSY case, but that's uncommon.

Would it be possible to make the new skb_orphan() at the start of
dev_hard_start_xmit() conditionally so that it is not executed for
packets that are to be time stamped?

As discussed before
(http://article.gmane.org/gmane.linux.network/121378/), the skb->sk
socket pointer is required for sending back the send time stamp from
inside the device driver. Calling skb_orphan() unconditionally as in
this patch would break the hardware time stamping of outgoing packets.

This should work without much overhead (skb_tx() expands to a lookup of
the skb_shared_info):
if (!skb_tx(skb)->flags)
    skb_orphan(skb);

Instead of looking at the individual skb_shared_tx "hardware" and
"software" bits I'm just checking the whole byte here because it is
shorter and perhaps faster. Semantically the result is the same.

-- 
Best Regards, Patrick Ohly

The content of this message is my personal opinion only and although
I am an employee of Intel, the statements I make here in no way
represent Intel's position on the issue, nor am I authorized to speak
on behalf of Intel on this matter.



^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [PATCH 1/4] net: skb_orphan on dev_hard_start_xmit
  2009-05-29 14:14 [PATCH 1/4] net: skb_orphan on dev_hard_start_xmit Rusty Russell
  2009-05-29 15:11 ` Eric Dumazet
  2009-05-29 15:11 ` Eric Dumazet
@ 2009-06-01 19:47 ` Patrick Ohly
  2009-06-01 19:47 ` Patrick Ohly
  3 siblings, 0 replies; 46+ messages in thread
From: Patrick Ohly @ 2009-06-01 19:47 UTC (permalink / raw)
  To: Rusty Russell
  Cc: libertas-dev, Dan Williams, netdev, virtualization,
	Roland Dreier, Divy Le Ray, Pavel Emelianov

On Fri, 2009-05-29 at 23:44 +0930, Rusty Russell wrote:
> This patch adds skb_orphan to the start of dev_hard_start_xmit(): it
> can be premature in the NETDEV_TX_BUSY case, but that's uncommon.

Would it be possible to make the new skb_orphan() at the start of
dev_hard_start_xmit() conditionally so that it is not executed for
packets that are to be time stamped?

As discussed before
(http://article.gmane.org/gmane.linux.network/121378/), the skb->sk
socket pointer is required for sending back the send time stamp from
inside the device driver. Calling skb_orphan() unconditionally as in
this patch would break the hardware time stamping of outgoing packets.

This should work without much overhead (skb_tx() expands to a lookup of
the skb_shared_info):
if (!skb_tx(skb)->flags)
    skb_orphan(skb);

Instead of looking at the individual skb_shared_tx "hardware" and
"software" bits I'm just checking the whole byte here because it is
shorter and perhaps faster. Semantically the result is the same.

-- 
Best Regards, Patrick Ohly

The content of this message is my personal opinion only and although
I am an employee of Intel, the statements I make here in no way
represent Intel's position on the issue, nor am I authorized to speak
on behalf of Intel on this matter.

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [PATCH 1/4] net: skb_orphan on dev_hard_start_xmit
  2009-06-01 19:47 ` Patrick Ohly
  2009-06-02  7:25   ` David Miller
@ 2009-06-02  7:25   ` David Miller
  2009-06-02 14:08     ` Rusty Russell
  2009-06-02 14:08     ` Rusty Russell
  1 sibling, 2 replies; 46+ messages in thread
From: David Miller @ 2009-06-02  7:25 UTC (permalink / raw)
  To: patrick.ohly
  Cc: rusty, netdev, virtualization, divy, rolandd, xemul, dcbw, libertas-dev

From: Patrick Ohly <patrick.ohly@intel.com>
Date: Mon, 01 Jun 2009 21:47:22 +0200

> On Fri, 2009-05-29 at 23:44 +0930, Rusty Russell wrote:
>> This patch adds skb_orphan to the start of dev_hard_start_xmit(): it
>> can be premature in the NETDEV_TX_BUSY case, but that's uncommon.
> 
> Would it be possible to make the new skb_orphan() at the start of
> dev_hard_start_xmit() conditionally so that it is not executed for
> packets that are to be time stamped?
> 
> As discussed before
> (http://article.gmane.org/gmane.linux.network/121378/), the skb->sk
> socket pointer is required for sending back the send time stamp from
> inside the device driver. Calling skb_orphan() unconditionally as in
> this patch would break the hardware time stamping of outgoing packets.

Indeed, we need to check that case, at a minimum.

And there are other potentially other problems.  For example, I
wonder how this interacts with the new TX MMAP af_packet support
in net-next-2.6 :-/


^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [PATCH 1/4] net: skb_orphan on dev_hard_start_xmit
  2009-06-01 19:47 ` Patrick Ohly
@ 2009-06-02  7:25   ` David Miller
  2009-06-02  7:25   ` David Miller
  1 sibling, 0 replies; 46+ messages in thread
From: David Miller @ 2009-06-02  7:25 UTC (permalink / raw)
  To: patrick.ohly
  Cc: libertas-dev, dcbw, netdev, virtualization, rolandd, divy, xemul

From: Patrick Ohly <patrick.ohly@intel.com>
Date: Mon, 01 Jun 2009 21:47:22 +0200

> On Fri, 2009-05-29 at 23:44 +0930, Rusty Russell wrote:
>> This patch adds skb_orphan to the start of dev_hard_start_xmit(): it
>> can be premature in the NETDEV_TX_BUSY case, but that's uncommon.
> 
> Would it be possible to make the new skb_orphan() at the start of
> dev_hard_start_xmit() conditionally so that it is not executed for
> packets that are to be time stamped?
> 
> As discussed before
> (http://article.gmane.org/gmane.linux.network/121378/), the skb->sk
> socket pointer is required for sending back the send time stamp from
> inside the device driver. Calling skb_orphan() unconditionally as in
> this patch would break the hardware time stamping of outgoing packets.

Indeed, we need to check that case, at a minimum.

And there are other potentially other problems.  For example, I
wonder how this interacts with the new TX MMAP af_packet support
in net-next-2.6 :-/

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [PATCH 1/4] net: skb_orphan on dev_hard_start_xmit
  2009-06-02  7:25   ` David Miller
@ 2009-06-02 14:08     ` Rusty Russell
  2009-06-03  0:14       ` David Miller
  2009-06-02 14:08     ` Rusty Russell
  1 sibling, 1 reply; 46+ messages in thread
From: Rusty Russell @ 2009-06-02 14:08 UTC (permalink / raw)
  To: David Miller
  Cc: patrick.ohly, netdev, virtualization, divy, rolandd, xemul, dcbw,
	libertas-dev

On Tue, 2 Jun 2009 04:55:53 pm David Miller wrote:
> From: Patrick Ohly <patrick.ohly@intel.com>
> Date: Mon, 01 Jun 2009 21:47:22 +0200
>
> > On Fri, 2009-05-29 at 23:44 +0930, Rusty Russell wrote:
> >> This patch adds skb_orphan to the start of dev_hard_start_xmit(): it
> >> can be premature in the NETDEV_TX_BUSY case, but that's uncommon.
> >
> > Would it be possible to make the new skb_orphan() at the start of
> > dev_hard_start_xmit() conditionally so that it is not executed for
> > packets that are to be time stamped?
> >
> > As discussed before
> > (http://article.gmane.org/gmane.linux.network/121378/), the skb->sk
> > socket pointer is required for sending back the send time stamp from
> > inside the device driver. Calling skb_orphan() unconditionally as in
> > this patch would break the hardware time stamping of outgoing packets.
>
> Indeed, we need to check that case, at a minimum.
>
> And there are other potentially other problems.  For example, I
> wonder how this interacts with the new TX MMAP af_packet support
> in net-next-2.6 :-/

I think I'll do this in the driver for now, and let's revisit doing it 
generically later?

Thanks,
Rusty.

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [PATCH 1/4] net: skb_orphan on dev_hard_start_xmit
  2009-06-02  7:25   ` David Miller
  2009-06-02 14:08     ` Rusty Russell
@ 2009-06-02 14:08     ` Rusty Russell
  1 sibling, 0 replies; 46+ messages in thread
From: Rusty Russell @ 2009-06-02 14:08 UTC (permalink / raw)
  To: David Miller
  Cc: rolandd, libertas-dev, dcbw, netdev, virtualization,
	patrick.ohly, divy, xemul

On Tue, 2 Jun 2009 04:55:53 pm David Miller wrote:
> From: Patrick Ohly <patrick.ohly@intel.com>
> Date: Mon, 01 Jun 2009 21:47:22 +0200
>
> > On Fri, 2009-05-29 at 23:44 +0930, Rusty Russell wrote:
> >> This patch adds skb_orphan to the start of dev_hard_start_xmit(): it
> >> can be premature in the NETDEV_TX_BUSY case, but that's uncommon.
> >
> > Would it be possible to make the new skb_orphan() at the start of
> > dev_hard_start_xmit() conditionally so that it is not executed for
> > packets that are to be time stamped?
> >
> > As discussed before
> > (http://article.gmane.org/gmane.linux.network/121378/), the skb->sk
> > socket pointer is required for sending back the send time stamp from
> > inside the device driver. Calling skb_orphan() unconditionally as in
> > this patch would break the hardware time stamping of outgoing packets.
>
> Indeed, we need to check that case, at a minimum.
>
> And there are other potentially other problems.  For example, I
> wonder how this interacts with the new TX MMAP af_packet support
> in net-next-2.6 :-/

I think I'll do this in the driver for now, and let's revisit doing it 
generically later?

Thanks,
Rusty.

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [PATCH 1/4] net: skb_orphan on dev_hard_start_xmit
  2009-06-02 14:08     ` Rusty Russell
@ 2009-06-03  0:14       ` David Miller
  2009-07-03  7:55         ` Herbert Xu
  2009-07-03  7:55         ` Herbert Xu
  0 siblings, 2 replies; 46+ messages in thread
From: David Miller @ 2009-06-03  0:14 UTC (permalink / raw)
  To: rusty
  Cc: rolandd, libertas-dev, dcbw, netdev, virtualization,
	patrick.ohly, divy, xemul

From: Rusty Russell <rusty@rustcorp.com.au>
Date: Tue, 2 Jun 2009 23:38:29 +0930

> On Tue, 2 Jun 2009 04:55:53 pm David Miller wrote:
>> From: Patrick Ohly <patrick.ohly@intel.com>
>> Date: Mon, 01 Jun 2009 21:47:22 +0200
>>
>> > On Fri, 2009-05-29 at 23:44 +0930, Rusty Russell wrote:
>> >> This patch adds skb_orphan to the start of dev_hard_start_xmit(): it
>> >> can be premature in the NETDEV_TX_BUSY case, but that's uncommon.
>> >
>> > Would it be possible to make the new skb_orphan() at the start of
>> > dev_hard_start_xmit() conditionally so that it is not executed for
>> > packets that are to be time stamped?
>> >
>> > As discussed before
>> > (http://article.gmane.org/gmane.linux.network/121378/), the skb->sk
>> > socket pointer is required for sending back the send time stamp from
>> > inside the device driver. Calling skb_orphan() unconditionally as in
>> > this patch would break the hardware time stamping of outgoing packets.
>>
>> Indeed, we need to check that case, at a minimum.
>>
>> And there are other potentially other problems.  For example, I
>> wonder how this interacts with the new TX MMAP af_packet support
>> in net-next-2.6 :-/
> 
> I think I'll do this in the driver for now, and let's revisit doing it 
> generically later?

That might be the best course of action for the time being.
This whole area is a rat's nest.

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [PATCH 1/4] net: skb_orphan on dev_hard_start_xmit
  2009-06-01 12:27   ` Rusty Russell
  2009-06-03 21:02     ` Eric Dumazet
@ 2009-06-03 21:02     ` Eric Dumazet
  2009-06-04  3:54       ` Rusty Russell
  2009-06-04  3:54       ` Rusty Russell
  1 sibling, 2 replies; 46+ messages in thread
From: Eric Dumazet @ 2009-06-03 21:02 UTC (permalink / raw)
  To: Rusty Russell
  Cc: netdev, virtualization, Divy Le Ray, Roland Dreier,
	Pavel Emelianov, Dan Williams, libertas-dev

Rusty Russell a écrit :
> On Sat, 30 May 2009 12:41:00 am Eric Dumazet wrote:
>> Rusty Russell a écrit :
>>> DaveM points out that there are advantages to doing it generally (it's
>>> more likely to be on same CPU than after xmit), and I couldn't find
>>> any new starvation issues in simple benchmarking here.
>> If really no starvations are possible at all, I really wonder why some
>> guys added memory accounting to UDP flows. Maybe they dont run "simple
>> benchmarks" but real apps ? :)
> 
> Well, without any accounting at all you could use quite a lot of memory as 
> there are many places packets can be queued.
> 
>> For TCP, I agree your patch is a huge benefit, since its paced by remote
>> ACKS and window control
> 
> I doubt that.  There'll be some cache friendliness, but I'm not sure it'll be 
> measurable, let alone "huge".  It's the win to drivers which don't have a 
> timely and batching tx free mechanism which I aim for.

At 250.000 packets/second on a Gigabit link, this is huge, I can tell you.
(250.000 incoming packets and 250.000 outgoing packets per second, 700 Mbit/s)

According to this oprofile on CPU0 (dedicated to softirqs on one bnx2 eth adapter)

We can see sock_wfree() being number 2 on the profile, because it touches three cache lines per socket and
transmited packet in TX completion handler.

Also, taking a reference on socket for each xmit packet in flight is very expensive, since it slows
down receiver in __udp4_lib_lookup(). Several cpus are fighting for sk->refcnt cache line.


CPU: Core 2, speed 3000.24 MHz (estimated)
Counted CPU_CLK_UNHALTED events (Clock cycles when not halted) with a unit mask of 0x00 (Unhalted core cycles) count 100000
samples  cum. samples  %        cum. %     symbol name
21215    21215         11.8847  11.8847    bnx2_poll_work
17239    38454          9.6573  21.5420    sock_wfree      << effect of udp memory accounting >>
14817    53271          8.3005  29.8425    __slab_free
14635    67906          8.1986  38.0411    __udp4_lib_lookup
11425    79331          6.4003  44.4414    __alloc_skb
9710     89041          5.4396  49.8810    __slab_alloc
8095     97136          4.5348  54.4158    __udp4_lib_rcv
7831     104967         4.3869  58.8027    sock_def_write_space
7586     112553         4.2497  63.0524    ip_rcv
7518     120071         4.2116  67.2640    skb_dma_unmap
6711     126782         3.7595  71.0235    netif_receive_skb
6272     133054         3.5136  74.5371    udp_queue_rcv_skb
5262     138316         2.9478  77.4849    skb_release_data
5023     143339         2.8139  80.2988    __kmalloc_track_caller
4070     147409         2.2800  82.5788    kmem_cache_alloc
3216     150625         1.8016  84.3804    ipt_do_table
2576     153201         1.4431  85.8235    skb_queue_tail




^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [PATCH 1/4] net: skb_orphan on dev_hard_start_xmit
  2009-06-01 12:27   ` Rusty Russell
@ 2009-06-03 21:02     ` Eric Dumazet
  2009-06-03 21:02     ` Eric Dumazet
  1 sibling, 0 replies; 46+ messages in thread
From: Eric Dumazet @ 2009-06-03 21:02 UTC (permalink / raw)
  To: Rusty Russell
  Cc: libertas-dev, Dan Williams, netdev, virtualization,
	Roland Dreier, Divy Le Ray, Pavel Emelianov

Rusty Russell a écrit :
> On Sat, 30 May 2009 12:41:00 am Eric Dumazet wrote:
>> Rusty Russell a écrit :
>>> DaveM points out that there are advantages to doing it generally (it's
>>> more likely to be on same CPU than after xmit), and I couldn't find
>>> any new starvation issues in simple benchmarking here.
>> If really no starvations are possible at all, I really wonder why some
>> guys added memory accounting to UDP flows. Maybe they dont run "simple
>> benchmarks" but real apps ? :)
> 
> Well, without any accounting at all you could use quite a lot of memory as 
> there are many places packets can be queued.
> 
>> For TCP, I agree your patch is a huge benefit, since its paced by remote
>> ACKS and window control
> 
> I doubt that.  There'll be some cache friendliness, but I'm not sure it'll be 
> measurable, let alone "huge".  It's the win to drivers which don't have a 
> timely and batching tx free mechanism which I aim for.

At 250.000 packets/second on a Gigabit link, this is huge, I can tell you.
(250.000 incoming packets and 250.000 outgoing packets per second, 700 Mbit/s)

According to this oprofile on CPU0 (dedicated to softirqs on one bnx2 eth adapter)

We can see sock_wfree() being number 2 on the profile, because it touches three cache lines per socket and
transmited packet in TX completion handler.

Also, taking a reference on socket for each xmit packet in flight is very expensive, since it slows
down receiver in __udp4_lib_lookup(). Several cpus are fighting for sk->refcnt cache line.


CPU: Core 2, speed 3000.24 MHz (estimated)
Counted CPU_CLK_UNHALTED events (Clock cycles when not halted) with a unit mask of 0x00 (Unhalted core cycles) count 100000
samples  cum. samples  %        cum. %     symbol name
21215    21215         11.8847  11.8847    bnx2_poll_work
17239    38454          9.6573  21.5420    sock_wfree      << effect of udp memory accounting >>
14817    53271          8.3005  29.8425    __slab_free
14635    67906          8.1986  38.0411    __udp4_lib_lookup
11425    79331          6.4003  44.4414    __alloc_skb
9710     89041          5.4396  49.8810    __slab_alloc
8095     97136          4.5348  54.4158    __udp4_lib_rcv
7831     104967         4.3869  58.8027    sock_def_write_space
7586     112553         4.2497  63.0524    ip_rcv
7518     120071         4.2116  67.2640    skb_dma_unmap
6711     126782         3.7595  71.0235    netif_receive_skb
6272     133054         3.5136  74.5371    udp_queue_rcv_skb
5262     138316         2.9478  77.4849    skb_release_data
5023     143339         2.8139  80.2988    __kmalloc_track_caller
4070     147409         2.2800  82.5788    kmem_cache_alloc
3216     150625         1.8016  84.3804    ipt_do_table
2576     153201         1.4431  85.8235    skb_queue_tail

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [PATCH 1/4] net: skb_orphan on dev_hard_start_xmit
  2009-06-03 21:02     ` Eric Dumazet
@ 2009-06-04  3:54       ` Rusty Russell
  2009-06-04  4:00         ` David Miller
  2009-06-04  4:00         ` David Miller
  2009-06-04  3:54       ` Rusty Russell
  1 sibling, 2 replies; 46+ messages in thread
From: Rusty Russell @ 2009-06-04  3:54 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: netdev, virtualization, Divy Le Ray, Roland Dreier,
	Pavel Emelianov, Dan Williams, libertas-dev

On Thu, 4 Jun 2009 06:32:53 am Eric Dumazet wrote:
> Rusty Russell a écrit :
> > On Sat, 30 May 2009 12:41:00 am Eric Dumazet wrote:
> >> For TCP, I agree your patch is a huge benefit, since its paced by remote
> >> ACKS and window control
> >
> > I doubt that.  There'll be some cache friendliness, but I'm not sure
> > it'll be measurable, let alone "huge".
...
> We can see sock_wfree() being number 2 on the profile, because it touches
> three cache lines per socket and transmited packet in TX completion
> handler.

Interesting, I take it back: got some "after" stats as well?

> Also, taking a reference on socket for each xmit packet in flight is very
> expensive, since it slows down receiver in __udp4_lib_lookup(). Several
> cpus are fighting for sk->refcnt cache line.

Now we have decent dynamic per-cpu, we can finally implement bigrefs.  More 
obvious for device counts than sockets, but perhaps applicable here as well?

Rusty.

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [PATCH 1/4] net: skb_orphan on dev_hard_start_xmit
  2009-06-03 21:02     ` Eric Dumazet
  2009-06-04  3:54       ` Rusty Russell
@ 2009-06-04  3:54       ` Rusty Russell
  1 sibling, 0 replies; 46+ messages in thread
From: Rusty Russell @ 2009-06-04  3:54 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: libertas-dev, Dan Williams, netdev, virtualization,
	Roland Dreier, Divy Le Ray, Pavel Emelianov

On Thu, 4 Jun 2009 06:32:53 am Eric Dumazet wrote:
> Rusty Russell a écrit :
> > On Sat, 30 May 2009 12:41:00 am Eric Dumazet wrote:
> >> For TCP, I agree your patch is a huge benefit, since its paced by remote
> >> ACKS and window control
> >
> > I doubt that.  There'll be some cache friendliness, but I'm not sure
> > it'll be measurable, let alone "huge".
...
> We can see sock_wfree() being number 2 on the profile, because it touches
> three cache lines per socket and transmited packet in TX completion
> handler.

Interesting, I take it back: got some "after" stats as well?

> Also, taking a reference on socket for each xmit packet in flight is very
> expensive, since it slows down receiver in __udp4_lib_lookup(). Several
> cpus are fighting for sk->refcnt cache line.

Now we have decent dynamic per-cpu, we can finally implement bigrefs.  More 
obvious for device counts than sockets, but perhaps applicable here as well?

Rusty.

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [PATCH 1/4] net: skb_orphan on dev_hard_start_xmit
  2009-06-04  3:54       ` Rusty Russell
  2009-06-04  4:00         ` David Miller
@ 2009-06-04  4:00         ` David Miller
  2009-06-04  4:54           ` Eric Dumazet
  1 sibling, 1 reply; 46+ messages in thread
From: David Miller @ 2009-06-04  4:00 UTC (permalink / raw)
  To: rusty
  Cc: eric.dumazet, netdev, virtualization, divy, rolandd, xemul, dcbw,
	libertas-dev

From: Rusty Russell <rusty@rustcorp.com.au>
Date: Thu, 4 Jun 2009 13:24:57 +0930

> On Thu, 4 Jun 2009 06:32:53 am Eric Dumazet wrote:
>> Also, taking a reference on socket for each xmit packet in flight is very
>> expensive, since it slows down receiver in __udp4_lib_lookup(). Several
>> cpus are fighting for sk->refcnt cache line.
> 
> Now we have decent dynamic per-cpu, we can finally implement bigrefs.  More 
> obvious for device counts than sockets, but perhaps applicable here as well?

It might be very beneficial for longer lasting, active, connections, but
for high connection rates it's going to be a lose in my estimation.

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [PATCH 1/4] net: skb_orphan on dev_hard_start_xmit
  2009-06-04  3:54       ` Rusty Russell
@ 2009-06-04  4:00         ` David Miller
  2009-06-04  4:00         ` David Miller
  1 sibling, 0 replies; 46+ messages in thread
From: David Miller @ 2009-06-04  4:00 UTC (permalink / raw)
  To: rusty
  Cc: eric.dumazet, libertas-dev, dcbw, netdev, virtualization,
	rolandd, divy, xemul

From: Rusty Russell <rusty@rustcorp.com.au>
Date: Thu, 4 Jun 2009 13:24:57 +0930

> On Thu, 4 Jun 2009 06:32:53 am Eric Dumazet wrote:
>> Also, taking a reference on socket for each xmit packet in flight is very
>> expensive, since it slows down receiver in __udp4_lib_lookup(). Several
>> cpus are fighting for sk->refcnt cache line.
> 
> Now we have decent dynamic per-cpu, we can finally implement bigrefs.  More 
> obvious for device counts than sockets, but perhaps applicable here as well?

It might be very beneficial for longer lasting, active, connections, but
for high connection rates it's going to be a lose in my estimation.

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [PATCH 1/4] net: skb_orphan on dev_hard_start_xmit
  2009-06-04  4:00         ` David Miller
@ 2009-06-04  4:54           ` Eric Dumazet
  2009-06-04  4:56             ` David Miller
  2009-06-04  4:56             ` [PATCH 1/4] net: skb_orphan on dev_hard_start_xmit David Miller
  0 siblings, 2 replies; 46+ messages in thread
From: Eric Dumazet @ 2009-06-04  4:54 UTC (permalink / raw)
  To: David Miller
  Cc: libertas-dev, dcbw, netdev, virtualization, rolandd, divy, xemul

David Miller a écrit :
> From: Rusty Russell <rusty@rustcorp.com.au>
> Date: Thu, 4 Jun 2009 13:24:57 +0930
> 
>> On Thu, 4 Jun 2009 06:32:53 am Eric Dumazet wrote:
>>> Also, taking a reference on socket for each xmit packet in flight is very
>>> expensive, since it slows down receiver in __udp4_lib_lookup(). Several
>>> cpus are fighting for sk->refcnt cache line.
>> Now we have decent dynamic per-cpu, we can finally implement bigrefs.  More 
>> obvious for device counts than sockets, but perhaps applicable here as well?
> 
> It might be very beneficial for longer lasting, active, connections, but
> for high connection rates it's going to be a lose in my estimation.

Agreed.

We also can avoid the sock_put()/sock_hold() pair for each tx packet,
to only touch sk_wmem_alloc (with appropriate atomic_sub_return() in sock_wfree()
and atomic_dec_test in sk_free

We could initialize sk->sk_wmem_alloc to one instead of 0, so that
sock_wfree() could just synchronize itself with sk_free()

void sk_free(struct sock *sk)
{
	if (atomic_dec_test(&sk->sk_wmem_alloc))
		__sk_free(sk)
}

 static inline void skb_set_owner_w(struct sk_buff *skb, struct sock *sk)
 {
-       sock_hold(sk);
        skb->sk = sk;
        skb->destructor = sock_wfree;
        atomic_add(skb->truesize, &sk->sk_wmem_alloc);
 }

 void sock_wfree(struct sk_buff *skb)
 {
        struct sock *sk = skb->sk;
+       int res;

        /* In case it might be waiting for more memory. */
-       atomic_sub(skb->truesize, &sk->sk_wmem_alloc);
+       res = atomic_sub_return(skb->truesize, &sk->sk_wmem_alloc);
        if (!sock_flag(sk, SOCK_USE_WRITE_QUEUE))
                sk->sk_write_space(sk);
-       sock_put(sk);
+       if (res == 0)
+               __sk_free(sk);
 }

Patch will follow after some testing

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [PATCH 1/4] net: skb_orphan on dev_hard_start_xmit
  2009-06-04  4:54           ` Eric Dumazet
@ 2009-06-04  4:56             ` David Miller
  2009-06-04  9:18               ` [PATCH] net: No more expensive sock_hold()/sock_put() on each tx Eric Dumazet
  2009-06-04  4:56             ` [PATCH 1/4] net: skb_orphan on dev_hard_start_xmit David Miller
  1 sibling, 1 reply; 46+ messages in thread
From: David Miller @ 2009-06-04  4:56 UTC (permalink / raw)
  To: eric.dumazet
  Cc: rusty, netdev, virtualization, divy, rolandd, xemul, dcbw, libertas-dev

From: Eric Dumazet <eric.dumazet@gmail.com>
Date: Thu, 04 Jun 2009 06:54:24 +0200

> We also can avoid the sock_put()/sock_hold() pair for each tx packet,
> to only touch sk_wmem_alloc (with appropriate atomic_sub_return() in sock_wfree()
> and atomic_dec_test in sk_free
> 
> We could initialize sk->sk_wmem_alloc to one instead of 0, so that
> sock_wfree() could just synchronize itself with sk_free()

Excellent idea Eric.

> Patch will follow after some testing

I look forward to it :-)

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [PATCH 1/4] net: skb_orphan on dev_hard_start_xmit
  2009-06-04  4:54           ` Eric Dumazet
  2009-06-04  4:56             ` David Miller
@ 2009-06-04  4:56             ` David Miller
  1 sibling, 0 replies; 46+ messages in thread
From: David Miller @ 2009-06-04  4:56 UTC (permalink / raw)
  To: eric.dumazet
  Cc: libertas-dev, dcbw, netdev, virtualization, rolandd, divy, xemul

From: Eric Dumazet <eric.dumazet@gmail.com>
Date: Thu, 04 Jun 2009 06:54:24 +0200

> We also can avoid the sock_put()/sock_hold() pair for each tx packet,
> to only touch sk_wmem_alloc (with appropriate atomic_sub_return() in sock_wfree()
> and atomic_dec_test in sk_free
> 
> We could initialize sk->sk_wmem_alloc to one instead of 0, so that
> sock_wfree() could just synchronize itself with sk_free()

Excellent idea Eric.

> Patch will follow after some testing

I look forward to it :-)

^ permalink raw reply	[flat|nested] 46+ messages in thread

* [PATCH] net: No more expensive sock_hold()/sock_put() on each tx
  2009-06-04  4:56             ` David Miller
@ 2009-06-04  9:18               ` Eric Dumazet
  2009-06-04  9:26                 ` David Miller
  2009-06-10  8:17                 ` David Miller
  0 siblings, 2 replies; 46+ messages in thread
From: Eric Dumazet @ 2009-06-04  9:18 UTC (permalink / raw)
  To: David Miller; +Cc: rusty, netdev

David Miller a écrit :
> From: Eric Dumazet <eric.dumazet@gmail.com>
> Date: Thu, 04 Jun 2009 06:54:24 +0200
> 
>> We also can avoid the sock_put()/sock_hold() pair for each tx packet,
>> to only touch sk_wmem_alloc (with appropriate atomic_sub_return() in sock_wfree()
>> and atomic_dec_test in sk_free
>>
>> We could initialize sk->sk_wmem_alloc to one instead of 0, so that
>> sock_wfree() could just synchronize itself with sk_free()
> 
> Excellent idea Eric.

Thanks !

> 
>> Patch will follow after some testing
> 
> I look forward to it :-)

Here it is, based on net-next-2.6 

CC list trimmed down...

Next step would be to get rid of sk_callback_lock, I dont remember
if we already discussed about that rwlock, after RCUification of sockets...

Thanks

[PATCH] net: No more expensive sock_hold()/sock_put() on each tx

One of the problem with sock memory accounting is it uses
a pair of sock_hold()/sock_put() for each transmitted packet.

This slows down bidirectional flows because the receive path
also needs to take a refcount on socket and might use a different
cpu than transmit path or transmit completion path. So these
two atomic operations also trigger cache line bounces.

We can see this in tx or tx/rx workloads (media gateways for example),
where sock_wfree() can be in top five functions in profiles.

We use this sock_hold()/sock_put() so that sock freeing
is delayed until all tx packets are completed.

As we also update sk_wmem_alloc, we could offset sk_wmem_alloc
by one unit at init time, until sk_free() is called.
Once sk_free() is called, we atomic_dec_and_test(sk_wmem_alloc)
to decrement initial offset and atomicaly check if any packets
are in flight.

skb_set_owner_w() doesnt call sock_hold() anymore

sock_wfree() doesnt call sock_put() anymore, but check if sk_wmem_alloc
reached 0 to perform the final freeing.

Drawback is that a skb->truesize error could lead to unfreeable sockets, or
even worse, prematurely calling __sk_free() on a live socket.

Nice speedups on SMP. tbench for example, going from 2691 MB/s to 2711 MB/s
on my 8 cpu dev machine, even if tbench was not really hitting sk_refcnt
contention point. 5 % speedup on a UDP transmit workload (depends
on number of flows), lowering TX completion cpu usage.

Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
---
 include/net/sock.h    |    6 +++++-
 net/core/sock.c       |   29 +++++++++++++++++++++++++----
 net/ipv4/ip_output.c  |    1 -
 net/ipv6/ip6_output.c |    1 -
 4 files changed, 30 insertions(+), 7 deletions(-)

diff --git a/include/net/sock.h b/include/net/sock.h
index 4bb1ff9..010e14a 100644
--- a/include/net/sock.h
+++ b/include/net/sock.h
@@ -1217,9 +1217,13 @@ static inline int skb_copy_to_page(struct sock *sk, char __user *from,
 
 static inline void skb_set_owner_w(struct sk_buff *skb, struct sock *sk)
 {
-	sock_hold(sk);
 	skb->sk = sk;
 	skb->destructor = sock_wfree;
+	/*
+	 * We used to take a refcount on sk, but following operation
+	 * is enough to guarantee sk_free() wont free this sock until
+	 * all in-flight packets are completed
+	 */
 	atomic_add(skb->truesize, &sk->sk_wmem_alloc);
 }
 
diff --git a/net/core/sock.c b/net/core/sock.c
index 58dec9d..ce0159a 100644
--- a/net/core/sock.c
+++ b/net/core/sock.c
@@ -1005,7 +1005,7 @@ struct sock *sk_alloc(struct net *net, int family, gfp_t priority,
 }
 EXPORT_SYMBOL(sk_alloc);
 
-void sk_free(struct sock *sk)
+static void __sk_free(struct sock *sk)
 {
 	struct sk_filter *filter;
 
@@ -1028,6 +1028,17 @@ void sk_free(struct sock *sk)
 	put_net(sock_net(sk));
 	sk_prot_free(sk->sk_prot_creator, sk);
 }
+
+void sk_free(struct sock *sk)
+{
+	/*
+	 * We substract one from sk_wmem_alloc and can know if
+	 * some packets are still in some tx queue.
+	 * If not null, sock_wfree() will call __sk_free(sk) later
+	 */
+	if (atomic_dec_and_test(&sk->sk_wmem_alloc))
+		__sk_free(sk);
+}
 EXPORT_SYMBOL(sk_free);
 
 /*
@@ -1068,7 +1079,10 @@ struct sock *sk_clone(const struct sock *sk, const gfp_t priority)
 		newsk->sk_backlog.head	= newsk->sk_backlog.tail = NULL;
 
 		atomic_set(&newsk->sk_rmem_alloc, 0);
-		atomic_set(&newsk->sk_wmem_alloc, 0);
+		/*
+		 * sk_wmem_alloc set to one (see sk_free() and sock_wfree())
+		 */
+		atomic_set(&newsk->sk_wmem_alloc, 1);
 		atomic_set(&newsk->sk_omem_alloc, 0);
 		skb_queue_head_init(&newsk->sk_receive_queue);
 		skb_queue_head_init(&newsk->sk_write_queue);
@@ -1172,12 +1186,18 @@ void __init sk_init(void)
 void sock_wfree(struct sk_buff *skb)
 {
 	struct sock *sk = skb->sk;
+	int res;
 
 	/* In case it might be waiting for more memory. */
-	atomic_sub(skb->truesize, &sk->sk_wmem_alloc);
+	res = atomic_sub_return(skb->truesize, &sk->sk_wmem_alloc);
 	if (!sock_flag(sk, SOCK_USE_WRITE_QUEUE))
 		sk->sk_write_space(sk);
-	sock_put(sk);
+	/*
+	 * if sk_wmem_alloc reached 0, we are last user and should
+	 * free this sock, as sk_free() call could not do it.
+	 */
+	if (res == 0)
+		__sk_free(sk);
 }
 EXPORT_SYMBOL(sock_wfree);
 
@@ -1816,6 +1836,7 @@ void sock_init_data(struct socket *sock, struct sock *sk)
 	sk->sk_stamp = ktime_set(-1L, 0);
 
 	atomic_set(&sk->sk_refcnt, 1);
+	atomic_set(&sk->sk_wmem_alloc, 1);
 	atomic_set(&sk->sk_drops, 0);
 }
 EXPORT_SYMBOL(sock_init_data);
diff --git a/net/ipv4/ip_output.c b/net/ipv4/ip_output.c
index 3d6167f..badbfde 100644
--- a/net/ipv4/ip_output.c
+++ b/net/ipv4/ip_output.c
@@ -498,7 +498,6 @@ int ip_fragment(struct sk_buff *skb, int (*output)(struct sk_buff *))
 
 			BUG_ON(frag->sk);
 			if (skb->sk) {
-				sock_hold(skb->sk);
 				frag->sk = skb->sk;
 				frag->destructor = sock_wfree;
 				truesizes += frag->truesize;
diff --git a/net/ipv6/ip6_output.c b/net/ipv6/ip6_output.c
index c8dc8e5..18b9630 100644
--- a/net/ipv6/ip6_output.c
+++ b/net/ipv6/ip6_output.c
@@ -680,7 +680,6 @@ static int ip6_fragment(struct sk_buff *skb, int (*output)(struct sk_buff *))
 
 			BUG_ON(frag->sk);
 			if (skb->sk) {
-				sock_hold(skb->sk);
 				frag->sk = skb->sk;
 				frag->destructor = sock_wfree;
 				truesizes += frag->truesize;


^ permalink raw reply related	[flat|nested] 46+ messages in thread

* Re: [PATCH] net: No more expensive sock_hold()/sock_put() on each tx
  2009-06-04  9:18               ` [PATCH] net: No more expensive sock_hold()/sock_put() on each tx Eric Dumazet
@ 2009-06-04  9:26                 ` David Miller
  2009-06-10  8:17                 ` David Miller
  1 sibling, 0 replies; 46+ messages in thread
From: David Miller @ 2009-06-04  9:26 UTC (permalink / raw)
  To: eric.dumazet; +Cc: rusty, netdev

From: Eric Dumazet <eric.dumazet@gmail.com>
Date: Thu, 04 Jun 2009 11:18:35 +0200

> Next step would be to get rid of sk_callback_lock, I dont remember
> if we already discussed about that rwlock, after RCUification of sockets...

We discussed it, but the RCU'ification of core socket destruction
is pretty expensive.  Just do any simple connection rate based
test, like lat_connection from lmbench, and the effects are clear.

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [PATCH] net: No more expensive sock_hold()/sock_put() on each tx
  2009-06-04  9:18               ` [PATCH] net: No more expensive sock_hold()/sock_put() on each tx Eric Dumazet
  2009-06-04  9:26                 ` David Miller
@ 2009-06-10  8:17                 ` David Miller
  2009-06-10  8:30                   ` Eric Dumazet
  1 sibling, 1 reply; 46+ messages in thread
From: David Miller @ 2009-06-10  8:17 UTC (permalink / raw)
  To: eric.dumazet; +Cc: rusty, netdev

From: Eric Dumazet <eric.dumazet@gmail.com>
Date: Thu, 04 Jun 2009 11:18:35 +0200

> @@ -1172,12 +1186,18 @@ void __init sk_init(void)
>  void sock_wfree(struct sk_buff *skb)
>  {
>  	struct sock *sk = skb->sk;
> +	int res;
>  
>  	/* In case it might be waiting for more memory. */
> -	atomic_sub(skb->truesize, &sk->sk_wmem_alloc);
> +	res = atomic_sub_return(skb->truesize, &sk->sk_wmem_alloc);
>  	if (!sock_flag(sk, SOCK_USE_WRITE_QUEUE))
>  		sk->sk_write_space(sk);
> -	sock_put(sk);
> +	/*
> +	 * if sk_wmem_alloc reached 0, we are last user and should
> +	 * free this sock, as sk_free() call could not do it.
> +	 */
> +	if (res == 0)
> +		__sk_free(sk);
>  }
>  EXPORT_SYMBOL(sock_wfree);
>  

Eric, I don't understand this part, please enlighten me :-)

Just because we've liberated all of the write buffer space, that does
not mean that it's time to kill off the socket completely.

Right?

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [PATCH] net: No more expensive sock_hold()/sock_put() on each tx
  2009-06-10  8:17                 ` David Miller
@ 2009-06-10  8:30                   ` Eric Dumazet
  2009-06-11  9:56                     ` David Miller
  0 siblings, 1 reply; 46+ messages in thread
From: Eric Dumazet @ 2009-06-10  8:30 UTC (permalink / raw)
  To: David Miller; +Cc: rusty, netdev

David Miller a écrit :
> From: Eric Dumazet <eric.dumazet@gmail.com>
> Date: Thu, 04 Jun 2009 11:18:35 +0200
> 
>> @@ -1172,12 +1186,18 @@ void __init sk_init(void)
>>  void sock_wfree(struct sk_buff *skb)
>>  {
>>  	struct sock *sk = skb->sk;
>> +	int res;
>>  
>>  	/* In case it might be waiting for more memory. */
>> -	atomic_sub(skb->truesize, &sk->sk_wmem_alloc);
>> +	res = atomic_sub_return(skb->truesize, &sk->sk_wmem_alloc);
>>  	if (!sock_flag(sk, SOCK_USE_WRITE_QUEUE))
>>  		sk->sk_write_space(sk);
>> -	sock_put(sk);
>> +	/*
>> +	 * if sk_wmem_alloc reached 0, we are last user and should
>> +	 * free this sock, as sk_free() call could not do it.
>> +	 */
>> +	if (res == 0)
>> +		__sk_free(sk);
>>  }
>>  EXPORT_SYMBOL(sock_wfree);
>>  
> 
> Eric, I don't understand this part, please enlighten me :-)
> 
> Just because we've liberated all of the write buffer space, that does
> not mean that it's time to kill off the socket completely.
> 
> Right?

Remember we initialize this field to one.

If we freed all write buffer space, final value is one, not zero.

res == 0 only if we both freed all write buffer space, *and* socket was
also refcounted to 0 (sk_free() then realized it could not yet call __sk_free())

So we cheat a litle bit, because of this offset of one, we might block a sender a litle bit earlier :)

Thank you

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [PATCH] net: No more expensive sock_hold()/sock_put() on each tx
  2009-06-10  8:30                   ` Eric Dumazet
@ 2009-06-11  9:56                     ` David Miller
  0 siblings, 0 replies; 46+ messages in thread
From: David Miller @ 2009-06-11  9:56 UTC (permalink / raw)
  To: eric.dumazet; +Cc: rusty, netdev

From: Eric Dumazet <eric.dumazet@gmail.com>
Date: Wed, 10 Jun 2009 10:30:49 +0200

> David Miller a écrit :
>> From: Eric Dumazet <eric.dumazet@gmail.com>
>> Date: Thu, 04 Jun 2009 11:18:35 +0200
>> 
>> Eric, I don't understand this part, please enlighten me :-)
>> 
>> Just because we've liberated all of the write buffer space, that does
>> not mean that it's time to kill off the socket completely.
>> 
>> Right?
> 
> Remember we initialize this field to one.
> 
> If we freed all write buffer space, final value is one, not zero.
> 
> res == 0 only if we both freed all write buffer space, *and* socket was
> also refcounted to 0 (sk_free() then realized it could not yet call __sk_free())
> 
> So we cheat a litle bit, because of this offset of one, we might block a sender a litle bit earlier :)

Now I understand, thanks!

Patch applied.

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [PATCH 1/4] net: skb_orphan on dev_hard_start_xmit
  2009-06-03  0:14       ` David Miller
  2009-07-03  7:55         ` Herbert Xu
@ 2009-07-03  7:55         ` Herbert Xu
  2009-07-04  3:02           ` David Miller
  1 sibling, 1 reply; 46+ messages in thread
From: Herbert Xu @ 2009-07-03  7:55 UTC (permalink / raw)
  To: David Miller
  Cc: rusty, patrick.ohly, netdev, virtualization, divy, rolandd,
	xemul, dcbw, libertas-dev

David Miller <davem@davemloft.net> wrote:
>
>> I think I'll do this in the driver for now, and let's revisit doing it 
>> generically later?
> 
> That might be the best course of action for the time being.
> This whole area is a rat's nest.

Calling skb_orphan like this should be forbidden.  Apart from the
problems already raised, it is a sign that the driver is trying to
paper over a more serious issue of not cleaning up skb's timely.

Yes skb_orphan will work for the cases where calling the skb
destructor allows forward progress, but for the cases where you
really need to the skb to be freed (e.g., iSCSI or Xen), this
simply doesn't work.

So anytime someone tries to propose such a solution it is a sign
that they have bigger problems.

Cheers,
-- 
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu ~{PmV>HI~} <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [PATCH 1/4] net: skb_orphan on dev_hard_start_xmit
  2009-06-03  0:14       ` David Miller
@ 2009-07-03  7:55         ` Herbert Xu
  2009-07-03  7:55         ` Herbert Xu
  1 sibling, 0 replies; 46+ messages in thread
From: Herbert Xu @ 2009-07-03  7:55 UTC (permalink / raw)
  To: David Miller
  Cc: patrick.ohly, libertas-dev, dcbw, netdev, virtualization,
	rolandd, divy, xemul

David Miller <davem@davemloft.net> wrote:
>
>> I think I'll do this in the driver for now, and let's revisit doing it 
>> generically later?
> 
> That might be the best course of action for the time being.
> This whole area is a rat's nest.

Calling skb_orphan like this should be forbidden.  Apart from the
problems already raised, it is a sign that the driver is trying to
paper over a more serious issue of not cleaning up skb's timely.

Yes skb_orphan will work for the cases where calling the skb
destructor allows forward progress, but for the cases where you
really need to the skb to be freed (e.g., iSCSI or Xen), this
simply doesn't work.

So anytime someone tries to propose such a solution it is a sign
that they have bigger problems.

Cheers,
-- 
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu ~{PmV>HI~} <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [PATCH 1/4] net: skb_orphan on dev_hard_start_xmit
  2009-07-03  7:55         ` Herbert Xu
@ 2009-07-04  3:02           ` David Miller
  2009-07-04  3:08             ` Herbert Xu
  2009-07-04  3:08             ` Herbert Xu
  0 siblings, 2 replies; 46+ messages in thread
From: David Miller @ 2009-07-04  3:02 UTC (permalink / raw)
  To: herbert
  Cc: patrick.ohly, libertas-dev, dcbw, netdev, virtualization,
	rolandd, divy, xemul

From: Herbert Xu <herbert@gondor.apana.org.au>
Date: Fri, 3 Jul 2009 15:55:30 +0800

> Calling skb_orphan like this should be forbidden.  Apart from the
> problems already raised, it is a sign that the driver is trying to
> paper over a more serious issue of not cleaning up skb's timely.
> 
> Yes skb_orphan will work for the cases where calling the skb
> destructor allows forward progress, but for the cases where you
> really need to the skb to be freed (e.g., iSCSI or Xen), this
> simply doesn't work.
> 
> So anytime someone tries to propose such a solution it is a sign
> that they have bigger problems.

Agreed, but alas we are foaming at the mouth until we have a truly
usable alternative.

In particular the case of handling a device without usable TX
completion event indications is still quite troublesome.

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [PATCH 1/4] net: skb_orphan on dev_hard_start_xmit
  2009-07-04  3:02           ` David Miller
@ 2009-07-04  3:08             ` Herbert Xu
  2009-07-04  3:13               ` David Miller
  2009-07-04  3:13               ` David Miller
  2009-07-04  3:08             ` Herbert Xu
  1 sibling, 2 replies; 46+ messages in thread
From: Herbert Xu @ 2009-07-04  3:08 UTC (permalink / raw)
  To: David Miller
  Cc: rusty, patrick.ohly, netdev, virtualization, divy, rolandd,
	xemul, dcbw, libertas-dev

On Fri, Jul 03, 2009 at 08:02:54PM -0700, David Miller wrote:
>
> In particular the case of handling a device without usable TX
> completion event indications is still quite troublesome.

Which particular devices do you have in mind?

Cheers,
-- 
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu ~{PmV>HI~} <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [PATCH 1/4] net: skb_orphan on dev_hard_start_xmit
  2009-07-04  3:02           ` David Miller
  2009-07-04  3:08             ` Herbert Xu
@ 2009-07-04  3:08             ` Herbert Xu
  1 sibling, 0 replies; 46+ messages in thread
From: Herbert Xu @ 2009-07-04  3:08 UTC (permalink / raw)
  To: David Miller
  Cc: patrick.ohly, libertas-dev, dcbw, netdev, virtualization,
	rolandd, divy, xemul

On Fri, Jul 03, 2009 at 08:02:54PM -0700, David Miller wrote:
>
> In particular the case of handling a device without usable TX
> completion event indications is still quite troublesome.

Which particular devices do you have in mind?

Cheers,
-- 
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu ~{PmV>HI~} <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [PATCH 1/4] net: skb_orphan on dev_hard_start_xmit
  2009-07-04  3:08             ` Herbert Xu
@ 2009-07-04  3:13               ` David Miller
  2009-07-04  7:42                 ` Herbert Xu
  2009-07-04  7:42                 ` Herbert Xu
  2009-07-04  3:13               ` David Miller
  1 sibling, 2 replies; 46+ messages in thread
From: David Miller @ 2009-07-04  3:13 UTC (permalink / raw)
  To: herbert
  Cc: rusty, patrick.ohly, netdev, virtualization, divy, rolandd,
	xemul, dcbw, libertas-dev

From: Herbert Xu <herbert@gondor.apana.org.au>
Date: Sat, 4 Jul 2009 11:08:30 +0800

> On Fri, Jul 03, 2009 at 08:02:54PM -0700, David Miller wrote:
>>
>> In particular the case of handling a device without usable TX
>> completion event indications is still quite troublesome.
> e
> Which particular devices do you have in mind?

NIU

I basically can't defer interrupts because the chip supports
per-TX-desc interrupt indications but it lacks an "all TX queue sent"
event.  So if, say, tell it to interrupt every 1/4 of the TX queue
then up to 1/4 of the queue can have packets "stuck" in there
if TX activity all of a sudden ceases.

The only thing I've come up with to be able to mitigate interrupts is
to use an hrtimer of some sort.  But that's going to be hard to get
right, and who knows what kind of latencies will be introduced for TX
completion packet freeing unless I am very carefull.

And finally this belongs in generic code, not in the NIU driver,
whatever we come up with.  Especially since my understanding is that
this is similar to what Rusty needs.


^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [PATCH 1/4] net: skb_orphan on dev_hard_start_xmit
  2009-07-04  3:08             ` Herbert Xu
  2009-07-04  3:13               ` David Miller
@ 2009-07-04  3:13               ` David Miller
  1 sibling, 0 replies; 46+ messages in thread
From: David Miller @ 2009-07-04  3:13 UTC (permalink / raw)
  To: herbert
  Cc: patrick.ohly, libertas-dev, dcbw, netdev, virtualization,
	rolandd, divy, xemul

From: Herbert Xu <herbert@gondor.apana.org.au>
Date: Sat, 4 Jul 2009 11:08:30 +0800

> On Fri, Jul 03, 2009 at 08:02:54PM -0700, David Miller wrote:
>>
>> In particular the case of handling a device without usable TX
>> completion event indications is still quite troublesome.
> e
> Which particular devices do you have in mind?

NIU

I basically can't defer interrupts because the chip supports
per-TX-desc interrupt indications but it lacks an "all TX queue sent"
event.  So if, say, tell it to interrupt every 1/4 of the TX queue
then up to 1/4 of the queue can have packets "stuck" in there
if TX activity all of a sudden ceases.

The only thing I've come up with to be able to mitigate interrupts is
to use an hrtimer of some sort.  But that's going to be hard to get
right, and who knows what kind of latencies will be introduced for TX
completion packet freeing unless I am very carefull.

And finally this belongs in generic code, not in the NIU driver,
whatever we come up with.  Especially since my understanding is that
this is similar to what Rusty needs.

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [PATCH 1/4] net: skb_orphan on dev_hard_start_xmit
  2009-07-04  3:13               ` David Miller
@ 2009-07-04  7:42                 ` Herbert Xu
  2009-07-04  9:09                   ` Herbert Xu
  2009-07-04  9:09                   ` Herbert Xu
  2009-07-04  7:42                 ` Herbert Xu
  1 sibling, 2 replies; 46+ messages in thread
From: Herbert Xu @ 2009-07-04  7:42 UTC (permalink / raw)
  To: David Miller
  Cc: rusty, patrick.ohly, netdev, virtualization, divy, rolandd,
	xemul, dcbw, libertas-dev

On Fri, Jul 03, 2009 at 08:13:47PM -0700, David Miller wrote:
>
> NIU
> 
> I basically can't defer interrupts because the chip supports
> per-TX-desc interrupt indications but it lacks an "all TX queue sent"
> event.  So if, say, tell it to interrupt every 1/4 of the TX queue
> then up to 1/4 of the queue can have packets "stuck" in there
> if TX activity all of a sudden ceases.

Here's an idea: We let the sender decide whether we need to enable
notification.  This decision would be carried as a flag in the skb.
For example, UDP would set this flag when its socket buffer is close
to capacity.  Routing would set this flag per NAPI run, etc.

Of course you'd ignore this flag completely if the qdisc queue is
non-empty.

Cheers,
-- 
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu ~{PmV>HI~} <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [PATCH 1/4] net: skb_orphan on dev_hard_start_xmit
  2009-07-04  3:13               ` David Miller
  2009-07-04  7:42                 ` Herbert Xu
@ 2009-07-04  7:42                 ` Herbert Xu
  1 sibling, 0 replies; 46+ messages in thread
From: Herbert Xu @ 2009-07-04  7:42 UTC (permalink / raw)
  To: David Miller
  Cc: patrick.ohly, libertas-dev, dcbw, netdev, virtualization,
	rolandd, divy, xemul

On Fri, Jul 03, 2009 at 08:13:47PM -0700, David Miller wrote:
>
> NIU
> 
> I basically can't defer interrupts because the chip supports
> per-TX-desc interrupt indications but it lacks an "all TX queue sent"
> event.  So if, say, tell it to interrupt every 1/4 of the TX queue
> then up to 1/4 of the queue can have packets "stuck" in there
> if TX activity all of a sudden ceases.

Here's an idea: We let the sender decide whether we need to enable
notification.  This decision would be carried as a flag in the skb.
For example, UDP would set this flag when its socket buffer is close
to capacity.  Routing would set this flag per NAPI run, etc.

Of course you'd ignore this flag completely if the qdisc queue is
non-empty.

Cheers,
-- 
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu ~{PmV>HI~} <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [PATCH 1/4] net: skb_orphan on dev_hard_start_xmit
  2009-07-04  7:42                 ` Herbert Xu
@ 2009-07-04  9:09                   ` Herbert Xu
  2009-07-05  3:26                     ` Herbert Xu
  2009-07-04  9:09                   ` Herbert Xu
  1 sibling, 1 reply; 46+ messages in thread
From: Herbert Xu @ 2009-07-04  9:09 UTC (permalink / raw)
  To: David Miller
  Cc: rusty, patrick.ohly, netdev, virtualization, divy, rolandd,
	xemul, dcbw, libertas-dev

On Sat, Jul 04, 2009 at 03:42:45PM +0800, Herbert Xu wrote:
> 
> Here's an idea: We let the sender decide whether we need to enable
> notification.  This decision would be carried as a flag in the skb.
> For example, UDP would set this flag when its socket buffer is close
> to capacity.  Routing would set this flag per NAPI run, etc.

Actually it doesn't even matter for routing because only those
that are charged by the skb's or the pages care and they're the
only ones that would need to set this.

One potential problem is if the socket is constantly running
close to capacity, but that should only happen if the device
TX queue is also close to capacity which means that the qdisc
queue should be non-empty.

Cheers,
-- 
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu ~{PmV>HI~} <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [PATCH 1/4] net: skb_orphan on dev_hard_start_xmit
  2009-07-04  7:42                 ` Herbert Xu
  2009-07-04  9:09                   ` Herbert Xu
@ 2009-07-04  9:09                   ` Herbert Xu
  1 sibling, 0 replies; 46+ messages in thread
From: Herbert Xu @ 2009-07-04  9:09 UTC (permalink / raw)
  To: David Miller
  Cc: patrick.ohly, libertas-dev, dcbw, netdev, virtualization,
	rolandd, divy, xemul

On Sat, Jul 04, 2009 at 03:42:45PM +0800, Herbert Xu wrote:
> 
> Here's an idea: We let the sender decide whether we need to enable
> notification.  This decision would be carried as a flag in the skb.
> For example, UDP would set this flag when its socket buffer is close
> to capacity.  Routing would set this flag per NAPI run, etc.

Actually it doesn't even matter for routing because only those
that are charged by the skb's or the pages care and they're the
only ones that would need to set this.

One potential problem is if the socket is constantly running
close to capacity, but that should only happen if the device
TX queue is also close to capacity which means that the qdisc
queue should be non-empty.

Cheers,
-- 
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu ~{PmV>HI~} <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [PATCH 1/4] net: skb_orphan on dev_hard_start_xmit
  2009-07-04  9:09                   ` Herbert Xu
@ 2009-07-05  3:26                     ` Herbert Xu
  2009-07-05  3:34                       ` Herbert Xu
  2009-07-05  3:34                       ` Herbert Xu
  0 siblings, 2 replies; 46+ messages in thread
From: Herbert Xu @ 2009-07-05  3:26 UTC (permalink / raw)
  To: David Miller
  Cc: patrick.ohly, libertas-dev, dcbw, netdev, virtualization,
	rolandd, divy, xemul

On Sat, Jul 04, 2009 at 05:09:10PM +0800, Herbert Xu wrote:
>
> One potential problem is if the socket is constantly running
> close to capacity, but that should only happen if the device
> TX queue is also close to capacity which means that the qdisc
> queue should be non-empty.

Here's a another crazy idea:

Let's use dummy TX descriptors to generate an interrupt, either
with or without transmitting an actual packet on the wire depending
on the NIC.

xmit(skb)

	if (TX queue contains no interrupting descriptor &&
	    qdisc is empty)
		mark TX descriptor as interrupting

clean()

	do work
	if (TX queue contains no interrupting descriptor &&
	    TX queue non-empty &&
	    qdisc is empty)
		send dummy TX descriptor

This is based on the assumption that in the time it takes for
the NIC to process one packet and interrupt us, we can generate
more packets.  Obviously if we can't then even if the NIC had
a queue-empty interrupt capability it wouldn't help.

Cheers,
-- 
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu ~{PmV>HI~} <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [PATCH 1/4] net: skb_orphan on dev_hard_start_xmit
  2009-07-05  3:26                     ` Herbert Xu
  2009-07-05  3:34                       ` Herbert Xu
@ 2009-07-05  3:34                       ` Herbert Xu
  2009-08-18  1:47                         ` David Miller
  2009-08-18  1:47                         ` David Miller
  1 sibling, 2 replies; 46+ messages in thread
From: Herbert Xu @ 2009-07-05  3:34 UTC (permalink / raw)
  To: David Miller
  Cc: rusty, patrick.ohly, netdev, virtualization, divy, rolandd,
	xemul, dcbw, libertas-dev

On Sun, Jul 05, 2009 at 11:26:58AM +0800, Herbert Xu wrote:
> 
> Here's a another crazy idea:
> 
> Let's use dummy TX descriptors to generate an interrupt, either
> with or without transmitting an actual packet on the wire depending
> on the NIC.

Here's an even crazier idea that doesn't use dummy descriptors.

xmit(skb)

	if (TX queue contains no interrupting descriptor &&
	    qdisc is empty)
		mark TX descriptor as interrupting

	if (TX queue now contains an interrupting descriptor &&
	    qdisc len < 2)
		stop queue

	if (TX ring full)
		stop queue

clean()

 	do work
	wake queue as per usual

Cheers,
-- 
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu ~{PmV>HI~} <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [PATCH 1/4] net: skb_orphan on dev_hard_start_xmit
  2009-07-05  3:26                     ` Herbert Xu
@ 2009-07-05  3:34                       ` Herbert Xu
  2009-07-05  3:34                       ` Herbert Xu
  1 sibling, 0 replies; 46+ messages in thread
From: Herbert Xu @ 2009-07-05  3:34 UTC (permalink / raw)
  To: David Miller
  Cc: patrick.ohly, libertas-dev, dcbw, netdev, virtualization,
	rolandd, divy, xemul

On Sun, Jul 05, 2009 at 11:26:58AM +0800, Herbert Xu wrote:
> 
> Here's a another crazy idea:
> 
> Let's use dummy TX descriptors to generate an interrupt, either
> with or without transmitting an actual packet on the wire depending
> on the NIC.

Here's an even crazier idea that doesn't use dummy descriptors.

xmit(skb)

	if (TX queue contains no interrupting descriptor &&
	    qdisc is empty)
		mark TX descriptor as interrupting

	if (TX queue now contains an interrupting descriptor &&
	    qdisc len < 2)
		stop queue

	if (TX ring full)
		stop queue

clean()

 	do work
	wake queue as per usual

Cheers,
-- 
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu ~{PmV>HI~} <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [PATCH 1/4] net: skb_orphan on dev_hard_start_xmit
  2009-07-05  3:34                       ` Herbert Xu
@ 2009-08-18  1:47                         ` David Miller
  2009-08-19  3:19                           ` Herbert Xu
  2009-08-19  3:19                           ` Herbert Xu
  2009-08-18  1:47                         ` David Miller
  1 sibling, 2 replies; 46+ messages in thread
From: David Miller @ 2009-08-18  1:47 UTC (permalink / raw)
  To: herbert
  Cc: rusty, patrick.ohly, netdev, virtualization, divy, rolandd,
	xemul, dcbw, libertas-dev

From: Herbert Xu <herbert@gondor.apana.org.au>
Date: Sun, 5 Jul 2009 11:34:08 +0800

> Here's an even crazier idea that doesn't use dummy descriptors.
> 
> xmit(skb)
> 
> 	if (TX queue contains no interrupting descriptor &&
> 	    qdisc is empty)
> 		mark TX descriptor as interrupting
> 
> 	if (TX queue now contains an interrupting descriptor &&
> 	    qdisc len < 2)
> 		stop queue
> 
> 	if (TX ring full)
> 		stop queue
> 
> clean()
> 
>  	do work
> 	wake queue as per usual

I'm pretty sure that for normal TCP and UDP workloads, this is just
going to set the interrupt bit on the first packet that gets into the
queue, and then not in the rest.

TCP just loops over packets in the send queue, and at initial state
the qdisc will be empty.

It's very hard to get this to work as well as if we had a real
queue empty interrupt status event.

Even if you get upstream status from the protocols saying "there's
more packets coming" via some flag in the SKB, that only says what one
client feeding the TX ring is about to do.

It says nothing about other threads of control which are about to start
feeding packets to the same device.

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [PATCH 1/4] net: skb_orphan on dev_hard_start_xmit
  2009-07-05  3:34                       ` Herbert Xu
  2009-08-18  1:47                         ` David Miller
@ 2009-08-18  1:47                         ` David Miller
  1 sibling, 0 replies; 46+ messages in thread
From: David Miller @ 2009-08-18  1:47 UTC (permalink / raw)
  To: herbert
  Cc: patrick.ohly, libertas-dev, dcbw, netdev, virtualization,
	rolandd, divy, xemul

From: Herbert Xu <herbert@gondor.apana.org.au>
Date: Sun, 5 Jul 2009 11:34:08 +0800

> Here's an even crazier idea that doesn't use dummy descriptors.
> 
> xmit(skb)
> 
> 	if (TX queue contains no interrupting descriptor &&
> 	    qdisc is empty)
> 		mark TX descriptor as interrupting
> 
> 	if (TX queue now contains an interrupting descriptor &&
> 	    qdisc len < 2)
> 		stop queue
> 
> 	if (TX ring full)
> 		stop queue
> 
> clean()
> 
>  	do work
> 	wake queue as per usual

I'm pretty sure that for normal TCP and UDP workloads, this is just
going to set the interrupt bit on the first packet that gets into the
queue, and then not in the rest.

TCP just loops over packets in the send queue, and at initial state
the qdisc will be empty.

It's very hard to get this to work as well as if we had a real
queue empty interrupt status event.

Even if you get upstream status from the protocols saying "there's
more packets coming" via some flag in the SKB, that only says what one
client feeding the TX ring is about to do.

It says nothing about other threads of control which are about to start
feeding packets to the same device.

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [PATCH 1/4] net: skb_orphan on dev_hard_start_xmit
  2009-08-18  1:47                         ` David Miller
  2009-08-19  3:19                           ` Herbert Xu
@ 2009-08-19  3:19                           ` Herbert Xu
  2009-08-19  3:34                             ` David Miller
  2009-08-19  3:34                             ` David Miller
  1 sibling, 2 replies; 46+ messages in thread
From: Herbert Xu @ 2009-08-19  3:19 UTC (permalink / raw)
  To: David Miller
  Cc: rusty, patrick.ohly, netdev, virtualization, divy, rolandd,
	xemul, dcbw, libertas-dev

On Mon, Aug 17, 2009 at 06:47:12PM -0700, David Miller wrote:
> 
> I'm pretty sure that for normal TCP and UDP workloads, this is just
> going to set the interrupt bit on the first packet that gets into the
> queue, and then not in the rest.
> 
> TCP just loops over packets in the send queue, and at initial state
> the qdisc will be empty.

The scheme I actually tried on the e1000e is not quite the same
as what I had here.  I'm basically just adding descriptors per
usual.

However, I don't give them to the hardware immediately.  Instead
they're held back so that on average we have about three descriptors
in the queue spaced equally that will cause interrupts.  It seems
to work fairly well for netperf-generated TCP/UDP traffic in terms
of generating the right amount of interrupts (once every qlen/3
packets).

I haven't posted anything yet because I ran into a weird problem
with the e1000e.  It was generating a lot more interrupts than
what I'm telling it to do.  Even if I hard-code the interrupts
at 0, qlen/3, 2qlen/2 I still get way more than qlen/3 interrupts
for qlen packets.

This may be related to the fact that e1000e doesn't really have
a way of turning the interrupts off for a given descriptor.  Well
actually it does but it also turns off status reporting which means
that we will never know whether that entry has finished processing.
So I've had to rely on using its TX interrupt delay mechanism as an
approximation of turning interrupt notification off.  Evidently that
is not working in the way I intended it to.

I'm in the process of repeating the same experiment with cxgb3
which hopefully should let me turn interrupts off on descriptors
while still reporting completion status.

Cheers,
-- 
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu ~{PmV>HI~} <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [PATCH 1/4] net: skb_orphan on dev_hard_start_xmit
  2009-08-18  1:47                         ` David Miller
@ 2009-08-19  3:19                           ` Herbert Xu
  2009-08-19  3:19                           ` Herbert Xu
  1 sibling, 0 replies; 46+ messages in thread
From: Herbert Xu @ 2009-08-19  3:19 UTC (permalink / raw)
  To: David Miller
  Cc: patrick.ohly, libertas-dev, dcbw, netdev, virtualization,
	rolandd, divy, xemul

On Mon, Aug 17, 2009 at 06:47:12PM -0700, David Miller wrote:
> 
> I'm pretty sure that for normal TCP and UDP workloads, this is just
> going to set the interrupt bit on the first packet that gets into the
> queue, and then not in the rest.
> 
> TCP just loops over packets in the send queue, and at initial state
> the qdisc will be empty.

The scheme I actually tried on the e1000e is not quite the same
as what I had here.  I'm basically just adding descriptors per
usual.

However, I don't give them to the hardware immediately.  Instead
they're held back so that on average we have about three descriptors
in the queue spaced equally that will cause interrupts.  It seems
to work fairly well for netperf-generated TCP/UDP traffic in terms
of generating the right amount of interrupts (once every qlen/3
packets).

I haven't posted anything yet because I ran into a weird problem
with the e1000e.  It was generating a lot more interrupts than
what I'm telling it to do.  Even if I hard-code the interrupts
at 0, qlen/3, 2qlen/2 I still get way more than qlen/3 interrupts
for qlen packets.

This may be related to the fact that e1000e doesn't really have
a way of turning the interrupts off for a given descriptor.  Well
actually it does but it also turns off status reporting which means
that we will never know whether that entry has finished processing.
So I've had to rely on using its TX interrupt delay mechanism as an
approximation of turning interrupt notification off.  Evidently that
is not working in the way I intended it to.

I'm in the process of repeating the same experiment with cxgb3
which hopefully should let me turn interrupts off on descriptors
while still reporting completion status.

Cheers,
-- 
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu ~{PmV>HI~} <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [PATCH 1/4] net: skb_orphan on dev_hard_start_xmit
  2009-08-19  3:19                           ` Herbert Xu
  2009-08-19  3:34                             ` David Miller
@ 2009-08-19  3:34                             ` David Miller
  1 sibling, 0 replies; 46+ messages in thread
From: David Miller @ 2009-08-19  3:34 UTC (permalink / raw)
  To: herbert
  Cc: rusty, patrick.ohly, netdev, virtualization, divy, rolandd,
	xemul, dcbw, libertas-dev

From: Herbert Xu <herbert@gondor.apana.org.au>
Date: Wed, 19 Aug 2009 13:19:26 +1000

> I'm in the process of repeating the same experiment with cxgb3
> which hopefully should let me turn interrupts off on descriptors
> while still reporting completion status.

Ok, I look forward to seeing your work however it turns out.

Once I see what you've done, I'll give it a spin on niu.

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [PATCH 1/4] net: skb_orphan on dev_hard_start_xmit
  2009-08-19  3:19                           ` Herbert Xu
@ 2009-08-19  3:34                             ` David Miller
  2009-08-19  3:34                             ` David Miller
  1 sibling, 0 replies; 46+ messages in thread
From: David Miller @ 2009-08-19  3:34 UTC (permalink / raw)
  To: herbert
  Cc: patrick.ohly, libertas-dev, dcbw, netdev, virtualization,
	rolandd, divy, xemul

From: Herbert Xu <herbert@gondor.apana.org.au>
Date: Wed, 19 Aug 2009 13:19:26 +1000

> I'm in the process of repeating the same experiment with cxgb3
> which hopefully should let me turn interrupts off on descriptors
> while still reporting completion status.

Ok, I look forward to seeing your work however it turns out.

Once I see what you've done, I'll give it a spin on niu.

^ permalink raw reply	[flat|nested] 46+ messages in thread

* [PATCH 1/4] net: skb_orphan on dev_hard_start_xmit
@ 2009-05-29 14:14 Rusty Russell
  0 siblings, 0 replies; 46+ messages in thread
From: Rusty Russell @ 2009-05-29 14:14 UTC (permalink / raw)
  To: netdev
  Cc: libertas-dev, Dan Williams, virtualization, Roland Dreier,
	Divy Le Ray, Pavel Emelianov

Various drivers do skb_orphan() because they do not free transmitted
skbs in a timely manner (causing apps which hit their socket limits to
stall, socket close to hang, etc.).

DaveM points out that there are advantages to doing it generally (it's
more likely to be on same CPU than after xmit), and I couldn't find
any new starvation issues in simple benchmarking here.

This patch adds skb_orphan to the start of dev_hard_start_xmit(): it
can be premature in the NETDEV_TX_BUSY case, but that's uncommon.

I removed the drivers' skb_orphan(), though it's harmless.

Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
Cc: Divy Le Ray <divy@chelsio.com>
Cc: Roland Dreier <rolandd@cisco.com>
Cc: Pavel Emelianov <xemul@openvz.org>
Cc: Dan Williams <dcbw@redhat.com>
Cc: libertas-dev@lists.infradead.org
---
 drivers/net/cxgb3/sge.c            |   27 ---------------------------
 drivers/net/loopback.c             |    2 --
 drivers/net/mlx4/en_tx.c           |    4 ----
 drivers/net/niu.c                  |    3 +--
 drivers/net/veth.c                 |    2 --
 drivers/net/wireless/libertas/tx.c |    3 ---
 net/core/dev.c                     |   21 +++++----------------
 7 files changed, 6 insertions(+), 56 deletions(-)

diff --git a/drivers/net/cxgb3/sge.c b/drivers/net/cxgb3/sge.c
--- a/drivers/net/cxgb3/sge.c
+++ b/drivers/net/cxgb3/sge.c
@@ -1288,33 +1288,6 @@ int t3_eth_xmit(struct sk_buff *skb, str
 	dev->trans_start = jiffies;
 	spin_unlock(&q->lock);
 
-	/*
-	 * We do not use Tx completion interrupts to free DMAd Tx packets.
-	 * This is good for performamce but means that we rely on new Tx
-	 * packets arriving to run the destructors of completed packets,
-	 * which open up space in their sockets' send queues.  Sometimes
-	 * we do not get such new packets causing Tx to stall.  A single
-	 * UDP transmitter is a good example of this situation.  We have
-	 * a clean up timer that periodically reclaims completed packets
-	 * but it doesn't run often enough (nor do we want it to) to prevent
-	 * lengthy stalls.  A solution to this problem is to run the
-	 * destructor early, after the packet is queued but before it's DMAd.
-	 * A cons is that we lie to socket memory accounting, but the amount
-	 * of extra memory is reasonable (limited by the number of Tx
-	 * descriptors), the packets do actually get freed quickly by new
-	 * packets almost always, and for protocols like TCP that wait for
-	 * acks to really free up the data the extra memory is even less.
-	 * On the positive side we run the destructors on the sending CPU
-	 * rather than on a potentially different completing CPU, usually a
-	 * good thing.  We also run them without holding our Tx queue lock,
-	 * unlike what reclaim_completed_tx() would otherwise do.
-	 *
-	 * Run the destructor before telling the DMA engine about the packet
-	 * to make sure it doesn't complete and get freed prematurely.
-	 */
-	if (likely(!skb_shared(skb)))
-		skb_orphan(skb);
-
 	write_tx_pkt_wr(adap, skb, pi, pidx, gen, q, ndesc, compl);
 	check_ring_tx_db(adap, q);
 	return NETDEV_TX_OK;
diff --git a/drivers/net/loopback.c b/drivers/net/loopback.c
--- a/drivers/net/loopback.c
+++ b/drivers/net/loopback.c
@@ -72,8 +72,6 @@ static int loopback_xmit(struct sk_buff 
 {
 	struct pcpu_lstats *pcpu_lstats, *lb_stats;
 
-	skb_orphan(skb);
-
 	skb->protocol = eth_type_trans(skb,dev);
 
 	/* it's OK to use per_cpu_ptr() because BHs are off */
diff --git a/drivers/net/mlx4/en_tx.c b/drivers/net/mlx4/en_tx.c
--- a/drivers/net/mlx4/en_tx.c
+++ b/drivers/net/mlx4/en_tx.c
@@ -807,10 +807,6 @@ int mlx4_en_xmit(struct sk_buff *skb, st
 	if (tx_desc == (struct mlx4_en_tx_desc *) ring->bounce_buf)
 		tx_desc = mlx4_en_bounce_to_desc(priv, ring, index, desc_size);
 
-	/* Run destructor before passing skb to HW */
-	if (likely(!skb_shared(skb)))
-		skb_orphan(skb);
-
 	/* Ensure new descirptor hits memory
 	 * before setting ownership of this descriptor to HW */
 	wmb();
diff --git a/drivers/net/niu.c b/drivers/net/niu.c
--- a/drivers/net/niu.c
+++ b/drivers/net/niu.c
@@ -6667,8 +6667,7 @@ static int niu_start_xmit(struct sk_buff
 		}
 		kfree_skb(skb);
 		skb = skb_new;
-	} else
-		skb_orphan(skb);
+	}
 
 	align = ((unsigned long) skb->data & (16 - 1));
 	headroom = align + sizeof(struct tx_pkt_hdr);
diff --git a/drivers/net/veth.c b/drivers/net/veth.c
--- a/drivers/net/veth.c
+++ b/drivers/net/veth.c
@@ -155,8 +155,6 @@ static int veth_xmit(struct sk_buff *skb
 	struct veth_net_stats *stats, *rcv_stats;
 	int length, cpu;
 
-	skb_orphan(skb);
-
 	priv = netdev_priv(dev);
 	rcv = priv->peer;
 	rcv_priv = netdev_priv(rcv);
diff --git a/drivers/net/wireless/libertas/tx.c 
b/drivers/net/wireless/libertas/tx.c
--- a/drivers/net/wireless/libertas/tx.c
+++ b/drivers/net/wireless/libertas/tx.c
@@ -154,9 +154,6 @@ int lbs_hard_start_xmit(struct sk_buff *
 	if (priv->monitormode) {
 		/* Keep the skb to echo it back once Tx feedback is
 		   received from FW */
-		skb_orphan(skb);
-
-		/* Keep the skb around for when we get feedback */
 		priv->currenttxskb = skb;
 	} else {
  free:
diff --git a/net/core/dev.c b/net/core/dev.c
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -1677,6 +1677,10 @@ int dev_hard_start_xmit(struct sk_buff *
 	const struct net_device_ops *ops = dev->netdev_ops;
 	int rc;
 
+	/* Call destructor here.  It's SMP-cache-friendly and avoids issues
+	 * with drivers which can hold transmitted skbs for long times */
+	skb_orphan(skb);
+
 	if (likely(!skb->next)) {
 		if (!list_empty(&ptype_all))
 			dev_queue_xmit_nit(skb, dev);
@@ -1688,22 +1692,7 @@ int dev_hard_start_xmit(struct sk_buff *
 				goto gso;
 		}
 
-		rc = ops->ndo_start_xmit(skb, dev);
-		/*
-		 * TODO: if skb_orphan() was called by
-		 * dev->hard_start_xmit() (for example, the unmodified
-		 * igb driver does that; bnx2 doesn't), then
-		 * skb_tx_software_timestamp() will be unable to send
-		 * back the time stamp.
-		 *
-		 * How can this be prevented? Always create another
-		 * reference to the socket before calling
-		 * dev->hard_start_xmit()? Prevent that skb_orphan()
-		 * does anything in dev->hard_start_xmit() by clearing
-		 * the skb destructor before the call and restoring it
-		 * afterwards, then doing the skb_orphan() ourselves?
-		 */
-		return rc;
+		return ops->ndo_start_xmit(skb, dev);
 	}
 
 gso:

^ permalink raw reply	[flat|nested] 46+ messages in thread

end of thread, other threads:[~2009-08-19  3:34 UTC | newest]

Thread overview: 46+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2009-05-29 14:14 [PATCH 1/4] net: skb_orphan on dev_hard_start_xmit Rusty Russell
2009-05-29 15:11 ` Eric Dumazet
2009-06-01 12:27   ` Rusty Russell
2009-06-03 21:02     ` Eric Dumazet
2009-06-03 21:02     ` Eric Dumazet
2009-06-04  3:54       ` Rusty Russell
2009-06-04  4:00         ` David Miller
2009-06-04  4:00         ` David Miller
2009-06-04  4:54           ` Eric Dumazet
2009-06-04  4:56             ` David Miller
2009-06-04  9:18               ` [PATCH] net: No more expensive sock_hold()/sock_put() on each tx Eric Dumazet
2009-06-04  9:26                 ` David Miller
2009-06-10  8:17                 ` David Miller
2009-06-10  8:30                   ` Eric Dumazet
2009-06-11  9:56                     ` David Miller
2009-06-04  4:56             ` [PATCH 1/4] net: skb_orphan on dev_hard_start_xmit David Miller
2009-06-04  3:54       ` Rusty Russell
2009-05-29 15:11 ` Eric Dumazet
2009-06-01 19:47 ` Patrick Ohly
2009-06-01 19:47 ` Patrick Ohly
2009-06-02  7:25   ` David Miller
2009-06-02  7:25   ` David Miller
2009-06-02 14:08     ` Rusty Russell
2009-06-03  0:14       ` David Miller
2009-07-03  7:55         ` Herbert Xu
2009-07-03  7:55         ` Herbert Xu
2009-07-04  3:02           ` David Miller
2009-07-04  3:08             ` Herbert Xu
2009-07-04  3:13               ` David Miller
2009-07-04  7:42                 ` Herbert Xu
2009-07-04  9:09                   ` Herbert Xu
2009-07-05  3:26                     ` Herbert Xu
2009-07-05  3:34                       ` Herbert Xu
2009-07-05  3:34                       ` Herbert Xu
2009-08-18  1:47                         ` David Miller
2009-08-19  3:19                           ` Herbert Xu
2009-08-19  3:19                           ` Herbert Xu
2009-08-19  3:34                             ` David Miller
2009-08-19  3:34                             ` David Miller
2009-08-18  1:47                         ` David Miller
2009-07-04  9:09                   ` Herbert Xu
2009-07-04  7:42                 ` Herbert Xu
2009-07-04  3:13               ` David Miller
2009-07-04  3:08             ` Herbert Xu
2009-06-02 14:08     ` Rusty Russell
  -- strict thread matches above, loose matches on Subject: below --
2009-05-29 14:14 Rusty Russell

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.