* [PATCH 1/4] net: skb_orphan on dev_hard_start_xmit
@ 2009-05-29 14:14 Rusty Russell
2009-05-29 15:11 ` Eric Dumazet
` (3 more replies)
0 siblings, 4 replies; 46+ messages in thread
From: Rusty Russell @ 2009-05-29 14:14 UTC (permalink / raw)
To: netdev
Cc: virtualization, Divy Le Ray, Roland Dreier, Pavel Emelianov,
Dan Williams, libertas-dev, Divy Le Ray, Roland Dreier,
Pavel Emelianov, Dan Williams, libertas-dev
Various drivers do skb_orphan() because they do not free transmitted
skbs in a timely manner (causing apps which hit their socket limits to
stall, socket close to hang, etc.).
DaveM points out that there are advantages to doing it generally (it's
more likely to be on same CPU than after xmit), and I couldn't find
any new starvation issues in simple benchmarking here.
This patch adds skb_orphan to the start of dev_hard_start_xmit(): it
can be premature in the NETDEV_TX_BUSY case, but that's uncommon.
I removed the drivers' skb_orphan(), though it's harmless.
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
Cc: Divy Le Ray <divy@chelsio.com>
Cc: Roland Dreier <rolandd@cisco.com>
Cc: Pavel Emelianov <xemul@openvz.org>
Cc: Dan Williams <dcbw@redhat.com>
Cc: libertas-dev@lists.infradead.org
---
drivers/net/cxgb3/sge.c | 27 ---------------------------
drivers/net/loopback.c | 2 --
drivers/net/mlx4/en_tx.c | 4 ----
drivers/net/niu.c | 3 +--
drivers/net/veth.c | 2 --
drivers/net/wireless/libertas/tx.c | 3 ---
net/core/dev.c | 21 +++++----------------
7 files changed, 6 insertions(+), 56 deletions(-)
diff --git a/drivers/net/cxgb3/sge.c b/drivers/net/cxgb3/sge.c
--- a/drivers/net/cxgb3/sge.c
+++ b/drivers/net/cxgb3/sge.c
@@ -1288,33 +1288,6 @@ int t3_eth_xmit(struct sk_buff *skb, str
dev->trans_start = jiffies;
spin_unlock(&q->lock);
- /*
- * We do not use Tx completion interrupts to free DMAd Tx packets.
- * This is good for performamce but means that we rely on new Tx
- * packets arriving to run the destructors of completed packets,
- * which open up space in their sockets' send queues. Sometimes
- * we do not get such new packets causing Tx to stall. A single
- * UDP transmitter is a good example of this situation. We have
- * a clean up timer that periodically reclaims completed packets
- * but it doesn't run often enough (nor do we want it to) to prevent
- * lengthy stalls. A solution to this problem is to run the
- * destructor early, after the packet is queued but before it's DMAd.
- * A cons is that we lie to socket memory accounting, but the amount
- * of extra memory is reasonable (limited by the number of Tx
- * descriptors), the packets do actually get freed quickly by new
- * packets almost always, and for protocols like TCP that wait for
- * acks to really free up the data the extra memory is even less.
- * On the positive side we run the destructors on the sending CPU
- * rather than on a potentially different completing CPU, usually a
- * good thing. We also run them without holding our Tx queue lock,
- * unlike what reclaim_completed_tx() would otherwise do.
- *
- * Run the destructor before telling the DMA engine about the packet
- * to make sure it doesn't complete and get freed prematurely.
- */
- if (likely(!skb_shared(skb)))
- skb_orphan(skb);
-
write_tx_pkt_wr(adap, skb, pi, pidx, gen, q, ndesc, compl);
check_ring_tx_db(adap, q);
return NETDEV_TX_OK;
diff --git a/drivers/net/loopback.c b/drivers/net/loopback.c
--- a/drivers/net/loopback.c
+++ b/drivers/net/loopback.c
@@ -72,8 +72,6 @@ static int loopback_xmit(struct sk_buff
{
struct pcpu_lstats *pcpu_lstats, *lb_stats;
- skb_orphan(skb);
-
skb->protocol = eth_type_trans(skb,dev);
/* it's OK to use per_cpu_ptr() because BHs are off */
diff --git a/drivers/net/mlx4/en_tx.c b/drivers/net/mlx4/en_tx.c
--- a/drivers/net/mlx4/en_tx.c
+++ b/drivers/net/mlx4/en_tx.c
@@ -807,10 +807,6 @@ int mlx4_en_xmit(struct sk_buff *skb, st
if (tx_desc == (struct mlx4_en_tx_desc *) ring->bounce_buf)
tx_desc = mlx4_en_bounce_to_desc(priv, ring, index, desc_size);
- /* Run destructor before passing skb to HW */
- if (likely(!skb_shared(skb)))
- skb_orphan(skb);
-
/* Ensure new descirptor hits memory
* before setting ownership of this descriptor to HW */
wmb();
diff --git a/drivers/net/niu.c b/drivers/net/niu.c
--- a/drivers/net/niu.c
+++ b/drivers/net/niu.c
@@ -6667,8 +6667,7 @@ static int niu_start_xmit(struct sk_buff
}
kfree_skb(skb);
skb = skb_new;
- } else
- skb_orphan(skb);
+ }
align = ((unsigned long) skb->data & (16 - 1));
headroom = align + sizeof(struct tx_pkt_hdr);
diff --git a/drivers/net/veth.c b/drivers/net/veth.c
--- a/drivers/net/veth.c
+++ b/drivers/net/veth.c
@@ -155,8 +155,6 @@ static int veth_xmit(struct sk_buff *skb
struct veth_net_stats *stats, *rcv_stats;
int length, cpu;
- skb_orphan(skb);
-
priv = netdev_priv(dev);
rcv = priv->peer;
rcv_priv = netdev_priv(rcv);
diff --git a/drivers/net/wireless/libertas/tx.c
b/drivers/net/wireless/libertas/tx.c
--- a/drivers/net/wireless/libertas/tx.c
+++ b/drivers/net/wireless/libertas/tx.c
@@ -154,9 +154,6 @@ int lbs_hard_start_xmit(struct sk_buff *
if (priv->monitormode) {
/* Keep the skb to echo it back once Tx feedback is
received from FW */
- skb_orphan(skb);
-
- /* Keep the skb around for when we get feedback */
priv->currenttxskb = skb;
} else {
free:
diff --git a/net/core/dev.c b/net/core/dev.c
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -1677,6 +1677,10 @@ int dev_hard_start_xmit(struct sk_buff *
const struct net_device_ops *ops = dev->netdev_ops;
int rc;
+ /* Call destructor here. It's SMP-cache-friendly and avoids issues
+ * with drivers which can hold transmitted skbs for long times */
+ skb_orphan(skb);
+
if (likely(!skb->next)) {
if (!list_empty(&ptype_all))
dev_queue_xmit_nit(skb, dev);
@@ -1688,22 +1692,7 @@ int dev_hard_start_xmit(struct sk_buff *
goto gso;
}
- rc = ops->ndo_start_xmit(skb, dev);
- /*
- * TODO: if skb_orphan() was called by
- * dev->hard_start_xmit() (for example, the unmodified
- * igb driver does that; bnx2 doesn't), then
- * skb_tx_software_timestamp() will be unable to send
- * back the time stamp.
- *
- * How can this be prevented? Always create another
- * reference to the socket before calling
- * dev->hard_start_xmit()? Prevent that skb_orphan()
- * does anything in dev->hard_start_xmit() by clearing
- * the skb destructor before the call and restoring it
- * afterwards, then doing the skb_orphan() ourselves?
- */
- return rc;
+ return ops->ndo_start_xmit(skb, dev);
}
gso:
^ permalink raw reply [flat|nested] 46+ messages in thread
* Re: [PATCH 1/4] net: skb_orphan on dev_hard_start_xmit
2009-05-29 14:14 [PATCH 1/4] net: skb_orphan on dev_hard_start_xmit Rusty Russell
@ 2009-05-29 15:11 ` Eric Dumazet
2009-06-01 12:27 ` Rusty Russell
2009-05-29 15:11 ` Eric Dumazet
` (2 subsequent siblings)
3 siblings, 1 reply; 46+ messages in thread
From: Eric Dumazet @ 2009-05-29 15:11 UTC (permalink / raw)
To: Rusty Russell
Cc: netdev, virtualization, Divy Le Ray, Roland Dreier,
Pavel Emelianov, Dan Williams, libertas-dev
Rusty Russell a écrit :
> Various drivers do skb_orphan() because they do not free transmitted
> skbs in a timely manner (causing apps which hit their socket limits to
> stall, socket close to hang, etc.).
>
> DaveM points out that there are advantages to doing it generally (it's
> more likely to be on same CPU than after xmit), and I couldn't find
> any new starvation issues in simple benchmarking here.
>
>
If really no starvations are possible at all, I really wonder why some
guys added memory accounting to UDP flows. Maybe they dont run "simple
benchmarks" but real apps ? :)
For TCP, I agree your patch is a huge benefit, since its paced by remote ACKS
and window control, but an UDP sender will likely be able to saturate a link.
Maybe we can try to selectively call skb_orphan() only for tcp packets ?
I understand your motivations are the driver simplifications, so you
want all packets being orphaned... hmm...
This patch adds skb_orphan to the start of dev_hard_start_xmit(): it
> can be premature in the NETDEV_TX_BUSY case, but that's uncommon.
>
> I removed the drivers' skb_orphan(), though it's harmless.
>
> Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
> Cc: Divy Le Ray <divy@chelsio.com>
> Cc: Roland Dreier <rolandd@cisco.com>
> Cc: Pavel Emelianov <xemul@openvz.org>
> Cc: Dan Williams <dcbw@redhat.com>
> Cc: libertas-dev@lists.infradead.org
> ---
> drivers/net/cxgb3/sge.c | 27 ---------------------------
> drivers/net/loopback.c | 2 --
> drivers/net/mlx4/en_tx.c | 4 ----
> drivers/net/niu.c | 3 +--
> drivers/net/veth.c | 2 --
> drivers/net/wireless/libertas/tx.c | 3 ---
> net/core/dev.c | 21 +++++----------------
> 7 files changed, 6 insertions(+), 56 deletions(-)
>
> diff --git a/drivers/net/cxgb3/sge.c b/drivers/net/cxgb3/sge.c
> --- a/drivers/net/cxgb3/sge.c
> +++ b/drivers/net/cxgb3/sge.c
> @@ -1288,33 +1288,6 @@ int t3_eth_xmit(struct sk_buff *skb, str
> dev->trans_start = jiffies;
> spin_unlock(&q->lock);
>
> - /*
> - * We do not use Tx completion interrupts to free DMAd Tx packets.
> - * This is good for performamce but means that we rely on new Tx
> - * packets arriving to run the destructors of completed packets,
> - * which open up space in their sockets' send queues. Sometimes
> - * we do not get such new packets causing Tx to stall. A single
> - * UDP transmitter is a good example of this situation. We have
> - * a clean up timer that periodically reclaims completed packets
> - * but it doesn't run often enough (nor do we want it to) to prevent
> - * lengthy stalls. A solution to this problem is to run the
> - * destructor early, after the packet is queued but before it's DMAd.
> - * A cons is that we lie to socket memory accounting, but the amount
> - * of extra memory is reasonable (limited by the number of Tx
> - * descriptors), the packets do actually get freed quickly by new
> - * packets almost always, and for protocols like TCP that wait for
> - * acks to really free up the data the extra memory is even less.
> - * On the positive side we run the destructors on the sending CPU
> - * rather than on a potentially different completing CPU, usually a
> - * good thing. We also run them without holding our Tx queue lock,
> - * unlike what reclaim_completed_tx() would otherwise do.
> - *
> - * Run the destructor before telling the DMA engine about the packet
> - * to make sure it doesn't complete and get freed prematurely.
> - */
> - if (likely(!skb_shared(skb)))
> - skb_orphan(skb);
> -
> write_tx_pkt_wr(adap, skb, pi, pidx, gen, q, ndesc, compl);
> check_ring_tx_db(adap, q);
> return NETDEV_TX_OK;
> diff --git a/drivers/net/loopback.c b/drivers/net/loopback.c
> --- a/drivers/net/loopback.c
> +++ b/drivers/net/loopback.c
> @@ -72,8 +72,6 @@ static int loopback_xmit(struct sk_buff
> {
> struct pcpu_lstats *pcpu_lstats, *lb_stats;
>
> - skb_orphan(skb);
> -
> skb->protocol = eth_type_trans(skb,dev);
>
> /* it's OK to use per_cpu_ptr() because BHs are off */
> diff --git a/drivers/net/mlx4/en_tx.c b/drivers/net/mlx4/en_tx.c
> --- a/drivers/net/mlx4/en_tx.c
> +++ b/drivers/net/mlx4/en_tx.c
> @@ -807,10 +807,6 @@ int mlx4_en_xmit(struct sk_buff *skb, st
> if (tx_desc == (struct mlx4_en_tx_desc *) ring->bounce_buf)
> tx_desc = mlx4_en_bounce_to_desc(priv, ring, index, desc_size);
>
> - /* Run destructor before passing skb to HW */
> - if (likely(!skb_shared(skb)))
> - skb_orphan(skb);
> -
> /* Ensure new descirptor hits memory
> * before setting ownership of this descriptor to HW */
> wmb();
> diff --git a/drivers/net/niu.c b/drivers/net/niu.c
> --- a/drivers/net/niu.c
> +++ b/drivers/net/niu.c
> @@ -6667,8 +6667,7 @@ static int niu_start_xmit(struct sk_buff
> }
> kfree_skb(skb);
> skb = skb_new;
> - } else
> - skb_orphan(skb);
> + }
>
> align = ((unsigned long) skb->data & (16 - 1));
> headroom = align + sizeof(struct tx_pkt_hdr);
> diff --git a/drivers/net/veth.c b/drivers/net/veth.c
> --- a/drivers/net/veth.c
> +++ b/drivers/net/veth.c
> @@ -155,8 +155,6 @@ static int veth_xmit(struct sk_buff *skb
> struct veth_net_stats *stats, *rcv_stats;
> int length, cpu;
>
> - skb_orphan(skb);
> -
> priv = netdev_priv(dev);
> rcv = priv->peer;
> rcv_priv = netdev_priv(rcv);
> diff --git a/drivers/net/wireless/libertas/tx.c
> b/drivers/net/wireless/libertas/tx.c
> --- a/drivers/net/wireless/libertas/tx.c
> +++ b/drivers/net/wireless/libertas/tx.c
> @@ -154,9 +154,6 @@ int lbs_hard_start_xmit(struct sk_buff *
> if (priv->monitormode) {
> /* Keep the skb to echo it back once Tx feedback is
> received from FW */
> - skb_orphan(skb);
> -
> - /* Keep the skb around for when we get feedback */
> priv->currenttxskb = skb;
> } else {
> free:
> diff --git a/net/core/dev.c b/net/core/dev.c
> --- a/net/core/dev.c
> +++ b/net/core/dev.c
> @@ -1677,6 +1677,10 @@ int dev_hard_start_xmit(struct sk_buff *
> const struct net_device_ops *ops = dev->netdev_ops;
> int rc;
>
> + /* Call destructor here. It's SMP-cache-friendly and avoids issues
> + * with drivers which can hold transmitted skbs for long times */
> + skb_orphan(skb);
> +
> if (likely(!skb->next)) {
> if (!list_empty(&ptype_all))
> dev_queue_xmit_nit(skb, dev);
> @@ -1688,22 +1692,7 @@ int dev_hard_start_xmit(struct sk_buff *
> goto gso;
> }
>
> - rc = ops->ndo_start_xmit(skb, dev);
> - /*
> - * TODO: if skb_orphan() was called by
> - * dev->hard_start_xmit() (for example, the unmodified
> - * igb driver does that; bnx2 doesn't), then
> - * skb_tx_software_timestamp() will be unable to send
> - * back the time stamp.
> - *
> - * How can this be prevented? Always create another
> - * reference to the socket before calling
> - * dev->hard_start_xmit()? Prevent that skb_orphan()
> - * does anything in dev->hard_start_xmit() by clearing
> - * the skb destructor before the call and restoring it
> - * afterwards, then doing the skb_orphan() ourselves?
> - */
> - return rc;
> + return ops->ndo_start_xmit(skb, dev);
> }
>
> gso:
>
>
^ permalink raw reply [flat|nested] 46+ messages in thread
* Re: [PATCH 1/4] net: skb_orphan on dev_hard_start_xmit
2009-05-29 14:14 [PATCH 1/4] net: skb_orphan on dev_hard_start_xmit Rusty Russell
2009-05-29 15:11 ` Eric Dumazet
@ 2009-05-29 15:11 ` Eric Dumazet
2009-06-01 19:47 ` Patrick Ohly
2009-06-01 19:47 ` Patrick Ohly
3 siblings, 0 replies; 46+ messages in thread
From: Eric Dumazet @ 2009-05-29 15:11 UTC (permalink / raw)
To: Rusty Russell
Cc: libertas-dev, Dan Williams, netdev, virtualization,
Roland Dreier, Divy Le Ray, Pavel Emelianov
Rusty Russell a écrit :
> Various drivers do skb_orphan() because they do not free transmitted
> skbs in a timely manner (causing apps which hit their socket limits to
> stall, socket close to hang, etc.).
>
> DaveM points out that there are advantages to doing it generally (it's
> more likely to be on same CPU than after xmit), and I couldn't find
> any new starvation issues in simple benchmarking here.
>
>
If really no starvations are possible at all, I really wonder why some
guys added memory accounting to UDP flows. Maybe they dont run "simple
benchmarks" but real apps ? :)
For TCP, I agree your patch is a huge benefit, since its paced by remote ACKS
and window control, but an UDP sender will likely be able to saturate a link.
Maybe we can try to selectively call skb_orphan() only for tcp packets ?
I understand your motivations are the driver simplifications, so you
want all packets being orphaned... hmm...
This patch adds skb_orphan to the start of dev_hard_start_xmit(): it
> can be premature in the NETDEV_TX_BUSY case, but that's uncommon.
>
> I removed the drivers' skb_orphan(), though it's harmless.
>
> Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
> Cc: Divy Le Ray <divy@chelsio.com>
> Cc: Roland Dreier <rolandd@cisco.com>
> Cc: Pavel Emelianov <xemul@openvz.org>
> Cc: Dan Williams <dcbw@redhat.com>
> Cc: libertas-dev@lists.infradead.org
> ---
> drivers/net/cxgb3/sge.c | 27 ---------------------------
> drivers/net/loopback.c | 2 --
> drivers/net/mlx4/en_tx.c | 4 ----
> drivers/net/niu.c | 3 +--
> drivers/net/veth.c | 2 --
> drivers/net/wireless/libertas/tx.c | 3 ---
> net/core/dev.c | 21 +++++----------------
> 7 files changed, 6 insertions(+), 56 deletions(-)
>
> diff --git a/drivers/net/cxgb3/sge.c b/drivers/net/cxgb3/sge.c
> --- a/drivers/net/cxgb3/sge.c
> +++ b/drivers/net/cxgb3/sge.c
> @@ -1288,33 +1288,6 @@ int t3_eth_xmit(struct sk_buff *skb, str
> dev->trans_start = jiffies;
> spin_unlock(&q->lock);
>
> - /*
> - * We do not use Tx completion interrupts to free DMAd Tx packets.
> - * This is good for performamce but means that we rely on new Tx
> - * packets arriving to run the destructors of completed packets,
> - * which open up space in their sockets' send queues. Sometimes
> - * we do not get such new packets causing Tx to stall. A single
> - * UDP transmitter is a good example of this situation. We have
> - * a clean up timer that periodically reclaims completed packets
> - * but it doesn't run often enough (nor do we want it to) to prevent
> - * lengthy stalls. A solution to this problem is to run the
> - * destructor early, after the packet is queued but before it's DMAd.
> - * A cons is that we lie to socket memory accounting, but the amount
> - * of extra memory is reasonable (limited by the number of Tx
> - * descriptors), the packets do actually get freed quickly by new
> - * packets almost always, and for protocols like TCP that wait for
> - * acks to really free up the data the extra memory is even less.
> - * On the positive side we run the destructors on the sending CPU
> - * rather than on a potentially different completing CPU, usually a
> - * good thing. We also run them without holding our Tx queue lock,
> - * unlike what reclaim_completed_tx() would otherwise do.
> - *
> - * Run the destructor before telling the DMA engine about the packet
> - * to make sure it doesn't complete and get freed prematurely.
> - */
> - if (likely(!skb_shared(skb)))
> - skb_orphan(skb);
> -
> write_tx_pkt_wr(adap, skb, pi, pidx, gen, q, ndesc, compl);
> check_ring_tx_db(adap, q);
> return NETDEV_TX_OK;
> diff --git a/drivers/net/loopback.c b/drivers/net/loopback.c
> --- a/drivers/net/loopback.c
> +++ b/drivers/net/loopback.c
> @@ -72,8 +72,6 @@ static int loopback_xmit(struct sk_buff
> {
> struct pcpu_lstats *pcpu_lstats, *lb_stats;
>
> - skb_orphan(skb);
> -
> skb->protocol = eth_type_trans(skb,dev);
>
> /* it's OK to use per_cpu_ptr() because BHs are off */
> diff --git a/drivers/net/mlx4/en_tx.c b/drivers/net/mlx4/en_tx.c
> --- a/drivers/net/mlx4/en_tx.c
> +++ b/drivers/net/mlx4/en_tx.c
> @@ -807,10 +807,6 @@ int mlx4_en_xmit(struct sk_buff *skb, st
> if (tx_desc == (struct mlx4_en_tx_desc *) ring->bounce_buf)
> tx_desc = mlx4_en_bounce_to_desc(priv, ring, index, desc_size);
>
> - /* Run destructor before passing skb to HW */
> - if (likely(!skb_shared(skb)))
> - skb_orphan(skb);
> -
> /* Ensure new descirptor hits memory
> * before setting ownership of this descriptor to HW */
> wmb();
> diff --git a/drivers/net/niu.c b/drivers/net/niu.c
> --- a/drivers/net/niu.c
> +++ b/drivers/net/niu.c
> @@ -6667,8 +6667,7 @@ static int niu_start_xmit(struct sk_buff
> }
> kfree_skb(skb);
> skb = skb_new;
> - } else
> - skb_orphan(skb);
> + }
>
> align = ((unsigned long) skb->data & (16 - 1));
> headroom = align + sizeof(struct tx_pkt_hdr);
> diff --git a/drivers/net/veth.c b/drivers/net/veth.c
> --- a/drivers/net/veth.c
> +++ b/drivers/net/veth.c
> @@ -155,8 +155,6 @@ static int veth_xmit(struct sk_buff *skb
> struct veth_net_stats *stats, *rcv_stats;
> int length, cpu;
>
> - skb_orphan(skb);
> -
> priv = netdev_priv(dev);
> rcv = priv->peer;
> rcv_priv = netdev_priv(rcv);
> diff --git a/drivers/net/wireless/libertas/tx.c
> b/drivers/net/wireless/libertas/tx.c
> --- a/drivers/net/wireless/libertas/tx.c
> +++ b/drivers/net/wireless/libertas/tx.c
> @@ -154,9 +154,6 @@ int lbs_hard_start_xmit(struct sk_buff *
> if (priv->monitormode) {
> /* Keep the skb to echo it back once Tx feedback is
> received from FW */
> - skb_orphan(skb);
> -
> - /* Keep the skb around for when we get feedback */
> priv->currenttxskb = skb;
> } else {
> free:
> diff --git a/net/core/dev.c b/net/core/dev.c
> --- a/net/core/dev.c
> +++ b/net/core/dev.c
> @@ -1677,6 +1677,10 @@ int dev_hard_start_xmit(struct sk_buff *
> const struct net_device_ops *ops = dev->netdev_ops;
> int rc;
>
> + /* Call destructor here. It's SMP-cache-friendly and avoids issues
> + * with drivers which can hold transmitted skbs for long times */
> + skb_orphan(skb);
> +
> if (likely(!skb->next)) {
> if (!list_empty(&ptype_all))
> dev_queue_xmit_nit(skb, dev);
> @@ -1688,22 +1692,7 @@ int dev_hard_start_xmit(struct sk_buff *
> goto gso;
> }
>
> - rc = ops->ndo_start_xmit(skb, dev);
> - /*
> - * TODO: if skb_orphan() was called by
> - * dev->hard_start_xmit() (for example, the unmodified
> - * igb driver does that; bnx2 doesn't), then
> - * skb_tx_software_timestamp() will be unable to send
> - * back the time stamp.
> - *
> - * How can this be prevented? Always create another
> - * reference to the socket before calling
> - * dev->hard_start_xmit()? Prevent that skb_orphan()
> - * does anything in dev->hard_start_xmit() by clearing
> - * the skb destructor before the call and restoring it
> - * afterwards, then doing the skb_orphan() ourselves?
> - */
> - return rc;
> + return ops->ndo_start_xmit(skb, dev);
> }
>
> gso:
>
>
^ permalink raw reply [flat|nested] 46+ messages in thread
* Re: [PATCH 1/4] net: skb_orphan on dev_hard_start_xmit
2009-05-29 15:11 ` Eric Dumazet
@ 2009-06-01 12:27 ` Rusty Russell
2009-06-03 21:02 ` Eric Dumazet
2009-06-03 21:02 ` Eric Dumazet
0 siblings, 2 replies; 46+ messages in thread
From: Rusty Russell @ 2009-06-01 12:27 UTC (permalink / raw)
To: Eric Dumazet
Cc: libertas-dev, Dan Williams, netdev, virtualization,
Roland Dreier, Divy Le Ray, Pavel Emelianov
On Sat, 30 May 2009 12:41:00 am Eric Dumazet wrote:
> Rusty Russell a écrit :
> > DaveM points out that there are advantages to doing it generally (it's
> > more likely to be on same CPU than after xmit), and I couldn't find
> > any new starvation issues in simple benchmarking here.
>
> If really no starvations are possible at all, I really wonder why some
> guys added memory accounting to UDP flows. Maybe they dont run "simple
> benchmarks" but real apps ? :)
Well, without any accounting at all you could use quite a lot of memory as
there are many places packets can be queued.
> For TCP, I agree your patch is a huge benefit, since its paced by remote
> ACKS and window control
I doubt that. There'll be some cache friendliness, but I'm not sure it'll be
measurable, let alone "huge". It's the win to drivers which don't have a
timely and batching tx free mechanism which I aim for.
> , but an UDP sender will likely be able to saturate
> a link.
I couldn't see any difference in saturation here (with default scheduler and an
100MBit e1000e). Two reasons come to mind: firstly, only the hardware queue is
unregulated: the tx queue is still accounted. And when you add scheduling to
the mix, I can't in practice cause starvation of other senders.
Hope that clarifies,
Rusty.
^ permalink raw reply [flat|nested] 46+ messages in thread
* Re: [PATCH 1/4] net: skb_orphan on dev_hard_start_xmit
2009-05-29 14:14 [PATCH 1/4] net: skb_orphan on dev_hard_start_xmit Rusty Russell
` (2 preceding siblings ...)
2009-06-01 19:47 ` Patrick Ohly
@ 2009-06-01 19:47 ` Patrick Ohly
2009-06-02 7:25 ` David Miller
2009-06-02 7:25 ` David Miller
3 siblings, 2 replies; 46+ messages in thread
From: Patrick Ohly @ 2009-06-01 19:47 UTC (permalink / raw)
To: Rusty Russell
Cc: netdev, virtualization, Divy Le Ray, Roland Dreier,
Pavel Emelianov, Dan Williams, libertas-dev
On Fri, 2009-05-29 at 23:44 +0930, Rusty Russell wrote:
> This patch adds skb_orphan to the start of dev_hard_start_xmit(): it
> can be premature in the NETDEV_TX_BUSY case, but that's uncommon.
Would it be possible to make the new skb_orphan() at the start of
dev_hard_start_xmit() conditionally so that it is not executed for
packets that are to be time stamped?
As discussed before
(http://article.gmane.org/gmane.linux.network/121378/), the skb->sk
socket pointer is required for sending back the send time stamp from
inside the device driver. Calling skb_orphan() unconditionally as in
this patch would break the hardware time stamping of outgoing packets.
This should work without much overhead (skb_tx() expands to a lookup of
the skb_shared_info):
if (!skb_tx(skb)->flags)
skb_orphan(skb);
Instead of looking at the individual skb_shared_tx "hardware" and
"software" bits I'm just checking the whole byte here because it is
shorter and perhaps faster. Semantically the result is the same.
--
Best Regards, Patrick Ohly
The content of this message is my personal opinion only and although
I am an employee of Intel, the statements I make here in no way
represent Intel's position on the issue, nor am I authorized to speak
on behalf of Intel on this matter.
^ permalink raw reply [flat|nested] 46+ messages in thread
* Re: [PATCH 1/4] net: skb_orphan on dev_hard_start_xmit
2009-05-29 14:14 [PATCH 1/4] net: skb_orphan on dev_hard_start_xmit Rusty Russell
2009-05-29 15:11 ` Eric Dumazet
2009-05-29 15:11 ` Eric Dumazet
@ 2009-06-01 19:47 ` Patrick Ohly
2009-06-01 19:47 ` Patrick Ohly
3 siblings, 0 replies; 46+ messages in thread
From: Patrick Ohly @ 2009-06-01 19:47 UTC (permalink / raw)
To: Rusty Russell
Cc: libertas-dev, Dan Williams, netdev, virtualization,
Roland Dreier, Divy Le Ray, Pavel Emelianov
On Fri, 2009-05-29 at 23:44 +0930, Rusty Russell wrote:
> This patch adds skb_orphan to the start of dev_hard_start_xmit(): it
> can be premature in the NETDEV_TX_BUSY case, but that's uncommon.
Would it be possible to make the new skb_orphan() at the start of
dev_hard_start_xmit() conditionally so that it is not executed for
packets that are to be time stamped?
As discussed before
(http://article.gmane.org/gmane.linux.network/121378/), the skb->sk
socket pointer is required for sending back the send time stamp from
inside the device driver. Calling skb_orphan() unconditionally as in
this patch would break the hardware time stamping of outgoing packets.
This should work without much overhead (skb_tx() expands to a lookup of
the skb_shared_info):
if (!skb_tx(skb)->flags)
skb_orphan(skb);
Instead of looking at the individual skb_shared_tx "hardware" and
"software" bits I'm just checking the whole byte here because it is
shorter and perhaps faster. Semantically the result is the same.
--
Best Regards, Patrick Ohly
The content of this message is my personal opinion only and although
I am an employee of Intel, the statements I make here in no way
represent Intel's position on the issue, nor am I authorized to speak
on behalf of Intel on this matter.
^ permalink raw reply [flat|nested] 46+ messages in thread
* Re: [PATCH 1/4] net: skb_orphan on dev_hard_start_xmit
2009-06-01 19:47 ` Patrick Ohly
2009-06-02 7:25 ` David Miller
@ 2009-06-02 7:25 ` David Miller
2009-06-02 14:08 ` Rusty Russell
2009-06-02 14:08 ` Rusty Russell
1 sibling, 2 replies; 46+ messages in thread
From: David Miller @ 2009-06-02 7:25 UTC (permalink / raw)
To: patrick.ohly
Cc: rusty, netdev, virtualization, divy, rolandd, xemul, dcbw, libertas-dev
From: Patrick Ohly <patrick.ohly@intel.com>
Date: Mon, 01 Jun 2009 21:47:22 +0200
> On Fri, 2009-05-29 at 23:44 +0930, Rusty Russell wrote:
>> This patch adds skb_orphan to the start of dev_hard_start_xmit(): it
>> can be premature in the NETDEV_TX_BUSY case, but that's uncommon.
>
> Would it be possible to make the new skb_orphan() at the start of
> dev_hard_start_xmit() conditionally so that it is not executed for
> packets that are to be time stamped?
>
> As discussed before
> (http://article.gmane.org/gmane.linux.network/121378/), the skb->sk
> socket pointer is required for sending back the send time stamp from
> inside the device driver. Calling skb_orphan() unconditionally as in
> this patch would break the hardware time stamping of outgoing packets.
Indeed, we need to check that case, at a minimum.
And there are other potentially other problems. For example, I
wonder how this interacts with the new TX MMAP af_packet support
in net-next-2.6 :-/
^ permalink raw reply [flat|nested] 46+ messages in thread
* Re: [PATCH 1/4] net: skb_orphan on dev_hard_start_xmit
2009-06-01 19:47 ` Patrick Ohly
@ 2009-06-02 7:25 ` David Miller
2009-06-02 7:25 ` David Miller
1 sibling, 0 replies; 46+ messages in thread
From: David Miller @ 2009-06-02 7:25 UTC (permalink / raw)
To: patrick.ohly
Cc: libertas-dev, dcbw, netdev, virtualization, rolandd, divy, xemul
From: Patrick Ohly <patrick.ohly@intel.com>
Date: Mon, 01 Jun 2009 21:47:22 +0200
> On Fri, 2009-05-29 at 23:44 +0930, Rusty Russell wrote:
>> This patch adds skb_orphan to the start of dev_hard_start_xmit(): it
>> can be premature in the NETDEV_TX_BUSY case, but that's uncommon.
>
> Would it be possible to make the new skb_orphan() at the start of
> dev_hard_start_xmit() conditionally so that it is not executed for
> packets that are to be time stamped?
>
> As discussed before
> (http://article.gmane.org/gmane.linux.network/121378/), the skb->sk
> socket pointer is required for sending back the send time stamp from
> inside the device driver. Calling skb_orphan() unconditionally as in
> this patch would break the hardware time stamping of outgoing packets.
Indeed, we need to check that case, at a minimum.
And there are other potentially other problems. For example, I
wonder how this interacts with the new TX MMAP af_packet support
in net-next-2.6 :-/
^ permalink raw reply [flat|nested] 46+ messages in thread
* Re: [PATCH 1/4] net: skb_orphan on dev_hard_start_xmit
2009-06-02 7:25 ` David Miller
@ 2009-06-02 14:08 ` Rusty Russell
2009-06-03 0:14 ` David Miller
2009-06-02 14:08 ` Rusty Russell
1 sibling, 1 reply; 46+ messages in thread
From: Rusty Russell @ 2009-06-02 14:08 UTC (permalink / raw)
To: David Miller
Cc: patrick.ohly, netdev, virtualization, divy, rolandd, xemul, dcbw,
libertas-dev
On Tue, 2 Jun 2009 04:55:53 pm David Miller wrote:
> From: Patrick Ohly <patrick.ohly@intel.com>
> Date: Mon, 01 Jun 2009 21:47:22 +0200
>
> > On Fri, 2009-05-29 at 23:44 +0930, Rusty Russell wrote:
> >> This patch adds skb_orphan to the start of dev_hard_start_xmit(): it
> >> can be premature in the NETDEV_TX_BUSY case, but that's uncommon.
> >
> > Would it be possible to make the new skb_orphan() at the start of
> > dev_hard_start_xmit() conditionally so that it is not executed for
> > packets that are to be time stamped?
> >
> > As discussed before
> > (http://article.gmane.org/gmane.linux.network/121378/), the skb->sk
> > socket pointer is required for sending back the send time stamp from
> > inside the device driver. Calling skb_orphan() unconditionally as in
> > this patch would break the hardware time stamping of outgoing packets.
>
> Indeed, we need to check that case, at a minimum.
>
> And there are other potentially other problems. For example, I
> wonder how this interacts with the new TX MMAP af_packet support
> in net-next-2.6 :-/
I think I'll do this in the driver for now, and let's revisit doing it
generically later?
Thanks,
Rusty.
^ permalink raw reply [flat|nested] 46+ messages in thread
* Re: [PATCH 1/4] net: skb_orphan on dev_hard_start_xmit
2009-06-02 7:25 ` David Miller
2009-06-02 14:08 ` Rusty Russell
@ 2009-06-02 14:08 ` Rusty Russell
1 sibling, 0 replies; 46+ messages in thread
From: Rusty Russell @ 2009-06-02 14:08 UTC (permalink / raw)
To: David Miller
Cc: rolandd, libertas-dev, dcbw, netdev, virtualization,
patrick.ohly, divy, xemul
On Tue, 2 Jun 2009 04:55:53 pm David Miller wrote:
> From: Patrick Ohly <patrick.ohly@intel.com>
> Date: Mon, 01 Jun 2009 21:47:22 +0200
>
> > On Fri, 2009-05-29 at 23:44 +0930, Rusty Russell wrote:
> >> This patch adds skb_orphan to the start of dev_hard_start_xmit(): it
> >> can be premature in the NETDEV_TX_BUSY case, but that's uncommon.
> >
> > Would it be possible to make the new skb_orphan() at the start of
> > dev_hard_start_xmit() conditionally so that it is not executed for
> > packets that are to be time stamped?
> >
> > As discussed before
> > (http://article.gmane.org/gmane.linux.network/121378/), the skb->sk
> > socket pointer is required for sending back the send time stamp from
> > inside the device driver. Calling skb_orphan() unconditionally as in
> > this patch would break the hardware time stamping of outgoing packets.
>
> Indeed, we need to check that case, at a minimum.
>
> And there are other potentially other problems. For example, I
> wonder how this interacts with the new TX MMAP af_packet support
> in net-next-2.6 :-/
I think I'll do this in the driver for now, and let's revisit doing it
generically later?
Thanks,
Rusty.
^ permalink raw reply [flat|nested] 46+ messages in thread
* Re: [PATCH 1/4] net: skb_orphan on dev_hard_start_xmit
2009-06-02 14:08 ` Rusty Russell
@ 2009-06-03 0:14 ` David Miller
2009-07-03 7:55 ` Herbert Xu
2009-07-03 7:55 ` Herbert Xu
0 siblings, 2 replies; 46+ messages in thread
From: David Miller @ 2009-06-03 0:14 UTC (permalink / raw)
To: rusty
Cc: rolandd, libertas-dev, dcbw, netdev, virtualization,
patrick.ohly, divy, xemul
From: Rusty Russell <rusty@rustcorp.com.au>
Date: Tue, 2 Jun 2009 23:38:29 +0930
> On Tue, 2 Jun 2009 04:55:53 pm David Miller wrote:
>> From: Patrick Ohly <patrick.ohly@intel.com>
>> Date: Mon, 01 Jun 2009 21:47:22 +0200
>>
>> > On Fri, 2009-05-29 at 23:44 +0930, Rusty Russell wrote:
>> >> This patch adds skb_orphan to the start of dev_hard_start_xmit(): it
>> >> can be premature in the NETDEV_TX_BUSY case, but that's uncommon.
>> >
>> > Would it be possible to make the new skb_orphan() at the start of
>> > dev_hard_start_xmit() conditionally so that it is not executed for
>> > packets that are to be time stamped?
>> >
>> > As discussed before
>> > (http://article.gmane.org/gmane.linux.network/121378/), the skb->sk
>> > socket pointer is required for sending back the send time stamp from
>> > inside the device driver. Calling skb_orphan() unconditionally as in
>> > this patch would break the hardware time stamping of outgoing packets.
>>
>> Indeed, we need to check that case, at a minimum.
>>
>> And there are other potentially other problems. For example, I
>> wonder how this interacts with the new TX MMAP af_packet support
>> in net-next-2.6 :-/
>
> I think I'll do this in the driver for now, and let's revisit doing it
> generically later?
That might be the best course of action for the time being.
This whole area is a rat's nest.
^ permalink raw reply [flat|nested] 46+ messages in thread
* Re: [PATCH 1/4] net: skb_orphan on dev_hard_start_xmit
2009-06-01 12:27 ` Rusty Russell
2009-06-03 21:02 ` Eric Dumazet
@ 2009-06-03 21:02 ` Eric Dumazet
2009-06-04 3:54 ` Rusty Russell
2009-06-04 3:54 ` Rusty Russell
1 sibling, 2 replies; 46+ messages in thread
From: Eric Dumazet @ 2009-06-03 21:02 UTC (permalink / raw)
To: Rusty Russell
Cc: netdev, virtualization, Divy Le Ray, Roland Dreier,
Pavel Emelianov, Dan Williams, libertas-dev
Rusty Russell a écrit :
> On Sat, 30 May 2009 12:41:00 am Eric Dumazet wrote:
>> Rusty Russell a écrit :
>>> DaveM points out that there are advantages to doing it generally (it's
>>> more likely to be on same CPU than after xmit), and I couldn't find
>>> any new starvation issues in simple benchmarking here.
>> If really no starvations are possible at all, I really wonder why some
>> guys added memory accounting to UDP flows. Maybe they dont run "simple
>> benchmarks" but real apps ? :)
>
> Well, without any accounting at all you could use quite a lot of memory as
> there are many places packets can be queued.
>
>> For TCP, I agree your patch is a huge benefit, since its paced by remote
>> ACKS and window control
>
> I doubt that. There'll be some cache friendliness, but I'm not sure it'll be
> measurable, let alone "huge". It's the win to drivers which don't have a
> timely and batching tx free mechanism which I aim for.
At 250.000 packets/second on a Gigabit link, this is huge, I can tell you.
(250.000 incoming packets and 250.000 outgoing packets per second, 700 Mbit/s)
According to this oprofile on CPU0 (dedicated to softirqs on one bnx2 eth adapter)
We can see sock_wfree() being number 2 on the profile, because it touches three cache lines per socket and
transmited packet in TX completion handler.
Also, taking a reference on socket for each xmit packet in flight is very expensive, since it slows
down receiver in __udp4_lib_lookup(). Several cpus are fighting for sk->refcnt cache line.
CPU: Core 2, speed 3000.24 MHz (estimated)
Counted CPU_CLK_UNHALTED events (Clock cycles when not halted) with a unit mask of 0x00 (Unhalted core cycles) count 100000
samples cum. samples % cum. % symbol name
21215 21215 11.8847 11.8847 bnx2_poll_work
17239 38454 9.6573 21.5420 sock_wfree << effect of udp memory accounting >>
14817 53271 8.3005 29.8425 __slab_free
14635 67906 8.1986 38.0411 __udp4_lib_lookup
11425 79331 6.4003 44.4414 __alloc_skb
9710 89041 5.4396 49.8810 __slab_alloc
8095 97136 4.5348 54.4158 __udp4_lib_rcv
7831 104967 4.3869 58.8027 sock_def_write_space
7586 112553 4.2497 63.0524 ip_rcv
7518 120071 4.2116 67.2640 skb_dma_unmap
6711 126782 3.7595 71.0235 netif_receive_skb
6272 133054 3.5136 74.5371 udp_queue_rcv_skb
5262 138316 2.9478 77.4849 skb_release_data
5023 143339 2.8139 80.2988 __kmalloc_track_caller
4070 147409 2.2800 82.5788 kmem_cache_alloc
3216 150625 1.8016 84.3804 ipt_do_table
2576 153201 1.4431 85.8235 skb_queue_tail
^ permalink raw reply [flat|nested] 46+ messages in thread
* Re: [PATCH 1/4] net: skb_orphan on dev_hard_start_xmit
2009-06-01 12:27 ` Rusty Russell
@ 2009-06-03 21:02 ` Eric Dumazet
2009-06-03 21:02 ` Eric Dumazet
1 sibling, 0 replies; 46+ messages in thread
From: Eric Dumazet @ 2009-06-03 21:02 UTC (permalink / raw)
To: Rusty Russell
Cc: libertas-dev, Dan Williams, netdev, virtualization,
Roland Dreier, Divy Le Ray, Pavel Emelianov
Rusty Russell a écrit :
> On Sat, 30 May 2009 12:41:00 am Eric Dumazet wrote:
>> Rusty Russell a écrit :
>>> DaveM points out that there are advantages to doing it generally (it's
>>> more likely to be on same CPU than after xmit), and I couldn't find
>>> any new starvation issues in simple benchmarking here.
>> If really no starvations are possible at all, I really wonder why some
>> guys added memory accounting to UDP flows. Maybe they dont run "simple
>> benchmarks" but real apps ? :)
>
> Well, without any accounting at all you could use quite a lot of memory as
> there are many places packets can be queued.
>
>> For TCP, I agree your patch is a huge benefit, since its paced by remote
>> ACKS and window control
>
> I doubt that. There'll be some cache friendliness, but I'm not sure it'll be
> measurable, let alone "huge". It's the win to drivers which don't have a
> timely and batching tx free mechanism which I aim for.
At 250.000 packets/second on a Gigabit link, this is huge, I can tell you.
(250.000 incoming packets and 250.000 outgoing packets per second, 700 Mbit/s)
According to this oprofile on CPU0 (dedicated to softirqs on one bnx2 eth adapter)
We can see sock_wfree() being number 2 on the profile, because it touches three cache lines per socket and
transmited packet in TX completion handler.
Also, taking a reference on socket for each xmit packet in flight is very expensive, since it slows
down receiver in __udp4_lib_lookup(). Several cpus are fighting for sk->refcnt cache line.
CPU: Core 2, speed 3000.24 MHz (estimated)
Counted CPU_CLK_UNHALTED events (Clock cycles when not halted) with a unit mask of 0x00 (Unhalted core cycles) count 100000
samples cum. samples % cum. % symbol name
21215 21215 11.8847 11.8847 bnx2_poll_work
17239 38454 9.6573 21.5420 sock_wfree << effect of udp memory accounting >>
14817 53271 8.3005 29.8425 __slab_free
14635 67906 8.1986 38.0411 __udp4_lib_lookup
11425 79331 6.4003 44.4414 __alloc_skb
9710 89041 5.4396 49.8810 __slab_alloc
8095 97136 4.5348 54.4158 __udp4_lib_rcv
7831 104967 4.3869 58.8027 sock_def_write_space
7586 112553 4.2497 63.0524 ip_rcv
7518 120071 4.2116 67.2640 skb_dma_unmap
6711 126782 3.7595 71.0235 netif_receive_skb
6272 133054 3.5136 74.5371 udp_queue_rcv_skb
5262 138316 2.9478 77.4849 skb_release_data
5023 143339 2.8139 80.2988 __kmalloc_track_caller
4070 147409 2.2800 82.5788 kmem_cache_alloc
3216 150625 1.8016 84.3804 ipt_do_table
2576 153201 1.4431 85.8235 skb_queue_tail
^ permalink raw reply [flat|nested] 46+ messages in thread
* Re: [PATCH 1/4] net: skb_orphan on dev_hard_start_xmit
2009-06-03 21:02 ` Eric Dumazet
@ 2009-06-04 3:54 ` Rusty Russell
2009-06-04 4:00 ` David Miller
2009-06-04 4:00 ` David Miller
2009-06-04 3:54 ` Rusty Russell
1 sibling, 2 replies; 46+ messages in thread
From: Rusty Russell @ 2009-06-04 3:54 UTC (permalink / raw)
To: Eric Dumazet
Cc: netdev, virtualization, Divy Le Ray, Roland Dreier,
Pavel Emelianov, Dan Williams, libertas-dev
On Thu, 4 Jun 2009 06:32:53 am Eric Dumazet wrote:
> Rusty Russell a écrit :
> > On Sat, 30 May 2009 12:41:00 am Eric Dumazet wrote:
> >> For TCP, I agree your patch is a huge benefit, since its paced by remote
> >> ACKS and window control
> >
> > I doubt that. There'll be some cache friendliness, but I'm not sure
> > it'll be measurable, let alone "huge".
...
> We can see sock_wfree() being number 2 on the profile, because it touches
> three cache lines per socket and transmited packet in TX completion
> handler.
Interesting, I take it back: got some "after" stats as well?
> Also, taking a reference on socket for each xmit packet in flight is very
> expensive, since it slows down receiver in __udp4_lib_lookup(). Several
> cpus are fighting for sk->refcnt cache line.
Now we have decent dynamic per-cpu, we can finally implement bigrefs. More
obvious for device counts than sockets, but perhaps applicable here as well?
Rusty.
^ permalink raw reply [flat|nested] 46+ messages in thread
* Re: [PATCH 1/4] net: skb_orphan on dev_hard_start_xmit
2009-06-03 21:02 ` Eric Dumazet
2009-06-04 3:54 ` Rusty Russell
@ 2009-06-04 3:54 ` Rusty Russell
1 sibling, 0 replies; 46+ messages in thread
From: Rusty Russell @ 2009-06-04 3:54 UTC (permalink / raw)
To: Eric Dumazet
Cc: libertas-dev, Dan Williams, netdev, virtualization,
Roland Dreier, Divy Le Ray, Pavel Emelianov
On Thu, 4 Jun 2009 06:32:53 am Eric Dumazet wrote:
> Rusty Russell a écrit :
> > On Sat, 30 May 2009 12:41:00 am Eric Dumazet wrote:
> >> For TCP, I agree your patch is a huge benefit, since its paced by remote
> >> ACKS and window control
> >
> > I doubt that. There'll be some cache friendliness, but I'm not sure
> > it'll be measurable, let alone "huge".
...
> We can see sock_wfree() being number 2 on the profile, because it touches
> three cache lines per socket and transmited packet in TX completion
> handler.
Interesting, I take it back: got some "after" stats as well?
> Also, taking a reference on socket for each xmit packet in flight is very
> expensive, since it slows down receiver in __udp4_lib_lookup(). Several
> cpus are fighting for sk->refcnt cache line.
Now we have decent dynamic per-cpu, we can finally implement bigrefs. More
obvious for device counts than sockets, but perhaps applicable here as well?
Rusty.
^ permalink raw reply [flat|nested] 46+ messages in thread
* Re: [PATCH 1/4] net: skb_orphan on dev_hard_start_xmit
2009-06-04 3:54 ` Rusty Russell
2009-06-04 4:00 ` David Miller
@ 2009-06-04 4:00 ` David Miller
2009-06-04 4:54 ` Eric Dumazet
1 sibling, 1 reply; 46+ messages in thread
From: David Miller @ 2009-06-04 4:00 UTC (permalink / raw)
To: rusty
Cc: eric.dumazet, netdev, virtualization, divy, rolandd, xemul, dcbw,
libertas-dev
From: Rusty Russell <rusty@rustcorp.com.au>
Date: Thu, 4 Jun 2009 13:24:57 +0930
> On Thu, 4 Jun 2009 06:32:53 am Eric Dumazet wrote:
>> Also, taking a reference on socket for each xmit packet in flight is very
>> expensive, since it slows down receiver in __udp4_lib_lookup(). Several
>> cpus are fighting for sk->refcnt cache line.
>
> Now we have decent dynamic per-cpu, we can finally implement bigrefs. More
> obvious for device counts than sockets, but perhaps applicable here as well?
It might be very beneficial for longer lasting, active, connections, but
for high connection rates it's going to be a lose in my estimation.
^ permalink raw reply [flat|nested] 46+ messages in thread
* Re: [PATCH 1/4] net: skb_orphan on dev_hard_start_xmit
2009-06-04 3:54 ` Rusty Russell
@ 2009-06-04 4:00 ` David Miller
2009-06-04 4:00 ` David Miller
1 sibling, 0 replies; 46+ messages in thread
From: David Miller @ 2009-06-04 4:00 UTC (permalink / raw)
To: rusty
Cc: eric.dumazet, libertas-dev, dcbw, netdev, virtualization,
rolandd, divy, xemul
From: Rusty Russell <rusty@rustcorp.com.au>
Date: Thu, 4 Jun 2009 13:24:57 +0930
> On Thu, 4 Jun 2009 06:32:53 am Eric Dumazet wrote:
>> Also, taking a reference on socket for each xmit packet in flight is very
>> expensive, since it slows down receiver in __udp4_lib_lookup(). Several
>> cpus are fighting for sk->refcnt cache line.
>
> Now we have decent dynamic per-cpu, we can finally implement bigrefs. More
> obvious for device counts than sockets, but perhaps applicable here as well?
It might be very beneficial for longer lasting, active, connections, but
for high connection rates it's going to be a lose in my estimation.
^ permalink raw reply [flat|nested] 46+ messages in thread
* Re: [PATCH 1/4] net: skb_orphan on dev_hard_start_xmit
2009-06-04 4:00 ` David Miller
@ 2009-06-04 4:54 ` Eric Dumazet
2009-06-04 4:56 ` David Miller
2009-06-04 4:56 ` [PATCH 1/4] net: skb_orphan on dev_hard_start_xmit David Miller
0 siblings, 2 replies; 46+ messages in thread
From: Eric Dumazet @ 2009-06-04 4:54 UTC (permalink / raw)
To: David Miller
Cc: libertas-dev, dcbw, netdev, virtualization, rolandd, divy, xemul
David Miller a écrit :
> From: Rusty Russell <rusty@rustcorp.com.au>
> Date: Thu, 4 Jun 2009 13:24:57 +0930
>
>> On Thu, 4 Jun 2009 06:32:53 am Eric Dumazet wrote:
>>> Also, taking a reference on socket for each xmit packet in flight is very
>>> expensive, since it slows down receiver in __udp4_lib_lookup(). Several
>>> cpus are fighting for sk->refcnt cache line.
>> Now we have decent dynamic per-cpu, we can finally implement bigrefs. More
>> obvious for device counts than sockets, but perhaps applicable here as well?
>
> It might be very beneficial for longer lasting, active, connections, but
> for high connection rates it's going to be a lose in my estimation.
Agreed.
We also can avoid the sock_put()/sock_hold() pair for each tx packet,
to only touch sk_wmem_alloc (with appropriate atomic_sub_return() in sock_wfree()
and atomic_dec_test in sk_free
We could initialize sk->sk_wmem_alloc to one instead of 0, so that
sock_wfree() could just synchronize itself with sk_free()
void sk_free(struct sock *sk)
{
if (atomic_dec_test(&sk->sk_wmem_alloc))
__sk_free(sk)
}
static inline void skb_set_owner_w(struct sk_buff *skb, struct sock *sk)
{
- sock_hold(sk);
skb->sk = sk;
skb->destructor = sock_wfree;
atomic_add(skb->truesize, &sk->sk_wmem_alloc);
}
void sock_wfree(struct sk_buff *skb)
{
struct sock *sk = skb->sk;
+ int res;
/* In case it might be waiting for more memory. */
- atomic_sub(skb->truesize, &sk->sk_wmem_alloc);
+ res = atomic_sub_return(skb->truesize, &sk->sk_wmem_alloc);
if (!sock_flag(sk, SOCK_USE_WRITE_QUEUE))
sk->sk_write_space(sk);
- sock_put(sk);
+ if (res == 0)
+ __sk_free(sk);
}
Patch will follow after some testing
^ permalink raw reply [flat|nested] 46+ messages in thread
* Re: [PATCH 1/4] net: skb_orphan on dev_hard_start_xmit
2009-06-04 4:54 ` Eric Dumazet
@ 2009-06-04 4:56 ` David Miller
2009-06-04 9:18 ` [PATCH] net: No more expensive sock_hold()/sock_put() on each tx Eric Dumazet
2009-06-04 4:56 ` [PATCH 1/4] net: skb_orphan on dev_hard_start_xmit David Miller
1 sibling, 1 reply; 46+ messages in thread
From: David Miller @ 2009-06-04 4:56 UTC (permalink / raw)
To: eric.dumazet
Cc: rusty, netdev, virtualization, divy, rolandd, xemul, dcbw, libertas-dev
From: Eric Dumazet <eric.dumazet@gmail.com>
Date: Thu, 04 Jun 2009 06:54:24 +0200
> We also can avoid the sock_put()/sock_hold() pair for each tx packet,
> to only touch sk_wmem_alloc (with appropriate atomic_sub_return() in sock_wfree()
> and atomic_dec_test in sk_free
>
> We could initialize sk->sk_wmem_alloc to one instead of 0, so that
> sock_wfree() could just synchronize itself with sk_free()
Excellent idea Eric.
> Patch will follow after some testing
I look forward to it :-)
^ permalink raw reply [flat|nested] 46+ messages in thread
* Re: [PATCH 1/4] net: skb_orphan on dev_hard_start_xmit
2009-06-04 4:54 ` Eric Dumazet
2009-06-04 4:56 ` David Miller
@ 2009-06-04 4:56 ` David Miller
1 sibling, 0 replies; 46+ messages in thread
From: David Miller @ 2009-06-04 4:56 UTC (permalink / raw)
To: eric.dumazet
Cc: libertas-dev, dcbw, netdev, virtualization, rolandd, divy, xemul
From: Eric Dumazet <eric.dumazet@gmail.com>
Date: Thu, 04 Jun 2009 06:54:24 +0200
> We also can avoid the sock_put()/sock_hold() pair for each tx packet,
> to only touch sk_wmem_alloc (with appropriate atomic_sub_return() in sock_wfree()
> and atomic_dec_test in sk_free
>
> We could initialize sk->sk_wmem_alloc to one instead of 0, so that
> sock_wfree() could just synchronize itself with sk_free()
Excellent idea Eric.
> Patch will follow after some testing
I look forward to it :-)
^ permalink raw reply [flat|nested] 46+ messages in thread
* [PATCH] net: No more expensive sock_hold()/sock_put() on each tx
2009-06-04 4:56 ` David Miller
@ 2009-06-04 9:18 ` Eric Dumazet
2009-06-04 9:26 ` David Miller
2009-06-10 8:17 ` David Miller
0 siblings, 2 replies; 46+ messages in thread
From: Eric Dumazet @ 2009-06-04 9:18 UTC (permalink / raw)
To: David Miller; +Cc: rusty, netdev
David Miller a écrit :
> From: Eric Dumazet <eric.dumazet@gmail.com>
> Date: Thu, 04 Jun 2009 06:54:24 +0200
>
>> We also can avoid the sock_put()/sock_hold() pair for each tx packet,
>> to only touch sk_wmem_alloc (with appropriate atomic_sub_return() in sock_wfree()
>> and atomic_dec_test in sk_free
>>
>> We could initialize sk->sk_wmem_alloc to one instead of 0, so that
>> sock_wfree() could just synchronize itself with sk_free()
>
> Excellent idea Eric.
Thanks !
>
>> Patch will follow after some testing
>
> I look forward to it :-)
Here it is, based on net-next-2.6
CC list trimmed down...
Next step would be to get rid of sk_callback_lock, I dont remember
if we already discussed about that rwlock, after RCUification of sockets...
Thanks
[PATCH] net: No more expensive sock_hold()/sock_put() on each tx
One of the problem with sock memory accounting is it uses
a pair of sock_hold()/sock_put() for each transmitted packet.
This slows down bidirectional flows because the receive path
also needs to take a refcount on socket and might use a different
cpu than transmit path or transmit completion path. So these
two atomic operations also trigger cache line bounces.
We can see this in tx or tx/rx workloads (media gateways for example),
where sock_wfree() can be in top five functions in profiles.
We use this sock_hold()/sock_put() so that sock freeing
is delayed until all tx packets are completed.
As we also update sk_wmem_alloc, we could offset sk_wmem_alloc
by one unit at init time, until sk_free() is called.
Once sk_free() is called, we atomic_dec_and_test(sk_wmem_alloc)
to decrement initial offset and atomicaly check if any packets
are in flight.
skb_set_owner_w() doesnt call sock_hold() anymore
sock_wfree() doesnt call sock_put() anymore, but check if sk_wmem_alloc
reached 0 to perform the final freeing.
Drawback is that a skb->truesize error could lead to unfreeable sockets, or
even worse, prematurely calling __sk_free() on a live socket.
Nice speedups on SMP. tbench for example, going from 2691 MB/s to 2711 MB/s
on my 8 cpu dev machine, even if tbench was not really hitting sk_refcnt
contention point. 5 % speedup on a UDP transmit workload (depends
on number of flows), lowering TX completion cpu usage.
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
---
include/net/sock.h | 6 +++++-
net/core/sock.c | 29 +++++++++++++++++++++++++----
net/ipv4/ip_output.c | 1 -
net/ipv6/ip6_output.c | 1 -
4 files changed, 30 insertions(+), 7 deletions(-)
diff --git a/include/net/sock.h b/include/net/sock.h
index 4bb1ff9..010e14a 100644
--- a/include/net/sock.h
+++ b/include/net/sock.h
@@ -1217,9 +1217,13 @@ static inline int skb_copy_to_page(struct sock *sk, char __user *from,
static inline void skb_set_owner_w(struct sk_buff *skb, struct sock *sk)
{
- sock_hold(sk);
skb->sk = sk;
skb->destructor = sock_wfree;
+ /*
+ * We used to take a refcount on sk, but following operation
+ * is enough to guarantee sk_free() wont free this sock until
+ * all in-flight packets are completed
+ */
atomic_add(skb->truesize, &sk->sk_wmem_alloc);
}
diff --git a/net/core/sock.c b/net/core/sock.c
index 58dec9d..ce0159a 100644
--- a/net/core/sock.c
+++ b/net/core/sock.c
@@ -1005,7 +1005,7 @@ struct sock *sk_alloc(struct net *net, int family, gfp_t priority,
}
EXPORT_SYMBOL(sk_alloc);
-void sk_free(struct sock *sk)
+static void __sk_free(struct sock *sk)
{
struct sk_filter *filter;
@@ -1028,6 +1028,17 @@ void sk_free(struct sock *sk)
put_net(sock_net(sk));
sk_prot_free(sk->sk_prot_creator, sk);
}
+
+void sk_free(struct sock *sk)
+{
+ /*
+ * We substract one from sk_wmem_alloc and can know if
+ * some packets are still in some tx queue.
+ * If not null, sock_wfree() will call __sk_free(sk) later
+ */
+ if (atomic_dec_and_test(&sk->sk_wmem_alloc))
+ __sk_free(sk);
+}
EXPORT_SYMBOL(sk_free);
/*
@@ -1068,7 +1079,10 @@ struct sock *sk_clone(const struct sock *sk, const gfp_t priority)
newsk->sk_backlog.head = newsk->sk_backlog.tail = NULL;
atomic_set(&newsk->sk_rmem_alloc, 0);
- atomic_set(&newsk->sk_wmem_alloc, 0);
+ /*
+ * sk_wmem_alloc set to one (see sk_free() and sock_wfree())
+ */
+ atomic_set(&newsk->sk_wmem_alloc, 1);
atomic_set(&newsk->sk_omem_alloc, 0);
skb_queue_head_init(&newsk->sk_receive_queue);
skb_queue_head_init(&newsk->sk_write_queue);
@@ -1172,12 +1186,18 @@ void __init sk_init(void)
void sock_wfree(struct sk_buff *skb)
{
struct sock *sk = skb->sk;
+ int res;
/* In case it might be waiting for more memory. */
- atomic_sub(skb->truesize, &sk->sk_wmem_alloc);
+ res = atomic_sub_return(skb->truesize, &sk->sk_wmem_alloc);
if (!sock_flag(sk, SOCK_USE_WRITE_QUEUE))
sk->sk_write_space(sk);
- sock_put(sk);
+ /*
+ * if sk_wmem_alloc reached 0, we are last user and should
+ * free this sock, as sk_free() call could not do it.
+ */
+ if (res == 0)
+ __sk_free(sk);
}
EXPORT_SYMBOL(sock_wfree);
@@ -1816,6 +1836,7 @@ void sock_init_data(struct socket *sock, struct sock *sk)
sk->sk_stamp = ktime_set(-1L, 0);
atomic_set(&sk->sk_refcnt, 1);
+ atomic_set(&sk->sk_wmem_alloc, 1);
atomic_set(&sk->sk_drops, 0);
}
EXPORT_SYMBOL(sock_init_data);
diff --git a/net/ipv4/ip_output.c b/net/ipv4/ip_output.c
index 3d6167f..badbfde 100644
--- a/net/ipv4/ip_output.c
+++ b/net/ipv4/ip_output.c
@@ -498,7 +498,6 @@ int ip_fragment(struct sk_buff *skb, int (*output)(struct sk_buff *))
BUG_ON(frag->sk);
if (skb->sk) {
- sock_hold(skb->sk);
frag->sk = skb->sk;
frag->destructor = sock_wfree;
truesizes += frag->truesize;
diff --git a/net/ipv6/ip6_output.c b/net/ipv6/ip6_output.c
index c8dc8e5..18b9630 100644
--- a/net/ipv6/ip6_output.c
+++ b/net/ipv6/ip6_output.c
@@ -680,7 +680,6 @@ static int ip6_fragment(struct sk_buff *skb, int (*output)(struct sk_buff *))
BUG_ON(frag->sk);
if (skb->sk) {
- sock_hold(skb->sk);
frag->sk = skb->sk;
frag->destructor = sock_wfree;
truesizes += frag->truesize;
^ permalink raw reply related [flat|nested] 46+ messages in thread
* Re: [PATCH] net: No more expensive sock_hold()/sock_put() on each tx
2009-06-04 9:18 ` [PATCH] net: No more expensive sock_hold()/sock_put() on each tx Eric Dumazet
@ 2009-06-04 9:26 ` David Miller
2009-06-10 8:17 ` David Miller
1 sibling, 0 replies; 46+ messages in thread
From: David Miller @ 2009-06-04 9:26 UTC (permalink / raw)
To: eric.dumazet; +Cc: rusty, netdev
From: Eric Dumazet <eric.dumazet@gmail.com>
Date: Thu, 04 Jun 2009 11:18:35 +0200
> Next step would be to get rid of sk_callback_lock, I dont remember
> if we already discussed about that rwlock, after RCUification of sockets...
We discussed it, but the RCU'ification of core socket destruction
is pretty expensive. Just do any simple connection rate based
test, like lat_connection from lmbench, and the effects are clear.
^ permalink raw reply [flat|nested] 46+ messages in thread
* Re: [PATCH] net: No more expensive sock_hold()/sock_put() on each tx
2009-06-04 9:18 ` [PATCH] net: No more expensive sock_hold()/sock_put() on each tx Eric Dumazet
2009-06-04 9:26 ` David Miller
@ 2009-06-10 8:17 ` David Miller
2009-06-10 8:30 ` Eric Dumazet
1 sibling, 1 reply; 46+ messages in thread
From: David Miller @ 2009-06-10 8:17 UTC (permalink / raw)
To: eric.dumazet; +Cc: rusty, netdev
From: Eric Dumazet <eric.dumazet@gmail.com>
Date: Thu, 04 Jun 2009 11:18:35 +0200
> @@ -1172,12 +1186,18 @@ void __init sk_init(void)
> void sock_wfree(struct sk_buff *skb)
> {
> struct sock *sk = skb->sk;
> + int res;
>
> /* In case it might be waiting for more memory. */
> - atomic_sub(skb->truesize, &sk->sk_wmem_alloc);
> + res = atomic_sub_return(skb->truesize, &sk->sk_wmem_alloc);
> if (!sock_flag(sk, SOCK_USE_WRITE_QUEUE))
> sk->sk_write_space(sk);
> - sock_put(sk);
> + /*
> + * if sk_wmem_alloc reached 0, we are last user and should
> + * free this sock, as sk_free() call could not do it.
> + */
> + if (res == 0)
> + __sk_free(sk);
> }
> EXPORT_SYMBOL(sock_wfree);
>
Eric, I don't understand this part, please enlighten me :-)
Just because we've liberated all of the write buffer space, that does
not mean that it's time to kill off the socket completely.
Right?
^ permalink raw reply [flat|nested] 46+ messages in thread
* Re: [PATCH] net: No more expensive sock_hold()/sock_put() on each tx
2009-06-10 8:17 ` David Miller
@ 2009-06-10 8:30 ` Eric Dumazet
2009-06-11 9:56 ` David Miller
0 siblings, 1 reply; 46+ messages in thread
From: Eric Dumazet @ 2009-06-10 8:30 UTC (permalink / raw)
To: David Miller; +Cc: rusty, netdev
David Miller a écrit :
> From: Eric Dumazet <eric.dumazet@gmail.com>
> Date: Thu, 04 Jun 2009 11:18:35 +0200
>
>> @@ -1172,12 +1186,18 @@ void __init sk_init(void)
>> void sock_wfree(struct sk_buff *skb)
>> {
>> struct sock *sk = skb->sk;
>> + int res;
>>
>> /* In case it might be waiting for more memory. */
>> - atomic_sub(skb->truesize, &sk->sk_wmem_alloc);
>> + res = atomic_sub_return(skb->truesize, &sk->sk_wmem_alloc);
>> if (!sock_flag(sk, SOCK_USE_WRITE_QUEUE))
>> sk->sk_write_space(sk);
>> - sock_put(sk);
>> + /*
>> + * if sk_wmem_alloc reached 0, we are last user and should
>> + * free this sock, as sk_free() call could not do it.
>> + */
>> + if (res == 0)
>> + __sk_free(sk);
>> }
>> EXPORT_SYMBOL(sock_wfree);
>>
>
> Eric, I don't understand this part, please enlighten me :-)
>
> Just because we've liberated all of the write buffer space, that does
> not mean that it's time to kill off the socket completely.
>
> Right?
Remember we initialize this field to one.
If we freed all write buffer space, final value is one, not zero.
res == 0 only if we both freed all write buffer space, *and* socket was
also refcounted to 0 (sk_free() then realized it could not yet call __sk_free())
So we cheat a litle bit, because of this offset of one, we might block a sender a litle bit earlier :)
Thank you
^ permalink raw reply [flat|nested] 46+ messages in thread
* Re: [PATCH] net: No more expensive sock_hold()/sock_put() on each tx
2009-06-10 8:30 ` Eric Dumazet
@ 2009-06-11 9:56 ` David Miller
0 siblings, 0 replies; 46+ messages in thread
From: David Miller @ 2009-06-11 9:56 UTC (permalink / raw)
To: eric.dumazet; +Cc: rusty, netdev
From: Eric Dumazet <eric.dumazet@gmail.com>
Date: Wed, 10 Jun 2009 10:30:49 +0200
> David Miller a écrit :
>> From: Eric Dumazet <eric.dumazet@gmail.com>
>> Date: Thu, 04 Jun 2009 11:18:35 +0200
>>
>> Eric, I don't understand this part, please enlighten me :-)
>>
>> Just because we've liberated all of the write buffer space, that does
>> not mean that it's time to kill off the socket completely.
>>
>> Right?
>
> Remember we initialize this field to one.
>
> If we freed all write buffer space, final value is one, not zero.
>
> res == 0 only if we both freed all write buffer space, *and* socket was
> also refcounted to 0 (sk_free() then realized it could not yet call __sk_free())
>
> So we cheat a litle bit, because of this offset of one, we might block a sender a litle bit earlier :)
Now I understand, thanks!
Patch applied.
^ permalink raw reply [flat|nested] 46+ messages in thread
* Re: [PATCH 1/4] net: skb_orphan on dev_hard_start_xmit
2009-06-03 0:14 ` David Miller
2009-07-03 7:55 ` Herbert Xu
@ 2009-07-03 7:55 ` Herbert Xu
2009-07-04 3:02 ` David Miller
1 sibling, 1 reply; 46+ messages in thread
From: Herbert Xu @ 2009-07-03 7:55 UTC (permalink / raw)
To: David Miller
Cc: rusty, patrick.ohly, netdev, virtualization, divy, rolandd,
xemul, dcbw, libertas-dev
David Miller <davem@davemloft.net> wrote:
>
>> I think I'll do this in the driver for now, and let's revisit doing it
>> generically later?
>
> That might be the best course of action for the time being.
> This whole area is a rat's nest.
Calling skb_orphan like this should be forbidden. Apart from the
problems already raised, it is a sign that the driver is trying to
paper over a more serious issue of not cleaning up skb's timely.
Yes skb_orphan will work for the cases where calling the skb
destructor allows forward progress, but for the cases where you
really need to the skb to be freed (e.g., iSCSI or Xen), this
simply doesn't work.
So anytime someone tries to propose such a solution it is a sign
that they have bigger problems.
Cheers,
--
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu ~{PmV>HI~} <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt
^ permalink raw reply [flat|nested] 46+ messages in thread
* Re: [PATCH 1/4] net: skb_orphan on dev_hard_start_xmit
2009-06-03 0:14 ` David Miller
@ 2009-07-03 7:55 ` Herbert Xu
2009-07-03 7:55 ` Herbert Xu
1 sibling, 0 replies; 46+ messages in thread
From: Herbert Xu @ 2009-07-03 7:55 UTC (permalink / raw)
To: David Miller
Cc: patrick.ohly, libertas-dev, dcbw, netdev, virtualization,
rolandd, divy, xemul
David Miller <davem@davemloft.net> wrote:
>
>> I think I'll do this in the driver for now, and let's revisit doing it
>> generically later?
>
> That might be the best course of action for the time being.
> This whole area is a rat's nest.
Calling skb_orphan like this should be forbidden. Apart from the
problems already raised, it is a sign that the driver is trying to
paper over a more serious issue of not cleaning up skb's timely.
Yes skb_orphan will work for the cases where calling the skb
destructor allows forward progress, but for the cases where you
really need to the skb to be freed (e.g., iSCSI or Xen), this
simply doesn't work.
So anytime someone tries to propose such a solution it is a sign
that they have bigger problems.
Cheers,
--
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu ~{PmV>HI~} <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt
^ permalink raw reply [flat|nested] 46+ messages in thread
* Re: [PATCH 1/4] net: skb_orphan on dev_hard_start_xmit
2009-07-03 7:55 ` Herbert Xu
@ 2009-07-04 3:02 ` David Miller
2009-07-04 3:08 ` Herbert Xu
2009-07-04 3:08 ` Herbert Xu
0 siblings, 2 replies; 46+ messages in thread
From: David Miller @ 2009-07-04 3:02 UTC (permalink / raw)
To: herbert
Cc: patrick.ohly, libertas-dev, dcbw, netdev, virtualization,
rolandd, divy, xemul
From: Herbert Xu <herbert@gondor.apana.org.au>
Date: Fri, 3 Jul 2009 15:55:30 +0800
> Calling skb_orphan like this should be forbidden. Apart from the
> problems already raised, it is a sign that the driver is trying to
> paper over a more serious issue of not cleaning up skb's timely.
>
> Yes skb_orphan will work for the cases where calling the skb
> destructor allows forward progress, but for the cases where you
> really need to the skb to be freed (e.g., iSCSI or Xen), this
> simply doesn't work.
>
> So anytime someone tries to propose such a solution it is a sign
> that they have bigger problems.
Agreed, but alas we are foaming at the mouth until we have a truly
usable alternative.
In particular the case of handling a device without usable TX
completion event indications is still quite troublesome.
^ permalink raw reply [flat|nested] 46+ messages in thread
* Re: [PATCH 1/4] net: skb_orphan on dev_hard_start_xmit
2009-07-04 3:02 ` David Miller
@ 2009-07-04 3:08 ` Herbert Xu
2009-07-04 3:13 ` David Miller
2009-07-04 3:13 ` David Miller
2009-07-04 3:08 ` Herbert Xu
1 sibling, 2 replies; 46+ messages in thread
From: Herbert Xu @ 2009-07-04 3:08 UTC (permalink / raw)
To: David Miller
Cc: rusty, patrick.ohly, netdev, virtualization, divy, rolandd,
xemul, dcbw, libertas-dev
On Fri, Jul 03, 2009 at 08:02:54PM -0700, David Miller wrote:
>
> In particular the case of handling a device without usable TX
> completion event indications is still quite troublesome.
Which particular devices do you have in mind?
Cheers,
--
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu ~{PmV>HI~} <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt
^ permalink raw reply [flat|nested] 46+ messages in thread
* Re: [PATCH 1/4] net: skb_orphan on dev_hard_start_xmit
2009-07-04 3:02 ` David Miller
2009-07-04 3:08 ` Herbert Xu
@ 2009-07-04 3:08 ` Herbert Xu
1 sibling, 0 replies; 46+ messages in thread
From: Herbert Xu @ 2009-07-04 3:08 UTC (permalink / raw)
To: David Miller
Cc: patrick.ohly, libertas-dev, dcbw, netdev, virtualization,
rolandd, divy, xemul
On Fri, Jul 03, 2009 at 08:02:54PM -0700, David Miller wrote:
>
> In particular the case of handling a device without usable TX
> completion event indications is still quite troublesome.
Which particular devices do you have in mind?
Cheers,
--
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu ~{PmV>HI~} <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt
^ permalink raw reply [flat|nested] 46+ messages in thread
* Re: [PATCH 1/4] net: skb_orphan on dev_hard_start_xmit
2009-07-04 3:08 ` Herbert Xu
@ 2009-07-04 3:13 ` David Miller
2009-07-04 7:42 ` Herbert Xu
2009-07-04 7:42 ` Herbert Xu
2009-07-04 3:13 ` David Miller
1 sibling, 2 replies; 46+ messages in thread
From: David Miller @ 2009-07-04 3:13 UTC (permalink / raw)
To: herbert
Cc: rusty, patrick.ohly, netdev, virtualization, divy, rolandd,
xemul, dcbw, libertas-dev
From: Herbert Xu <herbert@gondor.apana.org.au>
Date: Sat, 4 Jul 2009 11:08:30 +0800
> On Fri, Jul 03, 2009 at 08:02:54PM -0700, David Miller wrote:
>>
>> In particular the case of handling a device without usable TX
>> completion event indications is still quite troublesome.
> e
> Which particular devices do you have in mind?
NIU
I basically can't defer interrupts because the chip supports
per-TX-desc interrupt indications but it lacks an "all TX queue sent"
event. So if, say, tell it to interrupt every 1/4 of the TX queue
then up to 1/4 of the queue can have packets "stuck" in there
if TX activity all of a sudden ceases.
The only thing I've come up with to be able to mitigate interrupts is
to use an hrtimer of some sort. But that's going to be hard to get
right, and who knows what kind of latencies will be introduced for TX
completion packet freeing unless I am very carefull.
And finally this belongs in generic code, not in the NIU driver,
whatever we come up with. Especially since my understanding is that
this is similar to what Rusty needs.
^ permalink raw reply [flat|nested] 46+ messages in thread
* Re: [PATCH 1/4] net: skb_orphan on dev_hard_start_xmit
2009-07-04 3:08 ` Herbert Xu
2009-07-04 3:13 ` David Miller
@ 2009-07-04 3:13 ` David Miller
1 sibling, 0 replies; 46+ messages in thread
From: David Miller @ 2009-07-04 3:13 UTC (permalink / raw)
To: herbert
Cc: patrick.ohly, libertas-dev, dcbw, netdev, virtualization,
rolandd, divy, xemul
From: Herbert Xu <herbert@gondor.apana.org.au>
Date: Sat, 4 Jul 2009 11:08:30 +0800
> On Fri, Jul 03, 2009 at 08:02:54PM -0700, David Miller wrote:
>>
>> In particular the case of handling a device without usable TX
>> completion event indications is still quite troublesome.
> e
> Which particular devices do you have in mind?
NIU
I basically can't defer interrupts because the chip supports
per-TX-desc interrupt indications but it lacks an "all TX queue sent"
event. So if, say, tell it to interrupt every 1/4 of the TX queue
then up to 1/4 of the queue can have packets "stuck" in there
if TX activity all of a sudden ceases.
The only thing I've come up with to be able to mitigate interrupts is
to use an hrtimer of some sort. But that's going to be hard to get
right, and who knows what kind of latencies will be introduced for TX
completion packet freeing unless I am very carefull.
And finally this belongs in generic code, not in the NIU driver,
whatever we come up with. Especially since my understanding is that
this is similar to what Rusty needs.
^ permalink raw reply [flat|nested] 46+ messages in thread
* Re: [PATCH 1/4] net: skb_orphan on dev_hard_start_xmit
2009-07-04 3:13 ` David Miller
@ 2009-07-04 7:42 ` Herbert Xu
2009-07-04 9:09 ` Herbert Xu
2009-07-04 9:09 ` Herbert Xu
2009-07-04 7:42 ` Herbert Xu
1 sibling, 2 replies; 46+ messages in thread
From: Herbert Xu @ 2009-07-04 7:42 UTC (permalink / raw)
To: David Miller
Cc: rusty, patrick.ohly, netdev, virtualization, divy, rolandd,
xemul, dcbw, libertas-dev
On Fri, Jul 03, 2009 at 08:13:47PM -0700, David Miller wrote:
>
> NIU
>
> I basically can't defer interrupts because the chip supports
> per-TX-desc interrupt indications but it lacks an "all TX queue sent"
> event. So if, say, tell it to interrupt every 1/4 of the TX queue
> then up to 1/4 of the queue can have packets "stuck" in there
> if TX activity all of a sudden ceases.
Here's an idea: We let the sender decide whether we need to enable
notification. This decision would be carried as a flag in the skb.
For example, UDP would set this flag when its socket buffer is close
to capacity. Routing would set this flag per NAPI run, etc.
Of course you'd ignore this flag completely if the qdisc queue is
non-empty.
Cheers,
--
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu ~{PmV>HI~} <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt
^ permalink raw reply [flat|nested] 46+ messages in thread
* Re: [PATCH 1/4] net: skb_orphan on dev_hard_start_xmit
2009-07-04 3:13 ` David Miller
2009-07-04 7:42 ` Herbert Xu
@ 2009-07-04 7:42 ` Herbert Xu
1 sibling, 0 replies; 46+ messages in thread
From: Herbert Xu @ 2009-07-04 7:42 UTC (permalink / raw)
To: David Miller
Cc: patrick.ohly, libertas-dev, dcbw, netdev, virtualization,
rolandd, divy, xemul
On Fri, Jul 03, 2009 at 08:13:47PM -0700, David Miller wrote:
>
> NIU
>
> I basically can't defer interrupts because the chip supports
> per-TX-desc interrupt indications but it lacks an "all TX queue sent"
> event. So if, say, tell it to interrupt every 1/4 of the TX queue
> then up to 1/4 of the queue can have packets "stuck" in there
> if TX activity all of a sudden ceases.
Here's an idea: We let the sender decide whether we need to enable
notification. This decision would be carried as a flag in the skb.
For example, UDP would set this flag when its socket buffer is close
to capacity. Routing would set this flag per NAPI run, etc.
Of course you'd ignore this flag completely if the qdisc queue is
non-empty.
Cheers,
--
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu ~{PmV>HI~} <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt
^ permalink raw reply [flat|nested] 46+ messages in thread
* Re: [PATCH 1/4] net: skb_orphan on dev_hard_start_xmit
2009-07-04 7:42 ` Herbert Xu
@ 2009-07-04 9:09 ` Herbert Xu
2009-07-05 3:26 ` Herbert Xu
2009-07-04 9:09 ` Herbert Xu
1 sibling, 1 reply; 46+ messages in thread
From: Herbert Xu @ 2009-07-04 9:09 UTC (permalink / raw)
To: David Miller
Cc: rusty, patrick.ohly, netdev, virtualization, divy, rolandd,
xemul, dcbw, libertas-dev
On Sat, Jul 04, 2009 at 03:42:45PM +0800, Herbert Xu wrote:
>
> Here's an idea: We let the sender decide whether we need to enable
> notification. This decision would be carried as a flag in the skb.
> For example, UDP would set this flag when its socket buffer is close
> to capacity. Routing would set this flag per NAPI run, etc.
Actually it doesn't even matter for routing because only those
that are charged by the skb's or the pages care and they're the
only ones that would need to set this.
One potential problem is if the socket is constantly running
close to capacity, but that should only happen if the device
TX queue is also close to capacity which means that the qdisc
queue should be non-empty.
Cheers,
--
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu ~{PmV>HI~} <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt
^ permalink raw reply [flat|nested] 46+ messages in thread
* Re: [PATCH 1/4] net: skb_orphan on dev_hard_start_xmit
2009-07-04 7:42 ` Herbert Xu
2009-07-04 9:09 ` Herbert Xu
@ 2009-07-04 9:09 ` Herbert Xu
1 sibling, 0 replies; 46+ messages in thread
From: Herbert Xu @ 2009-07-04 9:09 UTC (permalink / raw)
To: David Miller
Cc: patrick.ohly, libertas-dev, dcbw, netdev, virtualization,
rolandd, divy, xemul
On Sat, Jul 04, 2009 at 03:42:45PM +0800, Herbert Xu wrote:
>
> Here's an idea: We let the sender decide whether we need to enable
> notification. This decision would be carried as a flag in the skb.
> For example, UDP would set this flag when its socket buffer is close
> to capacity. Routing would set this flag per NAPI run, etc.
Actually it doesn't even matter for routing because only those
that are charged by the skb's or the pages care and they're the
only ones that would need to set this.
One potential problem is if the socket is constantly running
close to capacity, but that should only happen if the device
TX queue is also close to capacity which means that the qdisc
queue should be non-empty.
Cheers,
--
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu ~{PmV>HI~} <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt
^ permalink raw reply [flat|nested] 46+ messages in thread
* Re: [PATCH 1/4] net: skb_orphan on dev_hard_start_xmit
2009-07-04 9:09 ` Herbert Xu
@ 2009-07-05 3:26 ` Herbert Xu
2009-07-05 3:34 ` Herbert Xu
2009-07-05 3:34 ` Herbert Xu
0 siblings, 2 replies; 46+ messages in thread
From: Herbert Xu @ 2009-07-05 3:26 UTC (permalink / raw)
To: David Miller
Cc: patrick.ohly, libertas-dev, dcbw, netdev, virtualization,
rolandd, divy, xemul
On Sat, Jul 04, 2009 at 05:09:10PM +0800, Herbert Xu wrote:
>
> One potential problem is if the socket is constantly running
> close to capacity, but that should only happen if the device
> TX queue is also close to capacity which means that the qdisc
> queue should be non-empty.
Here's a another crazy idea:
Let's use dummy TX descriptors to generate an interrupt, either
with or without transmitting an actual packet on the wire depending
on the NIC.
xmit(skb)
if (TX queue contains no interrupting descriptor &&
qdisc is empty)
mark TX descriptor as interrupting
clean()
do work
if (TX queue contains no interrupting descriptor &&
TX queue non-empty &&
qdisc is empty)
send dummy TX descriptor
This is based on the assumption that in the time it takes for
the NIC to process one packet and interrupt us, we can generate
more packets. Obviously if we can't then even if the NIC had
a queue-empty interrupt capability it wouldn't help.
Cheers,
--
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu ~{PmV>HI~} <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt
^ permalink raw reply [flat|nested] 46+ messages in thread
* Re: [PATCH 1/4] net: skb_orphan on dev_hard_start_xmit
2009-07-05 3:26 ` Herbert Xu
2009-07-05 3:34 ` Herbert Xu
@ 2009-07-05 3:34 ` Herbert Xu
2009-08-18 1:47 ` David Miller
2009-08-18 1:47 ` David Miller
1 sibling, 2 replies; 46+ messages in thread
From: Herbert Xu @ 2009-07-05 3:34 UTC (permalink / raw)
To: David Miller
Cc: rusty, patrick.ohly, netdev, virtualization, divy, rolandd,
xemul, dcbw, libertas-dev
On Sun, Jul 05, 2009 at 11:26:58AM +0800, Herbert Xu wrote:
>
> Here's a another crazy idea:
>
> Let's use dummy TX descriptors to generate an interrupt, either
> with or without transmitting an actual packet on the wire depending
> on the NIC.
Here's an even crazier idea that doesn't use dummy descriptors.
xmit(skb)
if (TX queue contains no interrupting descriptor &&
qdisc is empty)
mark TX descriptor as interrupting
if (TX queue now contains an interrupting descriptor &&
qdisc len < 2)
stop queue
if (TX ring full)
stop queue
clean()
do work
wake queue as per usual
Cheers,
--
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu ~{PmV>HI~} <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt
^ permalink raw reply [flat|nested] 46+ messages in thread
* Re: [PATCH 1/4] net: skb_orphan on dev_hard_start_xmit
2009-07-05 3:26 ` Herbert Xu
@ 2009-07-05 3:34 ` Herbert Xu
2009-07-05 3:34 ` Herbert Xu
1 sibling, 0 replies; 46+ messages in thread
From: Herbert Xu @ 2009-07-05 3:34 UTC (permalink / raw)
To: David Miller
Cc: patrick.ohly, libertas-dev, dcbw, netdev, virtualization,
rolandd, divy, xemul
On Sun, Jul 05, 2009 at 11:26:58AM +0800, Herbert Xu wrote:
>
> Here's a another crazy idea:
>
> Let's use dummy TX descriptors to generate an interrupt, either
> with or without transmitting an actual packet on the wire depending
> on the NIC.
Here's an even crazier idea that doesn't use dummy descriptors.
xmit(skb)
if (TX queue contains no interrupting descriptor &&
qdisc is empty)
mark TX descriptor as interrupting
if (TX queue now contains an interrupting descriptor &&
qdisc len < 2)
stop queue
if (TX ring full)
stop queue
clean()
do work
wake queue as per usual
Cheers,
--
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu ~{PmV>HI~} <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt
^ permalink raw reply [flat|nested] 46+ messages in thread
* Re: [PATCH 1/4] net: skb_orphan on dev_hard_start_xmit
2009-07-05 3:34 ` Herbert Xu
@ 2009-08-18 1:47 ` David Miller
2009-08-19 3:19 ` Herbert Xu
2009-08-19 3:19 ` Herbert Xu
2009-08-18 1:47 ` David Miller
1 sibling, 2 replies; 46+ messages in thread
From: David Miller @ 2009-08-18 1:47 UTC (permalink / raw)
To: herbert
Cc: rusty, patrick.ohly, netdev, virtualization, divy, rolandd,
xemul, dcbw, libertas-dev
From: Herbert Xu <herbert@gondor.apana.org.au>
Date: Sun, 5 Jul 2009 11:34:08 +0800
> Here's an even crazier idea that doesn't use dummy descriptors.
>
> xmit(skb)
>
> if (TX queue contains no interrupting descriptor &&
> qdisc is empty)
> mark TX descriptor as interrupting
>
> if (TX queue now contains an interrupting descriptor &&
> qdisc len < 2)
> stop queue
>
> if (TX ring full)
> stop queue
>
> clean()
>
> do work
> wake queue as per usual
I'm pretty sure that for normal TCP and UDP workloads, this is just
going to set the interrupt bit on the first packet that gets into the
queue, and then not in the rest.
TCP just loops over packets in the send queue, and at initial state
the qdisc will be empty.
It's very hard to get this to work as well as if we had a real
queue empty interrupt status event.
Even if you get upstream status from the protocols saying "there's
more packets coming" via some flag in the SKB, that only says what one
client feeding the TX ring is about to do.
It says nothing about other threads of control which are about to start
feeding packets to the same device.
^ permalink raw reply [flat|nested] 46+ messages in thread
* Re: [PATCH 1/4] net: skb_orphan on dev_hard_start_xmit
2009-07-05 3:34 ` Herbert Xu
2009-08-18 1:47 ` David Miller
@ 2009-08-18 1:47 ` David Miller
1 sibling, 0 replies; 46+ messages in thread
From: David Miller @ 2009-08-18 1:47 UTC (permalink / raw)
To: herbert
Cc: patrick.ohly, libertas-dev, dcbw, netdev, virtualization,
rolandd, divy, xemul
From: Herbert Xu <herbert@gondor.apana.org.au>
Date: Sun, 5 Jul 2009 11:34:08 +0800
> Here's an even crazier idea that doesn't use dummy descriptors.
>
> xmit(skb)
>
> if (TX queue contains no interrupting descriptor &&
> qdisc is empty)
> mark TX descriptor as interrupting
>
> if (TX queue now contains an interrupting descriptor &&
> qdisc len < 2)
> stop queue
>
> if (TX ring full)
> stop queue
>
> clean()
>
> do work
> wake queue as per usual
I'm pretty sure that for normal TCP and UDP workloads, this is just
going to set the interrupt bit on the first packet that gets into the
queue, and then not in the rest.
TCP just loops over packets in the send queue, and at initial state
the qdisc will be empty.
It's very hard to get this to work as well as if we had a real
queue empty interrupt status event.
Even if you get upstream status from the protocols saying "there's
more packets coming" via some flag in the SKB, that only says what one
client feeding the TX ring is about to do.
It says nothing about other threads of control which are about to start
feeding packets to the same device.
^ permalink raw reply [flat|nested] 46+ messages in thread
* Re: [PATCH 1/4] net: skb_orphan on dev_hard_start_xmit
2009-08-18 1:47 ` David Miller
2009-08-19 3:19 ` Herbert Xu
@ 2009-08-19 3:19 ` Herbert Xu
2009-08-19 3:34 ` David Miller
2009-08-19 3:34 ` David Miller
1 sibling, 2 replies; 46+ messages in thread
From: Herbert Xu @ 2009-08-19 3:19 UTC (permalink / raw)
To: David Miller
Cc: rusty, patrick.ohly, netdev, virtualization, divy, rolandd,
xemul, dcbw, libertas-dev
On Mon, Aug 17, 2009 at 06:47:12PM -0700, David Miller wrote:
>
> I'm pretty sure that for normal TCP and UDP workloads, this is just
> going to set the interrupt bit on the first packet that gets into the
> queue, and then not in the rest.
>
> TCP just loops over packets in the send queue, and at initial state
> the qdisc will be empty.
The scheme I actually tried on the e1000e is not quite the same
as what I had here. I'm basically just adding descriptors per
usual.
However, I don't give them to the hardware immediately. Instead
they're held back so that on average we have about three descriptors
in the queue spaced equally that will cause interrupts. It seems
to work fairly well for netperf-generated TCP/UDP traffic in terms
of generating the right amount of interrupts (once every qlen/3
packets).
I haven't posted anything yet because I ran into a weird problem
with the e1000e. It was generating a lot more interrupts than
what I'm telling it to do. Even if I hard-code the interrupts
at 0, qlen/3, 2qlen/2 I still get way more than qlen/3 interrupts
for qlen packets.
This may be related to the fact that e1000e doesn't really have
a way of turning the interrupts off for a given descriptor. Well
actually it does but it also turns off status reporting which means
that we will never know whether that entry has finished processing.
So I've had to rely on using its TX interrupt delay mechanism as an
approximation of turning interrupt notification off. Evidently that
is not working in the way I intended it to.
I'm in the process of repeating the same experiment with cxgb3
which hopefully should let me turn interrupts off on descriptors
while still reporting completion status.
Cheers,
--
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu ~{PmV>HI~} <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt
^ permalink raw reply [flat|nested] 46+ messages in thread
* Re: [PATCH 1/4] net: skb_orphan on dev_hard_start_xmit
2009-08-18 1:47 ` David Miller
@ 2009-08-19 3:19 ` Herbert Xu
2009-08-19 3:19 ` Herbert Xu
1 sibling, 0 replies; 46+ messages in thread
From: Herbert Xu @ 2009-08-19 3:19 UTC (permalink / raw)
To: David Miller
Cc: patrick.ohly, libertas-dev, dcbw, netdev, virtualization,
rolandd, divy, xemul
On Mon, Aug 17, 2009 at 06:47:12PM -0700, David Miller wrote:
>
> I'm pretty sure that for normal TCP and UDP workloads, this is just
> going to set the interrupt bit on the first packet that gets into the
> queue, and then not in the rest.
>
> TCP just loops over packets in the send queue, and at initial state
> the qdisc will be empty.
The scheme I actually tried on the e1000e is not quite the same
as what I had here. I'm basically just adding descriptors per
usual.
However, I don't give them to the hardware immediately. Instead
they're held back so that on average we have about three descriptors
in the queue spaced equally that will cause interrupts. It seems
to work fairly well for netperf-generated TCP/UDP traffic in terms
of generating the right amount of interrupts (once every qlen/3
packets).
I haven't posted anything yet because I ran into a weird problem
with the e1000e. It was generating a lot more interrupts than
what I'm telling it to do. Even if I hard-code the interrupts
at 0, qlen/3, 2qlen/2 I still get way more than qlen/3 interrupts
for qlen packets.
This may be related to the fact that e1000e doesn't really have
a way of turning the interrupts off for a given descriptor. Well
actually it does but it also turns off status reporting which means
that we will never know whether that entry has finished processing.
So I've had to rely on using its TX interrupt delay mechanism as an
approximation of turning interrupt notification off. Evidently that
is not working in the way I intended it to.
I'm in the process of repeating the same experiment with cxgb3
which hopefully should let me turn interrupts off on descriptors
while still reporting completion status.
Cheers,
--
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu ~{PmV>HI~} <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt
^ permalink raw reply [flat|nested] 46+ messages in thread
* Re: [PATCH 1/4] net: skb_orphan on dev_hard_start_xmit
2009-08-19 3:19 ` Herbert Xu
2009-08-19 3:34 ` David Miller
@ 2009-08-19 3:34 ` David Miller
1 sibling, 0 replies; 46+ messages in thread
From: David Miller @ 2009-08-19 3:34 UTC (permalink / raw)
To: herbert
Cc: rusty, patrick.ohly, netdev, virtualization, divy, rolandd,
xemul, dcbw, libertas-dev
From: Herbert Xu <herbert@gondor.apana.org.au>
Date: Wed, 19 Aug 2009 13:19:26 +1000
> I'm in the process of repeating the same experiment with cxgb3
> which hopefully should let me turn interrupts off on descriptors
> while still reporting completion status.
Ok, I look forward to seeing your work however it turns out.
Once I see what you've done, I'll give it a spin on niu.
^ permalink raw reply [flat|nested] 46+ messages in thread
* Re: [PATCH 1/4] net: skb_orphan on dev_hard_start_xmit
2009-08-19 3:19 ` Herbert Xu
@ 2009-08-19 3:34 ` David Miller
2009-08-19 3:34 ` David Miller
1 sibling, 0 replies; 46+ messages in thread
From: David Miller @ 2009-08-19 3:34 UTC (permalink / raw)
To: herbert
Cc: patrick.ohly, libertas-dev, dcbw, netdev, virtualization,
rolandd, divy, xemul
From: Herbert Xu <herbert@gondor.apana.org.au>
Date: Wed, 19 Aug 2009 13:19:26 +1000
> I'm in the process of repeating the same experiment with cxgb3
> which hopefully should let me turn interrupts off on descriptors
> while still reporting completion status.
Ok, I look forward to seeing your work however it turns out.
Once I see what you've done, I'll give it a spin on niu.
^ permalink raw reply [flat|nested] 46+ messages in thread
* [PATCH 1/4] net: skb_orphan on dev_hard_start_xmit
@ 2009-05-29 14:14 Rusty Russell
0 siblings, 0 replies; 46+ messages in thread
From: Rusty Russell @ 2009-05-29 14:14 UTC (permalink / raw)
To: netdev
Cc: libertas-dev, Dan Williams, virtualization, Roland Dreier,
Divy Le Ray, Pavel Emelianov
Various drivers do skb_orphan() because they do not free transmitted
skbs in a timely manner (causing apps which hit their socket limits to
stall, socket close to hang, etc.).
DaveM points out that there are advantages to doing it generally (it's
more likely to be on same CPU than after xmit), and I couldn't find
any new starvation issues in simple benchmarking here.
This patch adds skb_orphan to the start of dev_hard_start_xmit(): it
can be premature in the NETDEV_TX_BUSY case, but that's uncommon.
I removed the drivers' skb_orphan(), though it's harmless.
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
Cc: Divy Le Ray <divy@chelsio.com>
Cc: Roland Dreier <rolandd@cisco.com>
Cc: Pavel Emelianov <xemul@openvz.org>
Cc: Dan Williams <dcbw@redhat.com>
Cc: libertas-dev@lists.infradead.org
---
drivers/net/cxgb3/sge.c | 27 ---------------------------
drivers/net/loopback.c | 2 --
drivers/net/mlx4/en_tx.c | 4 ----
drivers/net/niu.c | 3 +--
drivers/net/veth.c | 2 --
drivers/net/wireless/libertas/tx.c | 3 ---
net/core/dev.c | 21 +++++----------------
7 files changed, 6 insertions(+), 56 deletions(-)
diff --git a/drivers/net/cxgb3/sge.c b/drivers/net/cxgb3/sge.c
--- a/drivers/net/cxgb3/sge.c
+++ b/drivers/net/cxgb3/sge.c
@@ -1288,33 +1288,6 @@ int t3_eth_xmit(struct sk_buff *skb, str
dev->trans_start = jiffies;
spin_unlock(&q->lock);
- /*
- * We do not use Tx completion interrupts to free DMAd Tx packets.
- * This is good for performamce but means that we rely on new Tx
- * packets arriving to run the destructors of completed packets,
- * which open up space in their sockets' send queues. Sometimes
- * we do not get such new packets causing Tx to stall. A single
- * UDP transmitter is a good example of this situation. We have
- * a clean up timer that periodically reclaims completed packets
- * but it doesn't run often enough (nor do we want it to) to prevent
- * lengthy stalls. A solution to this problem is to run the
- * destructor early, after the packet is queued but before it's DMAd.
- * A cons is that we lie to socket memory accounting, but the amount
- * of extra memory is reasonable (limited by the number of Tx
- * descriptors), the packets do actually get freed quickly by new
- * packets almost always, and for protocols like TCP that wait for
- * acks to really free up the data the extra memory is even less.
- * On the positive side we run the destructors on the sending CPU
- * rather than on a potentially different completing CPU, usually a
- * good thing. We also run them without holding our Tx queue lock,
- * unlike what reclaim_completed_tx() would otherwise do.
- *
- * Run the destructor before telling the DMA engine about the packet
- * to make sure it doesn't complete and get freed prematurely.
- */
- if (likely(!skb_shared(skb)))
- skb_orphan(skb);
-
write_tx_pkt_wr(adap, skb, pi, pidx, gen, q, ndesc, compl);
check_ring_tx_db(adap, q);
return NETDEV_TX_OK;
diff --git a/drivers/net/loopback.c b/drivers/net/loopback.c
--- a/drivers/net/loopback.c
+++ b/drivers/net/loopback.c
@@ -72,8 +72,6 @@ static int loopback_xmit(struct sk_buff
{
struct pcpu_lstats *pcpu_lstats, *lb_stats;
- skb_orphan(skb);
-
skb->protocol = eth_type_trans(skb,dev);
/* it's OK to use per_cpu_ptr() because BHs are off */
diff --git a/drivers/net/mlx4/en_tx.c b/drivers/net/mlx4/en_tx.c
--- a/drivers/net/mlx4/en_tx.c
+++ b/drivers/net/mlx4/en_tx.c
@@ -807,10 +807,6 @@ int mlx4_en_xmit(struct sk_buff *skb, st
if (tx_desc == (struct mlx4_en_tx_desc *) ring->bounce_buf)
tx_desc = mlx4_en_bounce_to_desc(priv, ring, index, desc_size);
- /* Run destructor before passing skb to HW */
- if (likely(!skb_shared(skb)))
- skb_orphan(skb);
-
/* Ensure new descirptor hits memory
* before setting ownership of this descriptor to HW */
wmb();
diff --git a/drivers/net/niu.c b/drivers/net/niu.c
--- a/drivers/net/niu.c
+++ b/drivers/net/niu.c
@@ -6667,8 +6667,7 @@ static int niu_start_xmit(struct sk_buff
}
kfree_skb(skb);
skb = skb_new;
- } else
- skb_orphan(skb);
+ }
align = ((unsigned long) skb->data & (16 - 1));
headroom = align + sizeof(struct tx_pkt_hdr);
diff --git a/drivers/net/veth.c b/drivers/net/veth.c
--- a/drivers/net/veth.c
+++ b/drivers/net/veth.c
@@ -155,8 +155,6 @@ static int veth_xmit(struct sk_buff *skb
struct veth_net_stats *stats, *rcv_stats;
int length, cpu;
- skb_orphan(skb);
-
priv = netdev_priv(dev);
rcv = priv->peer;
rcv_priv = netdev_priv(rcv);
diff --git a/drivers/net/wireless/libertas/tx.c
b/drivers/net/wireless/libertas/tx.c
--- a/drivers/net/wireless/libertas/tx.c
+++ b/drivers/net/wireless/libertas/tx.c
@@ -154,9 +154,6 @@ int lbs_hard_start_xmit(struct sk_buff *
if (priv->monitormode) {
/* Keep the skb to echo it back once Tx feedback is
received from FW */
- skb_orphan(skb);
-
- /* Keep the skb around for when we get feedback */
priv->currenttxskb = skb;
} else {
free:
diff --git a/net/core/dev.c b/net/core/dev.c
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -1677,6 +1677,10 @@ int dev_hard_start_xmit(struct sk_buff *
const struct net_device_ops *ops = dev->netdev_ops;
int rc;
+ /* Call destructor here. It's SMP-cache-friendly and avoids issues
+ * with drivers which can hold transmitted skbs for long times */
+ skb_orphan(skb);
+
if (likely(!skb->next)) {
if (!list_empty(&ptype_all))
dev_queue_xmit_nit(skb, dev);
@@ -1688,22 +1692,7 @@ int dev_hard_start_xmit(struct sk_buff *
goto gso;
}
- rc = ops->ndo_start_xmit(skb, dev);
- /*
- * TODO: if skb_orphan() was called by
- * dev->hard_start_xmit() (for example, the unmodified
- * igb driver does that; bnx2 doesn't), then
- * skb_tx_software_timestamp() will be unable to send
- * back the time stamp.
- *
- * How can this be prevented? Always create another
- * reference to the socket before calling
- * dev->hard_start_xmit()? Prevent that skb_orphan()
- * does anything in dev->hard_start_xmit() by clearing
- * the skb destructor before the call and restoring it
- * afterwards, then doing the skb_orphan() ourselves?
- */
- return rc;
+ return ops->ndo_start_xmit(skb, dev);
}
gso:
^ permalink raw reply [flat|nested] 46+ messages in thread
end of thread, other threads:[~2009-08-19 3:34 UTC | newest]
Thread overview: 46+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2009-05-29 14:14 [PATCH 1/4] net: skb_orphan on dev_hard_start_xmit Rusty Russell
2009-05-29 15:11 ` Eric Dumazet
2009-06-01 12:27 ` Rusty Russell
2009-06-03 21:02 ` Eric Dumazet
2009-06-03 21:02 ` Eric Dumazet
2009-06-04 3:54 ` Rusty Russell
2009-06-04 4:00 ` David Miller
2009-06-04 4:00 ` David Miller
2009-06-04 4:54 ` Eric Dumazet
2009-06-04 4:56 ` David Miller
2009-06-04 9:18 ` [PATCH] net: No more expensive sock_hold()/sock_put() on each tx Eric Dumazet
2009-06-04 9:26 ` David Miller
2009-06-10 8:17 ` David Miller
2009-06-10 8:30 ` Eric Dumazet
2009-06-11 9:56 ` David Miller
2009-06-04 4:56 ` [PATCH 1/4] net: skb_orphan on dev_hard_start_xmit David Miller
2009-06-04 3:54 ` Rusty Russell
2009-05-29 15:11 ` Eric Dumazet
2009-06-01 19:47 ` Patrick Ohly
2009-06-01 19:47 ` Patrick Ohly
2009-06-02 7:25 ` David Miller
2009-06-02 7:25 ` David Miller
2009-06-02 14:08 ` Rusty Russell
2009-06-03 0:14 ` David Miller
2009-07-03 7:55 ` Herbert Xu
2009-07-03 7:55 ` Herbert Xu
2009-07-04 3:02 ` David Miller
2009-07-04 3:08 ` Herbert Xu
2009-07-04 3:13 ` David Miller
2009-07-04 7:42 ` Herbert Xu
2009-07-04 9:09 ` Herbert Xu
2009-07-05 3:26 ` Herbert Xu
2009-07-05 3:34 ` Herbert Xu
2009-07-05 3:34 ` Herbert Xu
2009-08-18 1:47 ` David Miller
2009-08-19 3:19 ` Herbert Xu
2009-08-19 3:19 ` Herbert Xu
2009-08-19 3:34 ` David Miller
2009-08-19 3:34 ` David Miller
2009-08-18 1:47 ` David Miller
2009-07-04 9:09 ` Herbert Xu
2009-07-04 7:42 ` Herbert Xu
2009-07-04 3:13 ` David Miller
2009-07-04 3:08 ` Herbert Xu
2009-06-02 14:08 ` Rusty Russell
-- strict thread matches above, loose matches on Subject: below --
2009-05-29 14:14 Rusty Russell
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.