* Regression on TX throughput when using bonding
@ 2012-06-14  8:58 Jean-Michel Hautbois
  2012-06-14  9:21 ` Eric Dumazet
  0 siblings, 1 reply; 15+ messages in thread
From: Jean-Michel Hautbois @ 2012-06-14  8:58 UTC (permalink / raw)
  To: netdev; +Cc: Eric Dumazet

Hi all,

I have bisected a regression which concerns TX throughput when using bonding.
I tested only with 10Gbps cards, as it appears when bandwidth need is
over 1Gbps on my machine.
I send UDP multicast packets over bonding and observe the tc result.

When KO:
$>tc -s -d qdisc show dev eth1
qdisc pfifo_fast 0: root refcnt 2 bands 3 priomap  1 2 2 2 1 2 0 0 1 1 1 1 1 1 1 1
 Sent 1106527591 bytes 273802 pkt (dropped 306419, overlimits 0 requeues 223)
 backlog 0b 0p requeues 223

Of course, when OK, dropped is 0.
$>tc -s -d qdisc show dev eth1
qdisc pfifo_fast 0: root refcnt 2 bands 3 priomap  1 2 2 2 1 2 0 0 1 1 1 1 1 1 1 1
 Sent 1648662087 bytes 408009 pkt (dropped 0, overlimits 0 requeues 0)
 backlog 0b 0p requeues 0


Here is the offending commit:

fc6055a5ba31e2c14e36e8939f9bf2b6d586a7f5 is the first bad commit
commit fc6055a5ba31e2c14e36e8939f9bf2b6d586a7f5
Author: Eric Dumazet <eric.dumazet@gmail.com>
Date:   Fri Apr 16 12:18:22 2010 +0000

    net: Introduce skb_orphan_try()

    Transmitted skb might be attached to a socket and a destructor, for
    memory accounting purposes.

    Traditionally, this destructor is called at tx completion time, when skb
    is freed.

    When tx completion is performed by a cpu other than the sender's, this
    forces some cache lines to change ownership. XPS was an attempt to give
    tx completion back to the initial cpu.

    David's idea is to call the destructor right before giving the skb to
    the device (the call to ndo_start_xmit()). Because device queues are
    usually small, orphaning the skb before tx completion is not a big
    deal. Some drivers already do this; we can do it at the upper level.

    There is one known exception to this early orphaning, called tx
    timestamping. It needs to keep a reference to socket until device can
    give a hardware or software timestamp.

    This patch adds a skb_orphan_try() helper, to centralize all exceptions
    to early orphaning in one spot, and use it in dev_hard_start_xmit().

    "tbench 16" results on a Nehalem machine (2 X5570  @ 2.93GHz)
    before: Throughput 4428.9 MB/sec 16 procs
    after: Throughput 4448.14 MB/sec 16 procs

    UDP should get even better results, its destructor being more complex,
    since SOCK_USE_WRITE_QUEUE is not set (four atomic ops instead of one)

    Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
    Signed-off-by: David S. Miller <davem@davemloft.net>
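
What the commit calls "orphaning" is, in essence, running the destructor
early and detaching the socket. A simplified sketch of the idea (struct and
function names are mine, not the kernel's):

```c
#include <assert.h>
#include <stddef.h>

/* Simplified model of skb orphaning: the destructor gives the packet's
 * memory charge back to the sending socket, then the socket reference
 * is dropped from the skb.  Names are illustrative, not kernel code. */
struct toy_sock {
	long wmem_alloc;	/* bytes currently charged to the sender */
};

struct toy_skb {
	struct toy_sock *sk;
	long truesize;
};

static void toy_skb_orphan(struct toy_skb *skb)
{
	if (!skb->sk)
		return;
	skb->sk->wmem_alloc -= skb->truesize;	/* the "destructor" */
	skb->sk = NULL;
}
```

Once orphaned, a packet no longer counts against the sender's socket
accounting, which is exactly the behavior discussed in the rest of this
thread.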

JM

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: Regression on TX throughput when using bonding
  2012-06-14  8:58 Regression on TX throughput when using bonding Jean-Michel Hautbois
@ 2012-06-14  9:21 ` Eric Dumazet
  2012-06-14  9:40   ` Jean-Michel Hautbois
  2012-06-14  9:50   ` Eric Dumazet
  0 siblings, 2 replies; 15+ messages in thread
From: Eric Dumazet @ 2012-06-14  9:21 UTC (permalink / raw)
  To: Jean-Michel Hautbois; +Cc: netdev

On Thu, 2012-06-14 at 10:58 +0200, Jean-Michel Hautbois wrote:
> Hi all,
> 
> I have bisected a regression which concerns TX throughput when using bonding.
> I tested only with 10Gbps cards, as it appears when bandwidth need is
> over 1Gbps on my machine.
> I send UDP multicast packets over bonding and observe the tc result.
> 
> When KO:
> $>tc -s -d qdisc show dev eth1
> qdisc pfifo_fast 0: root refcnt 2 bands 3 priomap  1 2 2 2 1 2 0 0 1 1 1 1 1 1 1 1
>  Sent 1106527591 bytes 273802 pkt (dropped 306419, overlimits 0 requeues 223)
>  backlog 0b 0p requeues 223
> 
> Of course, when OK, dropped is 0.
> $>tc -s -d qdisc show dev eth1
> qdisc pfifo_fast 0: root refcnt 2 bands 3 priomap  1 2 2 2 1 2 0 0 1 1 1 1 1 1 1 1
>  Sent 1648662087 bytes 408009 pkt (dropped 0, overlimits 0 requeues 0)
>  backlog 0b 0p requeues 0
> 
> 
> Here is the offending commit:
> 
> fc6055a5ba31e2c14e36e8939f9bf2b6d586a7f5 is the first bad commit
> commit fc6055a5ba31e2c14e36e8939f9bf2b6d586a7f5

>     net: Introduce skb_orphan_try()

So you are saying that if you make skb_orphan_try() do nothing, it
solves your problem?

diff --git a/net/core/dev.c b/net/core/dev.c
index cd09819..6df40dd 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -2096,6 +2096,7 @@ static int dev_gso_segment(struct sk_buff *skb, netdev_features_t features)
  */
 static inline void skb_orphan_try(struct sk_buff *skb)
 {
+#if 0
 	struct sock *sk = skb->sk;
 
 	if (sk && !skb_shinfo(skb)->tx_flags) {
@@ -2106,6 +2107,7 @@ static inline void skb_orphan_try(struct sk_buff *skb)
 			skb->rxhash = sk->sk_hash;
 		skb_orphan(skb);
 	}
+#endif
 }
 
 static bool can_checksum_protocol(netdev_features_t features, __be16 protocol)

^ permalink raw reply related	[flat|nested] 15+ messages in thread

* Re: Regression on TX throughput when using bonding
  2012-06-14  9:21 ` Eric Dumazet
@ 2012-06-14  9:40   ` Jean-Michel Hautbois
  2012-06-14  9:50   ` Eric Dumazet
  1 sibling, 0 replies; 15+ messages in thread
From: Jean-Michel Hautbois @ 2012-06-14  9:40 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: netdev

2012/6/14 Eric Dumazet <eric.dumazet@gmail.com>:
> On Thu, 2012-06-14 at 10:58 +0200, Jean-Michel Hautbois wrote:
>> Hi all,
>>
>> I have bisected a regression which concerns TX throughput when using bonding.
>> I tested only with 10Gbps cards, as it appears when bandwidth need is
>> over 1Gbps on my machine.
>> I send UDP multicast packets over bonding and observe the tc result.
>>
>> When KO:
>> $>tc -s -d qdisc show dev eth1
>> qdisc pfifo_fast 0: root refcnt 2 bands 3 priomap  1 2 2 2 1 2 0 0 1 1 1 1 1 1 1 1
>>  Sent 1106527591 bytes 273802 pkt (dropped 306419, overlimits 0 requeues 223)
>>  backlog 0b 0p requeues 223
>>
>> Of course, when OK, dropped is 0.
>> $>tc -s -d qdisc show dev eth1
>> qdisc pfifo_fast 0: root refcnt 2 bands 3 priomap  1 2 2 2 1 2 0 0 1 1 1 1 1 1 1 1
>>  Sent 1648662087 bytes 408009 pkt (dropped 0, overlimits 0 requeues 0)
>>  backlog 0b 0p requeues 0
>>
>>
>> Here is the offending commit:
>>
>> fc6055a5ba31e2c14e36e8939f9bf2b6d586a7f5 is the first bad commit
>> commit fc6055a5ba31e2c14e36e8939f9bf2b6d586a7f5
>
>>     net: Introduce skb_orphan_try()
>
> So you are saying that if you make skb_orphan_try() do nothing, it
> solves your problem?
>
> diff --git a/net/core/dev.c b/net/core/dev.c
> index cd09819..6df40dd 100644
> --- a/net/core/dev.c
> +++ b/net/core/dev.c
> @@ -2096,6 +2096,7 @@ static int dev_gso_segment(struct sk_buff *skb, netdev_features_t features)
>  */
>  static inline void skb_orphan_try(struct sk_buff *skb)
>  {
> +#if 0
>        struct sock *sk = skb->sk;
>
>        if (sk && !skb_shinfo(skb)->tx_flags) {
> @@ -2106,6 +2107,7 @@ static inline void skb_orphan_try(struct sk_buff *skb)
>                        skb->rxhash = sk->sk_hash;
>                skb_orphan(skb);
>        }
> +#endif
>  }
>
>  static bool can_checksum_protocol(netdev_features_t features, __be16 protocol)
>

Yes:

[09:44:46]root@debian:~# uname -r
3.2.0
[09:44:50]root@debian:~# tc -s -d qdisc show dev eth1
qdisc mq 0: root
 Sent 11786861758 bytes 2916943 pkt (dropped 0, overlimits 0 requeues 0)
 backlog 0b 0p requeues 0

JM

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: Regression on TX throughput when using bonding
  2012-06-14  9:21 ` Eric Dumazet
  2012-06-14  9:40   ` Jean-Michel Hautbois
@ 2012-06-14  9:50   ` Eric Dumazet
  2012-06-14 10:00     ` David Miller
  2012-06-14 10:15     ` Regression on TX throughput when using bonding Jean-Michel Hautbois
  1 sibling, 2 replies; 15+ messages in thread
From: Eric Dumazet @ 2012-06-14  9:50 UTC (permalink / raw)
  To: Jean-Michel Hautbois; +Cc: netdev

On Thu, 2012-06-14 at 11:22 +0200, Eric Dumazet wrote:

> So you are saying that if you make skb_orphan_try() do nothing, it
> solves your problem?

It probably does, if your application does a UDP flood, trying to send
more than the link bandwidth. I guess only benchmark workloads ever try
to do that.

bonding has no way to give congestion feedback; it has no Qdisc by default.

We can probably defer the skb_orphan_try() for the bonding master, a bit
like the IFF_XMIT_DST_RELEASE flag.
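
To make the effect concrete, here is a toy model of a burst hitting a
device queue with and without early orphaning (all names and numbers are
illustrative, not kernel code):

```c
#include <assert.h>

/* Toy model of one burst hitting a device queue.  With early orphaning,
 * the socket's send-buffer charge is released before enqueue, so the
 * whole burst reaches the queue and the excess is dropped.  Without it,
 * a blocking sender stalls once `inflight_limit` packets are charged
 * against sk_sndbuf, so the queue never overflows: the application is
 * throttled instead of losing packets. */
static long burst_drops(long burst, long queue_limit, long inflight_limit,
			int early_orphan)
{
	if (!early_orphan && inflight_limit <= queue_limit)
		return 0;	/* backpressure kicks in before overflow */
	return burst > queue_limit ? burst - queue_limit : 0;
}
```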

 drivers/net/bonding/bond_main.c |    2 +-
 include/linux/if.h              |    3 +++
 net/core/dev.c                  |    5 +++--
 3 files changed, 7 insertions(+), 3 deletions(-)

diff --git a/drivers/net/bonding/bond_main.c b/drivers/net/bonding/bond_main.c
index 2ee8cf9..1b1e9c8 100644
--- a/drivers/net/bonding/bond_main.c
+++ b/drivers/net/bonding/bond_main.c
@@ -4343,7 +4343,7 @@ static void bond_setup(struct net_device *bond_dev)
 	bond_dev->tx_queue_len = 0;
 	bond_dev->flags |= IFF_MASTER|IFF_MULTICAST;
 	bond_dev->priv_flags |= IFF_BONDING;
-	bond_dev->priv_flags &= ~(IFF_XMIT_DST_RELEASE | IFF_TX_SKB_SHARING);
+	bond_dev->priv_flags &= ~(IFF_XMIT_DST_RELEASE | IFF_TX_SKB_SHARING | IFF_XMIT_ORPHAN);
 
 	/* At first, we block adding VLANs. That's the only way to
 	 * prevent problems that occur when adding VLANs over an
diff --git a/include/linux/if.h b/include/linux/if.h
index f995c66..a788e7b 100644
--- a/include/linux/if.h
+++ b/include/linux/if.h
@@ -81,6 +81,9 @@
 #define IFF_UNICAST_FLT	0x20000		/* Supports unicast filtering	*/
 #define IFF_TEAM_PORT	0x40000		/* device used as team port */
 #define IFF_SUPP_NOFCS	0x80000		/* device supports sending custom FCS */
+#define IFF_XMIT_ORPHAN	0x100000	/* dev_hard_start_xmit() is allowed to
+					 * orphan skb
+					 */
 
 
 #define IF_GET_IFACE	0x0001		/* for querying only */
diff --git a/net/core/dev.c b/net/core/dev.c
index cd09819..3435463 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -2193,7 +2193,8 @@ int dev_hard_start_xmit(struct sk_buff *skb, struct net_device *dev,
 		if (!list_empty(&ptype_all))
 			dev_queue_xmit_nit(skb, dev);
 
-		skb_orphan_try(skb);
+		if (dev->priv_flags & IFF_XMIT_ORPHAN)
+			skb_orphan_try(skb);
 
 		features = netif_skb_features(skb);
 
@@ -5929,7 +5930,7 @@ struct net_device *alloc_netdev_mqs(int sizeof_priv, const char *name,
 	INIT_LIST_HEAD(&dev->napi_list);
 	INIT_LIST_HEAD(&dev->unreg_list);
 	INIT_LIST_HEAD(&dev->link_watch_list);
-	dev->priv_flags = IFF_XMIT_DST_RELEASE;
+	dev->priv_flags = IFF_XMIT_DST_RELEASE | IFF_XMIT_ORPHAN;
 	setup(dev);
 
 	dev->num_tx_queues = txqs;

^ permalink raw reply related	[flat|nested] 15+ messages in thread

* Re: Regression on TX throughput when using bonding
  2012-06-14  9:50   ` Eric Dumazet
@ 2012-06-14 10:00     ` David Miller
  2012-06-14 10:07       ` Eric Dumazet
  2012-06-14 10:15     ` Regression on TX throughput when using bonding Jean-Michel Hautbois
  1 sibling, 1 reply; 15+ messages in thread
From: David Miller @ 2012-06-14 10:00 UTC (permalink / raw)
  To: eric.dumazet; +Cc: jhautbois, netdev

From: Eric Dumazet <eric.dumazet@gmail.com>
Date: Thu, 14 Jun 2012 11:50:17 +0200

> On Thu, 2012-06-14 at 11:22 +0200, Eric Dumazet wrote:
> 
>> So you are saying that if you make skb_orphan_try() do nothing, it
>> solves your problem?
> 
> It probably does, if your application does a UDP flood, trying to send
> more than the link bandwidth. I guess only benchmark workloads ever try
> to do that.

Eric, I just want to point out that back when this early orphaning
idea was being proposed I warned about this, and specifically I
mentioned that, for datagram sockets, the socket send buffer limits
are what provide proper rate control and fairness.

It also, therefore, protects the system from one datagram spammer
essentially taking over the network interface and blocking
out all other users.

Early orphaning breaks this completely.

I guess we decided that moving an atomic operation earlier is worth
all of this?

Now we are so addicted to the increased performance from early
orphaning that I fear we'll never be allowed back into that sane
state of affairs ever again.

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: Regression on TX throughput when using bonding
  2012-06-14 10:00     ` David Miller
@ 2012-06-14 10:07       ` Eric Dumazet
  2012-06-14 10:31         ` David Miller
  0 siblings, 1 reply; 15+ messages in thread
From: Eric Dumazet @ 2012-06-14 10:07 UTC (permalink / raw)
  To: David Miller; +Cc: jhautbois, netdev

On Thu, 2012-06-14 at 03:00 -0700, David Miller wrote:
> From: Eric Dumazet <eric.dumazet@gmail.com>
> Date: Thu, 14 Jun 2012 11:50:17 +0200
> 
> > On Thu, 2012-06-14 at 11:22 +0200, Eric Dumazet wrote:
> > 
> >> So you are saying that if you make skb_orphan_try() do nothing, it
> >> solves your problem?
> > 
> > It probably does, if your application does a UDP flood, trying to send
> > more than the link bandwidth. I guess only benchmark workloads ever try
> > to do that.
> 
> Eric, I just want to point out that back when this early orphaning
> idea was being proposed I warned about this, and specifically I
> mentioned that, for datagram sockets, the socket send buffer limits
> are what provide proper rate control and fairness.

If I remember correctly, the argument was that if the workload was using
thousands of sockets, the per-socket limit on in-flight packets would not
save you anyway. We would drop packets.


> It also, therefore, protects the system from one datagram spammer
> essentially taking over the network interface and blocking
> out all other users.
> 
> Early orphaning breaks this completely.
> 
> I guess we decided that moving an atomic operation earlier is worth
> all of this?

It was, but with BQL we should have far fewer packets in TX rings, so it
might be different today (on BQL-enabled NICs only).

> 
> Now we are so addicted to the increased performance from early
> orphaning that I fear we'll never be allowed back into that sane
> state of affairs ever again.


bonding (or other virtual devices) is special in the sense that
dev_hard_start_xmit() is called twice.

We should have a way to properly park packets in Qdiscs, and only do the
orphaning once the skb is given to the real device for 'immediate or so'
transmission.

The pppoe thread is only another manifestation of the same problem.

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: Regression on TX throughput when using bonding
  2012-06-14  9:50   ` Eric Dumazet
  2012-06-14 10:00     ` David Miller
@ 2012-06-14 10:15     ` Jean-Michel Hautbois
  2012-06-14 14:14       ` Jean-Michel Hautbois
  1 sibling, 1 reply; 15+ messages in thread
From: Jean-Michel Hautbois @ 2012-06-14 10:15 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: netdev

2012/6/14 Eric Dumazet <eric.dumazet@gmail.com>:
> On Thu, 2012-06-14 at 11:22 +0200, Eric Dumazet wrote:
>
>> So you are saying that if you make skb_orphan_try() do nothing, it
>> solves your problem?
>
> It probably does, if your application does a UDP flood, trying to send
> more than the link bandwidth. I guess only benchmark workloads ever try
> to do that.
>
> bonding has no way to give congestion feedback; it has no Qdisc by default.
>
> We can probably defer the skb_orphan_try() for the bonding master, a bit
> like the IFF_XMIT_DST_RELEASE flag.
>
>  drivers/net/bonding/bond_main.c |    2 +-
>  include/linux/if.h              |    3 +++
>  net/core/dev.c                  |    5 +++--
>  3 files changed, 7 insertions(+), 3 deletions(-)
>
> diff --git a/drivers/net/bonding/bond_main.c b/drivers/net/bonding/bond_main.c
> index 2ee8cf9..1b1e9c8 100644
> --- a/drivers/net/bonding/bond_main.c
> +++ b/drivers/net/bonding/bond_main.c
> @@ -4343,7 +4343,7 @@ static void bond_setup(struct net_device *bond_dev)
>        bond_dev->tx_queue_len = 0;
>        bond_dev->flags |= IFF_MASTER|IFF_MULTICAST;
>        bond_dev->priv_flags |= IFF_BONDING;
> -       bond_dev->priv_flags &= ~(IFF_XMIT_DST_RELEASE | IFF_TX_SKB_SHARING);
> +       bond_dev->priv_flags &= ~(IFF_XMIT_DST_RELEASE | IFF_TX_SKB_SHARING | IFF_XMIT_ORPHAN);
>
>        /* At first, we block adding VLANs. That's the only way to
>         * prevent problems that occur when adding VLANs over an
> diff --git a/include/linux/if.h b/include/linux/if.h
> index f995c66..a788e7b 100644
> --- a/include/linux/if.h
> +++ b/include/linux/if.h
> @@ -81,6 +81,9 @@
>  #define IFF_UNICAST_FLT        0x20000         /* Supports unicast filtering   */
>  #define IFF_TEAM_PORT  0x40000         /* device used as team port */
>  #define IFF_SUPP_NOFCS 0x80000         /* device supports sending custom FCS */
> +#define IFF_XMIT_ORPHAN        0x100000        /* dev_hard_start_xmit() is allowed to
> +                                        * orphan skb
> +                                        */
>
>
>  #define IF_GET_IFACE   0x0001          /* for querying only */
> diff --git a/net/core/dev.c b/net/core/dev.c
> index cd09819..3435463 100644
> --- a/net/core/dev.c
> +++ b/net/core/dev.c
> @@ -2193,7 +2193,8 @@ int dev_hard_start_xmit(struct sk_buff *skb, struct net_device *dev,
>                if (!list_empty(&ptype_all))
>                        dev_queue_xmit_nit(skb, dev);
>
> -               skb_orphan_try(skb);
> +               if (dev->priv_flags & IFF_XMIT_ORPHAN)
> +                       skb_orphan_try(skb);
>
>                features = netif_skb_features(skb);
>
> @@ -5929,7 +5930,7 @@ struct net_device *alloc_netdev_mqs(int sizeof_priv, const char *name,
>        INIT_LIST_HEAD(&dev->napi_list);
>        INIT_LIST_HEAD(&dev->unreg_list);
>        INIT_LIST_HEAD(&dev->link_watch_list);
> -       dev->priv_flags = IFF_XMIT_DST_RELEASE;
> +       dev->priv_flags = IFF_XMIT_DST_RELEASE | IFF_XMIT_ORPHAN;
>        setup(dev);
>
>        dev->num_tx_queues = txqs;
>
>

It works

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: Regression on TX throughput when using bonding
  2012-06-14 10:07       ` Eric Dumazet
@ 2012-06-14 10:31         ` David Miller
  2012-06-14 16:42           ` [PATCH] net: remove skb_orphan_try() Eric Dumazet
  0 siblings, 1 reply; 15+ messages in thread
From: David Miller @ 2012-06-14 10:31 UTC (permalink / raw)
  To: eric.dumazet; +Cc: jhautbois, netdev

From: Eric Dumazet <eric.dumazet@gmail.com>
Date: Thu, 14 Jun 2012 12:07:51 +0200

> We should have a way to properly park packets in Qdiscs, and only do the
> orphaning once the skb is given to the real device for 'immediate or so'
> transmission.

Ok.

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: Regression on TX throughput when using bonding
  2012-06-14 10:15     ` Regression on TX throughput when using bonding Jean-Michel Hautbois
@ 2012-06-14 14:14       ` Jean-Michel Hautbois
  2012-06-14 14:29         ` Eric Dumazet
  0 siblings, 1 reply; 15+ messages in thread
From: Jean-Michel Hautbois @ 2012-06-14 14:14 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: netdev

2012/6/14 Jean-Michel Hautbois <jhautbois@gmail.com>:
> 2012/6/14 Eric Dumazet <eric.dumazet@gmail.com>:
>> On Thu, 2012-06-14 at 11:22 +0200, Eric Dumazet wrote:
>>
>>> So you are saying that if you make skb_orphan_try() do nothing, it
>>> solves your problem?
>>
>> It probably does, if your application does a UDP flood, trying to send
>> more than the link bandwidth. I guess only benchmark workloads ever try
>> to do that.
>>
>> bonding has no way to give congestion feedback; it has no Qdisc by default.
>>
>> We can probably defer the skb_orphan_try() for the bonding master, a bit
>> like the IFF_XMIT_DST_RELEASE flag.
>>
>>  drivers/net/bonding/bond_main.c |    2 +-
>>  include/linux/if.h              |    3 +++
>>  net/core/dev.c                  |    5 +++--
>>  3 files changed, 7 insertions(+), 3 deletions(-)
>>
>> diff --git a/drivers/net/bonding/bond_main.c b/drivers/net/bonding/bond_main.c
>> index 2ee8cf9..1b1e9c8 100644
>> --- a/drivers/net/bonding/bond_main.c
>> +++ b/drivers/net/bonding/bond_main.c
>> @@ -4343,7 +4343,7 @@ static void bond_setup(struct net_device *bond_dev)
>>        bond_dev->tx_queue_len = 0;
>>        bond_dev->flags |= IFF_MASTER|IFF_MULTICAST;
>>        bond_dev->priv_flags |= IFF_BONDING;
>> -       bond_dev->priv_flags &= ~(IFF_XMIT_DST_RELEASE | IFF_TX_SKB_SHARING);
>> +       bond_dev->priv_flags &= ~(IFF_XMIT_DST_RELEASE | IFF_TX_SKB_SHARING | IFF_XMIT_ORPHAN);
>>
>>        /* At first, we block adding VLANs. That's the only way to
>>         * prevent problems that occur when adding VLANs over an
>> diff --git a/include/linux/if.h b/include/linux/if.h
>> index f995c66..a788e7b 100644
>> --- a/include/linux/if.h
>> +++ b/include/linux/if.h
>> @@ -81,6 +81,9 @@
>>  #define IFF_UNICAST_FLT        0x20000         /* Supports unicast filtering   */
>>  #define IFF_TEAM_PORT  0x40000         /* device used as team port */
>>  #define IFF_SUPP_NOFCS 0x80000         /* device supports sending custom FCS */
>> +#define IFF_XMIT_ORPHAN        0x100000        /* dev_hard_start_xmit() is allowed to
>> +                                        * orphan skb
>> +                                        */
>>
>>
>>  #define IF_GET_IFACE   0x0001          /* for querying only */
>> diff --git a/net/core/dev.c b/net/core/dev.c
>> index cd09819..3435463 100644
>> --- a/net/core/dev.c
>> +++ b/net/core/dev.c
>> @@ -2193,7 +2193,8 @@ int dev_hard_start_xmit(struct sk_buff *skb, struct net_device *dev,
>>                if (!list_empty(&ptype_all))
>>                        dev_queue_xmit_nit(skb, dev);
>>
>> -               skb_orphan_try(skb);
>> +               if (dev->priv_flags & IFF_XMIT_ORPHAN)
>> +                       skb_orphan_try(skb);
>>
>>                features = netif_skb_features(skb);
>>
>> @@ -5929,7 +5930,7 @@ struct net_device *alloc_netdev_mqs(int sizeof_priv, const char *name,
>>        INIT_LIST_HEAD(&dev->napi_list);
>>        INIT_LIST_HEAD(&dev->unreg_list);
>>        INIT_LIST_HEAD(&dev->link_watch_list);
>> -       dev->priv_flags = IFF_XMIT_DST_RELEASE;
>> +       dev->priv_flags = IFF_XMIT_DST_RELEASE | IFF_XMIT_ORPHAN;
>>        setup(dev);
>>
>>        dev->num_tx_queues = txqs;
>>
>>
>
> It works
For your information:
~# tc -s -d qdisc show dev eth1 > before_tc && sleep 10 && tc -s -d qdisc show dev eth1 > after_tc && ./beforeafter before_tc after_tc
qdisc mq 0: root
 Sent 3185900568 bytes 788681 pkt (dropped 0, overlimits 0 requeues 620)
 backlog 0b 0p requeues 620

As you can see, 2.5Gbps without any difficulties :).
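
The quoted figure can be checked from the byte delta itself; a small
helper (mine, not part of tc):

```c
#include <assert.h>

/* Mean throughput in bits per second from the byte-counter delta of two
 * `tc -s qdisc` snapshots taken `secs` seconds apart. */
static double mean_bps(unsigned long long byte_delta, double secs)
{
	return (double)byte_delta * 8.0 / secs;
}
```

3185900568 bytes over the 10 s window works out to about 2.55 Gbit/s.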

Thanks,
JM

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: Regression on TX throughput when using bonding
  2012-06-14 14:14       ` Jean-Michel Hautbois
@ 2012-06-14 14:29         ` Eric Dumazet
  2012-06-14 15:43           ` Jean-Michel Hautbois
  0 siblings, 1 reply; 15+ messages in thread
From: Eric Dumazet @ 2012-06-14 14:29 UTC (permalink / raw)
  To: Jean-Michel Hautbois; +Cc: netdev

On Thu, 2012-06-14 at 16:14 +0200, Jean-Michel Hautbois wrote:

> ~# tc -s -d qdisc show dev eth1 > before_tc && sleep 10 && tc -s -d qdisc show dev eth1 > after_tc && ./beforeafter before_tc after_tc
> qdisc mq 0: root
>  Sent 3185900568 bytes 788681 pkt (dropped 0, overlimits 0 requeues 620)
>  backlog 0b 0p requeues 620
> 
> As you can see, 2.5Gbps without any difficulties :).
> 
> Thanks,
> JM

I have no idea why the throughput on the ethernet link changed.

There is another bug elsewhere.  Use thousands of sockets instead of a
few, and you'll hit the bug.

Orphaning skbs should not lower the speed of the device; it only drops
excess packets instead of blocking the application while it waits for
the socket wmem alloc to be freed by the destructors.

Are you playing with process priorities?

If the ksoftirqd cannot run, this could explain the problem.

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: Regression on TX throughput when using bonding
  2012-06-14 14:29         ` Eric Dumazet
@ 2012-06-14 15:43           ` Jean-Michel Hautbois
  2012-06-14 17:46             ` Rick Jones
  0 siblings, 1 reply; 15+ messages in thread
From: Jean-Michel Hautbois @ 2012-06-14 15:43 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: netdev

2012/6/14 Eric Dumazet <eric.dumazet@gmail.com>:
> On Thu, 2012-06-14 at 16:14 +0200, Jean-Michel Hautbois wrote:
>
>> ~# tc -s -d qdisc show dev eth1 > before_tc && sleep 10 && tc -s -d qdisc show dev eth1 > after_tc && ./beforeafter before_tc after_tc
>> qdisc mq 0: root
>>  Sent 3185900568 bytes 788681 pkt (dropped 0, overlimits 0 requeues 620)
>>  backlog 0b 0p requeues 620
>>
>> As you can see, 2.5Gbps without any difficulties :).
>>
>> Thanks,
>> JM
>
> I have no idea why the throughput on the ethernet link changed.
>
> There is another bug elsewhere.  Use thousands of sockets instead of a
> few, and you'll hit the bug.
>
> Orphaning skbs should not lower the speed of the device; it only drops
> excess packets instead of blocking the application while it waits for
> the socket wmem alloc to be freed by the destructors.
>
> Are you playing with process priorities?
>
> If the ksoftirqd cannot run, this could explain the problem.
>

As suggested by Eric, here is a description that I hope is as precise as possible.
I send three RAW video frames, 1920x1088@30fps on three udp sockets to
the same NIC.
Each sending is in a thread, so I will focus on the numbers for one thread.

This generates bursts of send() calls: every 1/30 s, 3,133,440 bytes are
sent to the ethernet interface.
It is essentially equivalent to this:
while (n != 0)
{
  sendto(socket, packet, 4000);
  n -= 4000;
  packet += 4000;
}

My interface is a bond with a 10Gbps interface and MTU set to 4096.
This means each thread sends 784 packets on my interface every 1/30 s,
then waits for the next burst, and so on.
The videos are not necessarily the same video, so the threads may send
simultaneously or not...

My socket is in blocking mode.
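
As a sanity check, the numbers above are self-consistent, assuming 12 bits
per pixel (e.g. 4:2:0 planar video; that assumption is mine, chosen because
it reproduces the quoted frame size):

```c
#include <assert.h>

/* Sanity checks on the workload arithmetic.  The 12 bits/pixel figure
 * is an assumption; it is what makes the 3,133,440-byte frame size
 * come out for 1920x1088. */
static unsigned long frame_bytes(unsigned long width, unsigned long height,
				 unsigned long bits_per_pixel)
{
	return width * height * bits_per_pixel / 8;
}

static unsigned long packets_per_frame(unsigned long frame, unsigned long chunk)
{
	return (frame + chunk - 1) / chunk;	/* round up */
}
```

Per stream that is about 752 Mbit/s at 30 fps; three streams are roughly
2.26 Gbit/s, broadly in line with the throughput measured earlier in the
thread.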

JM

^ permalink raw reply	[flat|nested] 15+ messages in thread

* [PATCH] net: remove skb_orphan_try()
  2012-06-14 10:31         ` David Miller
@ 2012-06-14 16:42           ` Eric Dumazet
  2012-06-15  7:15             ` Oliver Hartkopp
  2012-06-15 22:31             ` David Miller
  0 siblings, 2 replies; 15+ messages in thread
From: Eric Dumazet @ 2012-06-14 16:42 UTC (permalink / raw)
  To: David Miller; +Cc: jhautbois, netdev

From: Eric Dumazet <edumazet@google.com>

On Thu, 2012-06-14 at 03:31 -0700, David Miller wrote:
> From: Eric Dumazet <eric.dumazet@gmail.com>

> > We should have a way to properly park packets in Qdiscs, and only do the
> > orphaning once the skb is given to the real device for 'immediate or so'
> > transmission.
> 
> Ok.

On the other hand, all this happens too late with BQL, since more
packets are parked in a Qdisc instead of being delivered with hot
caches.

Doing the orphaning once the packet has been enqueued and then dequeued
is probably not worth adding yet another test in the fast path.



[PATCH] net: remove skb_orphan_try()

Orphaning the skb in dev_hard_start_xmit() makes bonding behavior
unfriendly for applications sending big UDP bursts: once packets
pass the bonding device and reach the real device, they might hit a full
qdisc and be dropped. Without orphaning, the sender is automatically
throttled because sk->sk_wmem_alloc reaches sk->sk_sndbuf (assuming
sk_sndbuf is not too big).

We could try to defer the orphaning by adding another test in
dev_hard_start_xmit(), but all this seems of little gain,
now that BQL tends to make packets more likely to be parked
in Qdisc queues instead of the NIC TX ring, in cases where performance
matters.

Reverts commits:
fc6055a5ba31 net: Introduce skb_orphan_try()
87fd308cfc6b net: skb_tx_hash() fix relative to skb_orphan_try()
and removes SKBTX_DRV_NEEDS_SK_REF flag

Reported-and-bisected-by: Jean-Michel Hautbois <jhautbois@gmail.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
---
 include/linux/skbuff.h |    7 ++-----
 net/can/raw.c          |    3 ---
 net/core/dev.c         |   23 +----------------------
 net/iucv/af_iucv.c     |    1 -
 4 files changed, 3 insertions(+), 31 deletions(-)

diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
index b534a1b..642cb73 100644
--- a/include/linux/skbuff.h
+++ b/include/linux/skbuff.h
@@ -225,14 +225,11 @@ enum {
 	/* device driver is going to provide hardware time stamp */
 	SKBTX_IN_PROGRESS = 1 << 2,
 
-	/* ensure the originating sk reference is available on driver level */
-	SKBTX_DRV_NEEDS_SK_REF = 1 << 3,
-
 	/* device driver supports TX zero-copy buffers */
-	SKBTX_DEV_ZEROCOPY = 1 << 4,
+	SKBTX_DEV_ZEROCOPY = 1 << 3,
 
 	/* generate wifi status information (where possible) */
-	SKBTX_WIFI_STATUS = 1 << 5,
+	SKBTX_WIFI_STATUS = 1 << 4,
 };
 
 /*
diff --git a/net/can/raw.c b/net/can/raw.c
index cde1b4a..46cca3a 100644
--- a/net/can/raw.c
+++ b/net/can/raw.c
@@ -681,9 +681,6 @@ static int raw_sendmsg(struct kiocb *iocb, struct socket *sock,
 	if (err < 0)
 		goto free_skb;
 
-	/* to be able to check the received tx sock reference in raw_rcv() */
-	skb_shinfo(skb)->tx_flags |= SKBTX_DRV_NEEDS_SK_REF;
-
 	skb->dev = dev;
 	skb->sk  = sk;
 
diff --git a/net/core/dev.c b/net/core/dev.c
index cd09819..6df2140 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -2089,25 +2089,6 @@ static int dev_gso_segment(struct sk_buff *skb, netdev_features_t features)
 	return 0;
 }
 
-/*
- * Try to orphan skb early, right before transmission by the device.
- * We cannot orphan skb if tx timestamp is requested or the sk-reference
- * is needed on driver level for other reasons, e.g. see net/can/raw.c
- */
-static inline void skb_orphan_try(struct sk_buff *skb)
-{
-	struct sock *sk = skb->sk;
-
-	if (sk && !skb_shinfo(skb)->tx_flags) {
-		/* skb_tx_hash() wont be able to get sk.
-		 * We copy sk_hash into skb->rxhash
-		 */
-		if (!skb->rxhash)
-			skb->rxhash = sk->sk_hash;
-		skb_orphan(skb);
-	}
-}
-
 static bool can_checksum_protocol(netdev_features_t features, __be16 protocol)
 {
 	return ((features & NETIF_F_GEN_CSUM) ||
@@ -2193,8 +2174,6 @@ int dev_hard_start_xmit(struct sk_buff *skb, struct net_device *dev,
 		if (!list_empty(&ptype_all))
 			dev_queue_xmit_nit(skb, dev);
 
-		skb_orphan_try(skb);
-
 		features = netif_skb_features(skb);
 
 		if (vlan_tx_tag_present(skb) &&
@@ -2304,7 +2283,7 @@ u16 __skb_tx_hash(const struct net_device *dev, const struct sk_buff *skb,
 	if (skb->sk && skb->sk->sk_hash)
 		hash = skb->sk->sk_hash;
 	else
-		hash = (__force u16) skb->protocol ^ skb->rxhash;
+		hash = (__force u16) skb->protocol;
 	hash = jhash_1word(hash, hashrnd);
 
 	return (u16) (((u64) hash * qcount) >> 32) + qoffset;
diff --git a/net/iucv/af_iucv.c b/net/iucv/af_iucv.c
index 07d7d55..cd6f7a9 100644
--- a/net/iucv/af_iucv.c
+++ b/net/iucv/af_iucv.c
@@ -372,7 +372,6 @@ static int afiucv_hs_send(struct iucv_message *imsg, struct sock *sock,
 			skb_trim(skb, skb->dev->mtu);
 	}
 	skb->protocol = ETH_P_AF_IUCV;
-	skb_shinfo(skb)->tx_flags |= SKBTX_DRV_NEEDS_SK_REF;
 	nskb = skb_clone(skb, GFP_ATOMIC);
 	if (!nskb)
 		return -ENOMEM;

^ permalink raw reply related	[flat|nested] 15+ messages in thread

* Re: Regression on TX throughput when using bonding
  2012-06-14 15:43           ` Jean-Michel Hautbois
@ 2012-06-14 17:46             ` Rick Jones
  0 siblings, 0 replies; 15+ messages in thread
From: Rick Jones @ 2012-06-14 17:46 UTC (permalink / raw)
  To: Jean-Michel Hautbois; +Cc: Eric Dumazet, netdev

On 06/14/2012 08:43 AM, Jean-Michel Hautbois wrote:
> As suggested by Eric, here is a description that I hope is as precise as possible.
> I send three RAW video frames, 1920x1088@30fps on three udp sockets to
> the same NIC.
> Each sending is in a thread, so I will focus on the numbers for one thread.
>
> This generates bursts of send() calls: every 1/30 s, 3,133,440 bytes are
> sent to the ethernet interface.
> It is essentially equivalent to this:
> while (n != 0)
> {
>    sendto(socket, packet, 4000);
>    n -= 4000;
>    packet += 4000;
> }
>
> My interface is a bond with a 10Gbps interface and MTU set to 4096.
> This means each thread sends 784 packets on my interface every 1/30 s,
> then waits for the next burst, and so on.
> The videos are not necessarily the same video, so the threads may send
> simultaneously or not...
>
> My socket is in blocking mode.

If desired, here is how to simulate that with netperf:

./configure --enable-intervals
make

And an example over loopback:

raj@tardy:~/netperf2_trunk$ src/netperf -l 10 -t UDP_STREAM -H localhost 
-w 33 -b 783 -- -s 1M -S 1M -m 4000
MIGRATED UDP STREAM TEST from ::0 (::) port 0 AF_INET6 to tardy (::1) 
port 0 AF_INET6 : interval
Socket  Message  Elapsed      Messages
Size    Size     Time         Okay Errors   Throughput
bytes   bytes    secs            #      #   10^6bits/sec

2097152    4000   9.99       260739      0     835.18
2097152           9.99       260442            834.23


Adjust the -s and/or -S options to match what Jean-Michel's application 
uses for socket buffer sizes.  Run another two simultaneous instances to 
get the three streams.  Adjust the run length with the -l option.

happy benchmarking,

rick jones

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH] net: remove skb_orphan_try()
  2012-06-14 16:42           ` [PATCH] net: remove skb_orphan_try() Eric Dumazet
@ 2012-06-15  7:15             ` Oliver Hartkopp
  2012-06-15 22:31             ` David Miller
  1 sibling, 0 replies; 15+ messages in thread
From: Oliver Hartkopp @ 2012-06-15  7:15 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: David Miller, jhautbois, netdev

On 14.06.2012 18:42, Eric Dumazet wrote:


> 
> [PATCH] net: remove skb_orphan_try()
> 
> Orphaning the skb in dev_hard_start_xmit() makes bonding behavior
> unfriendly for applications sending big UDP bursts: once packets
> pass the bonding device and reach the real device, they might hit a
> full qdisc and be dropped. Without orphaning, the sender is
> automatically throttled because sk->sk_wmem_alloc reaches
> sk->sk_sndbuf (assuming sk_sndbuf is not too big).
> 
> We could try to defer the orphaning by adding another test in
> dev_hard_start_xmit(), but all this seems of little gain
> now that BQL tends to make packets more likely to be parked
> in Qdisc queues instead of the NIC TX ring, in cases where
> performance matters.
> 
> Reverts commits:
> fc6055a5ba31 net: Introduce skb_orphan_try()
> 87fd308cfc6b net: skb_tx_hash() fix relative to skb_orphan_try()
> and removes SKBTX_DRV_NEEDS_SK_REF flag
> 
> Reported-and-bisected-by: Jean-Michel Hautbois <jhautbois@gmail.com>
> Signed-off-by: Eric Dumazet <edumazet@google.com>
> ---
>  include/linux/skbuff.h |    7 ++-----
>  net/can/raw.c          |    3 ---
>  net/core/dev.c         |   23 +----------------------
>  net/iucv/af_iucv.c     |    1 -
>  4 files changed, 3 insertions(+), 31 deletions(-)
> 


As I originally introduced the SKBTX_DRV_NEEDS_SK_REF flag to preserve the skb
packet flow for CAN, I'm pretty happy to see skb_orphan_try() go away
again :-)

I also tested this patch with my test apps, which detect these problems.

Tested-by: Oliver Hartkopp <socketcan@hartkopp.net>
Acked-by: Oliver Hartkopp <socketcan@hartkopp.net>

Tnx Eric!

Regards,
Oliver

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH] net: remove skb_orphan_try()
  2012-06-14 16:42           ` [PATCH] net: remove skb_orphan_try() Eric Dumazet
  2012-06-15  7:15             ` Oliver Hartkopp
@ 2012-06-15 22:31             ` David Miller
  1 sibling, 0 replies; 15+ messages in thread
From: David Miller @ 2012-06-15 22:31 UTC (permalink / raw)
  To: eric.dumazet; +Cc: jhautbois, netdev

From: Eric Dumazet <eric.dumazet@gmail.com>
Date: Thu, 14 Jun 2012 18:42:44 +0200

> [PATCH] net: remove skb_orphan_try()
> 
> Orphaning the skb in dev_hard_start_xmit() makes bonding behavior
> unfriendly for applications sending big UDP bursts: once packets
> pass the bonding device and reach the real device, they might hit a
> full qdisc and be dropped. Without orphaning, the sender is
> automatically throttled because sk->sk_wmem_alloc reaches
> sk->sk_sndbuf (assuming sk_sndbuf is not too big).
> 
> We could try to defer the orphaning by adding another test in
> dev_hard_start_xmit(), but all this seems of little gain
> now that BQL tends to make packets more likely to be parked
> in Qdisc queues instead of the NIC TX ring, in cases where
> performance matters.
> 
> Reverts commits:
> fc6055a5ba31 net: Introduce skb_orphan_try()
> 87fd308cfc6b net: skb_tx_hash() fix relative to skb_orphan_try()
> and removes SKBTX_DRV_NEEDS_SK_REF flag
> 
> Reported-and-bisected-by: Jean-Michel Hautbois <jhautbois@gmail.com>
> Signed-off-by: Eric Dumazet <edumazet@google.com>

Applied, thanks Eric.

^ permalink raw reply	[flat|nested] 15+ messages in thread

end of thread, other threads:[~2012-06-15 22:31 UTC | newest]

Thread overview: 15+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2012-06-14  8:58 Regression on TX throughput when using bonding Jean-Michel Hautbois
2012-06-14  9:21 ` Eric Dumazet
2012-06-14  9:40   ` Jean-Michel Hautbois
2012-06-14  9:50   ` Eric Dumazet
2012-06-14 10:00     ` David Miller
2012-06-14 10:07       ` Eric Dumazet
2012-06-14 10:31         ` David Miller
2012-06-14 16:42           ` [PATCH] net: remove skb_orphan_try() Eric Dumazet
2012-06-15  7:15             ` Oliver Hartkopp
2012-06-15 22:31             ` David Miller
2012-06-14 10:15     ` Regression on TX throughput when using bonding Jean-Michel Hautbois
2012-06-14 14:14       ` Jean-Michel Hautbois
2012-06-14 14:29         ` Eric Dumazet
2012-06-14 15:43           ` Jean-Michel Hautbois
2012-06-14 17:46             ` Rick Jones
