All of lore.kernel.org
 help / color / mirror / Atom feed
* [PATCH net-next 2/2] bnx2x: use the default NAPI weight
@ 2013-03-06  1:57 Eric Dumazet
  2013-03-06  2:09 ` David Miller
  2013-03-06  4:59 ` David Miller
  0 siblings, 2 replies; 10+ messages in thread
From: Eric Dumazet @ 2013-03-06  1:57 UTC (permalink / raw)
  To: David Miller; +Cc: netdev, Eilon Greenstein

From: Eric Dumazet <edumazet@google.com>

BQL (Byte Queue Limits) proper operation needs TX completion
being serviced in a timely fashion.

bnx2x uses a non standard NAPI poll weight, and thats not fair to other
napi poll handlers, and even not reasonable.

Use the default value instead.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Eilon Greenstein <eilong@broadcom.com>
---
 drivers/net/ethernet/broadcom/bnx2x/bnx2x.h     |    1 -
 drivers/net/ethernet/broadcom/bnx2x/bnx2x_cmn.h |    4 ++--
 2 files changed, 2 insertions(+), 3 deletions(-)

diff --git a/drivers/net/ethernet/broadcom/bnx2x/bnx2x.h b/drivers/net/ethernet/broadcom/bnx2x/bnx2x.h
index e4605a9..9577cce 100644
--- a/drivers/net/ethernet/broadcom/bnx2x/bnx2x.h
+++ b/drivers/net/ethernet/broadcom/bnx2x/bnx2x.h
@@ -492,7 +492,6 @@ enum bnx2x_tpa_mode_t {
 struct bnx2x_fastpath {
 	struct bnx2x		*bp; /* parent */
 
-#define BNX2X_NAPI_WEIGHT       128
 	struct napi_struct	napi;
 	union host_hc_status_block	status_blk;
 	/* chip independed shortcuts into sb structure */
diff --git a/drivers/net/ethernet/broadcom/bnx2x/bnx2x_cmn.h b/drivers/net/ethernet/broadcom/bnx2x/bnx2x_cmn.h
index aee7671..8d158d8 100644
--- a/drivers/net/ethernet/broadcom/bnx2x/bnx2x_cmn.h
+++ b/drivers/net/ethernet/broadcom/bnx2x/bnx2x_cmn.h
@@ -834,7 +834,7 @@ static inline void bnx2x_add_all_napi_cnic(struct bnx2x *bp)
 	/* Add NAPI objects */
 	for_each_rx_queue_cnic(bp, i)
 		netif_napi_add(bp->dev, &bnx2x_fp(bp, i, napi),
-			       bnx2x_poll, BNX2X_NAPI_WEIGHT);
+			       bnx2x_poll, NAPI_POLL_WEIGHT);
 }
 
 static inline void bnx2x_add_all_napi(struct bnx2x *bp)
@@ -844,7 +844,7 @@ static inline void bnx2x_add_all_napi(struct bnx2x *bp)
 	/* Add NAPI objects */
 	for_each_eth_queue(bp, i)
 		netif_napi_add(bp->dev, &bnx2x_fp(bp, i, napi),
-			       bnx2x_poll, BNX2X_NAPI_WEIGHT);
+			       bnx2x_poll, NAPI_POLL_WEIGHT);
 }
 
 static inline void bnx2x_del_all_napi_cnic(struct bnx2x *bp)

^ permalink raw reply related	[flat|nested] 10+ messages in thread

* Re: [PATCH net-next 2/2] bnx2x: use the default NAPI weight
  2013-03-06  1:57 [PATCH net-next 2/2] bnx2x: use the default NAPI weight Eric Dumazet
@ 2013-03-06  2:09 ` David Miller
  2013-03-06  3:28   ` Eric Dumazet
  2013-03-06  4:59 ` David Miller
  1 sibling, 1 reply; 10+ messages in thread
From: David Miller @ 2013-03-06  2:09 UTC (permalink / raw)
  To: eric.dumazet; +Cc: netdev, eilong

From: Eric Dumazet <eric.dumazet@gmail.com>
Date: Tue, 05 Mar 2013 17:57:47 -0800

> BQL (Byte Queue Limits) proper operation needs TX completion
> being serviced in a timely fashion.
> 
> bnx2x uses a non standard NAPI poll weight, and thats not fair to other
> napi poll handlers, and even not reasonable.

Can you give some details about the situation in which you noticed
this?

It may be reason enough to target this for 'net' and -stable instead.

Thanks!

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [PATCH net-next 2/2] bnx2x: use the default NAPI weight
  2013-03-06  2:09 ` David Miller
@ 2013-03-06  3:28   ` Eric Dumazet
  2013-03-06  4:37     ` David Miller
  0 siblings, 1 reply; 10+ messages in thread
From: Eric Dumazet @ 2013-03-06  3:28 UTC (permalink / raw)
  To: David Miller; +Cc: netdev, eilong

On Tue, 2013-03-05 at 21:09 -0500, David Miller wrote:
> From: Eric Dumazet <eric.dumazet@gmail.com>
> Date: Tue, 05 Mar 2013 17:57:47 -0800
> 
> > BQL (Byte Queue Limits) proper operation needs TX completion
> > being serviced in a timely fashion.
> > 
> > bnx2x uses a non standard NAPI poll weight, and thats not fair to other
> > napi poll handlers, and even not reasonable.
> 
> Can you give some details about the situation in which you noticed
> this?
> 
> It may be reason enough to target this for 'net' and -stable instead.

Sure :

We are facing a fanout/fanin problem for a given application.

What happens is that link is mostly idle, and every couple of minutes,
we receive a burst of messages, needing a burst of transmits.

BQL starts with a limit of 0, meaning it needs some TX completion runs
to be able to detect the sudden increase of TX need and reach line rate.

bnx2x is really unfair in this scenario, because the NAPI poll budget is
very high, and aggregated packets (by LRO) count for a single packet in
the budget. (or even 0, see the "goto next_cqe;" that avoids the rx_pkt
++; in drivers/net/ethernet/broadcom/bnx2x/bnx2x_cmn.c line 1007)

We probably have other issues in the stack, I am patiently finding all
of them. In the mean time we had to disable BQL :(

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [PATCH net-next 2/2] bnx2x: use the default NAPI weight
  2013-03-06  3:28   ` Eric Dumazet
@ 2013-03-06  4:37     ` David Miller
  2013-03-06  7:03       ` Eric Dumazet
  0 siblings, 1 reply; 10+ messages in thread
From: David Miller @ 2013-03-06  4:37 UTC (permalink / raw)
  To: eric.dumazet; +Cc: netdev, eilong

From: Eric Dumazet <eric.dumazet@gmail.com>
Date: Tue, 05 Mar 2013 19:28:29 -0800

> We are facing a fanout/fanin problem for a given application.
> 
> What happens is that link is mostly idle, and every couple of minutes,
> we receive a burst of messages, needing a burst of transmits.
> 
> BQL starts with a limit of 0, meaning it needs some TX completion runs
> to be able to detect the sudden increase of TX need and reach line rate.
> 
> bnx2x is really unfair in this scenario, because the NAPI poll budget is
> very high, and aggregated packets (by LRO) count for a single packet in
> the budget. (or even 0, see the "goto next_cqe;" that avoids the rx_pkt
> ++; in drivers/net/ethernet/broadcom/bnx2x/bnx2x_cmn.c line 1007)
> 
> We probably have other issues in the stack, I am patiently finding all
> of them. In the mean time we had to disable BQL :(

Thanks for the explanation.

Since you haven't completely resolved the issues you were running into
I'll target this to net-next for now.

Thanks!

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [PATCH net-next 2/2] bnx2x: use the default NAPI weight
  2013-03-06  1:57 [PATCH net-next 2/2] bnx2x: use the default NAPI weight Eric Dumazet
  2013-03-06  2:09 ` David Miller
@ 2013-03-06  4:59 ` David Miller
  1 sibling, 0 replies; 10+ messages in thread
From: David Miller @ 2013-03-06  4:59 UTC (permalink / raw)
  To: eric.dumazet; +Cc: netdev, eilong

From: Eric Dumazet <eric.dumazet@gmail.com>
Date: Tue, 05 Mar 2013 17:57:47 -0800

> From: Eric Dumazet <edumazet@google.com>
> 
> BQL (Byte Queue Limits) proper operation needs TX completion
> being serviced in a timely fashion.
> 
> bnx2x uses a non standard NAPI poll weight, and thats not fair to other
> napi poll handlers, and even not reasonable.
> 
> Use the default value instead.
> 
> Signed-off-by: Eric Dumazet <edumazet@google.com>
> Cc: Eilon Greenstein <eilong@broadcom.com>

Applied.

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [PATCH net-next 2/2] bnx2x: use the default NAPI weight
  2013-03-06  4:37     ` David Miller
@ 2013-03-06  7:03       ` Eric Dumazet
  2013-03-06 19:59         ` David Miller
  0 siblings, 1 reply; 10+ messages in thread
From: Eric Dumazet @ 2013-03-06  7:03 UTC (permalink / raw)
  To: David Miller; +Cc: netdev, eilong

On Tue, 2013-03-05 at 23:37 -0500, David Miller wrote:

> Thanks for the explanation.
> 
> Since you haven't completely resolved the issues you were running into
> I'll target this to net-next for now.

Thanks David

An other issue is the spin_trylock() attempted in net_tx_action()

It seems we can miss a qdisc_run(), and have to wait the following
NET_TX softirq(s) to send more data. NET_RX being interleaved, we can
have to wait a long time (not mentioning other softirq handlers like
RCU ...)

I might be too tired right now, but cant see the reason of the trylock.

qdisc lock is already BH safe, so we should do a spinlock

I'll test the following patch tomorrow :

 net/core/dev.c |   21 +++++----------------
 1 file changed, 5 insertions(+), 16 deletions(-)

diff --git a/net/core/dev.c b/net/core/dev.c
index a06a7a5..d5d4573 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -3201,22 +3201,11 @@ static void net_tx_action(struct softirq_action *h)
 			head = head->next_sched;
 
 			root_lock = qdisc_lock(q);
-			if (spin_trylock(root_lock)) {
-				smp_mb__before_clear_bit();
-				clear_bit(__QDISC_STATE_SCHED,
-					  &q->state);
-				qdisc_run(q);
-				spin_unlock(root_lock);
-			} else {
-				if (!test_bit(__QDISC_STATE_DEACTIVATED,
-					      &q->state)) {
-					__netif_reschedule(q);
-				} else {
-					smp_mb__before_clear_bit();
-					clear_bit(__QDISC_STATE_SCHED,
-						  &q->state);
-				}
-			}
+			spin_lock(root_lock);
+			smp_mb__before_clear_bit();
+			clear_bit(__QDISC_STATE_SCHED, &q->state);
+			qdisc_run(q);
+			spin_unlock(root_lock);
 		}
 	}
 }

^ permalink raw reply related	[flat|nested] 10+ messages in thread

* Re: [PATCH net-next 2/2] bnx2x: use the default NAPI weight
  2013-03-06  7:03       ` Eric Dumazet
@ 2013-03-06 19:59         ` David Miller
  2013-03-06 21:52           ` Eric Dumazet
  0 siblings, 1 reply; 10+ messages in thread
From: David Miller @ 2013-03-06 19:59 UTC (permalink / raw)
  To: eric.dumazet; +Cc: netdev, eilong, jhs, herbert

From: Eric Dumazet <eric.dumazet@gmail.com>
Date: Tue, 05 Mar 2013 23:03:18 -0800

> On Tue, 2013-03-05 at 23:37 -0500, David Miller wrote:
> 
>> Thanks for the explanation.
>> 
>> Since you haven't completely resolved the issues you were running into
>> I'll target this to net-next for now.
> 
> Thanks David
> 
> An other issue is the spin_trylock() attempted in net_tx_action()
> 
> It seems we can miss a qdisc_run(), and have to wait the following
> NET_TX softirq(s) to send more data. NET_RX being interleaved, we can
> have to wait a long time (not mentioning other softirq handlers like
> RCU ...)
> 
> I might be too tired right now, but cant see the reason of the trylock.
> 
> qdisc lock is already BH safe, so we should do a spinlock
 ...
> @@ -3201,22 +3201,11 @@ static void net_tx_action(struct softirq_action *h)
>  			head = head->next_sched;
>  
>  			root_lock = qdisc_lock(q);
> -			if (spin_trylock(root_lock)) {
> -				smp_mb__before_clear_bit();
> -				clear_bit(__QDISC_STATE_SCHED,
> -					  &q->state);
> -				qdisc_run(q);
> -				spin_unlock(root_lock);

I think this trylock is intentional, but not to deal with BH safeness,
but rather to allow another cpu already processing the qdisc to
continue doing so.

I think this is what Jamal's amazing flash animations back at netconf
in Toronto were all about :-)

Herbert Xu and Jamal have touched upon this issue several times in
the past.

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [PATCH net-next 2/2] bnx2x: use the default NAPI weight
  2013-03-06 19:59         ` David Miller
@ 2013-03-06 21:52           ` Eric Dumazet
  2013-03-10  8:38             ` Herbert Xu
  0 siblings, 1 reply; 10+ messages in thread
From: Eric Dumazet @ 2013-03-06 21:52 UTC (permalink / raw)
  To: David Miller; +Cc: netdev, eilong, jhs, herbert, Tom Herbert

On Wed, 2013-03-06 at 14:59 -0500, David Miller wrote:
> From: Eric Dumazet <eric.dumazet@gmail.com>
> Date: Tue, 05 Mar 2013 23:03:18 -0800
> 
> > On Tue, 2013-03-05 at 23:37 -0500, David Miller wrote:
> > 
> >> Thanks for the explanation.
> >> 
> >> Since you haven't completely resolved the issues you were running into
> >> I'll target this to net-next for now.
> > 
> > Thanks David
> > 
> > An other issue is the spin_trylock() attempted in net_tx_action()
> > 
> > It seems we can miss a qdisc_run(), and have to wait the following
> > NET_TX softirq(s) to send more data. NET_RX being interleaved, we can
> > have to wait a long time (not mentioning other softirq handlers like
> > RCU ...)
> > 
> > I might be too tired right now, but cant see the reason of the trylock.
> > 
> > qdisc lock is already BH safe, so we should do a spinlock
>  ...
> > @@ -3201,22 +3201,11 @@ static void net_tx_action(struct softirq_action *h)
> >  			head = head->next_sched;
> >  
> >  			root_lock = qdisc_lock(q);
> > -			if (spin_trylock(root_lock)) {
> > -				smp_mb__before_clear_bit();
> > -				clear_bit(__QDISC_STATE_SCHED,
> > -					  &q->state);
> > -				qdisc_run(q);
> > -				spin_unlock(root_lock);
> 
> I think this trylock is intentional, but not to deal with BH safeness,
> but rather to allow another cpu already processing the qdisc to
> continue doing so.
> 
> I think this is what Jamal's amazing flash animations back at netconf
> in Toronto were all about :-)

Yes, but with :

- BQL (incurring more TX completion rounds and possibility to
block/unblock a qdisc)
- ticket spinlocks, and even with the guard of qdisc busylock

-> we can have a starvation problem.

I noticed on perf top sessions once cpu kept scheduling NET_TX softirqs
in (almost) infinite loops.

(if trylock() doesn't succeed, this cpu requeue this qdisc for another
net_tx_action() run)

BTW, I wonder if we should not exchange NET_TX_SOFTIRQ & NET_RX_SOFTIRQ 

Usually the net_rx_action() calls napi poll() and TX completion, and
netdev_tx_completed_queue() unblocks a qdisc (requesting a
netif_schedule_queue() -> scheduling a NT_TX_SOFTIRQ)

Or... maybe netdev_tx_completed_queue() should directly call qdisc_run()
instead of deferring it ?

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [PATCH net-next 2/2] bnx2x: use the default NAPI weight
  2013-03-06 21:52           ` Eric Dumazet
@ 2013-03-10  8:38             ` Herbert Xu
  2013-03-11  6:14               ` Eric Dumazet
  0 siblings, 1 reply; 10+ messages in thread
From: Herbert Xu @ 2013-03-10  8:38 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: David Miller, netdev, eilong, jhs, Tom Herbert

On Wed, Mar 06, 2013 at 01:52:57PM -0800, Eric Dumazet wrote:
>
> - BQL (incurring more TX completion rounds and possibility to
> block/unblock a qdisc)
> - ticket spinlocks, and even with the guard of qdisc busylock
> 
> -> we can have a starvation problem.

This only happens in cases where we aren't using multiqueue or
we're using it incorrectly, resulting in TX work from being split
over CPUs.

In that case it's not clear that it is starvation if we keep the
TX processing on one CPU.

Cheers,
-- 
Email: Herbert Xu <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [PATCH net-next 2/2] bnx2x: use the default NAPI weight
  2013-03-10  8:38             ` Herbert Xu
@ 2013-03-11  6:14               ` Eric Dumazet
  0 siblings, 0 replies; 10+ messages in thread
From: Eric Dumazet @ 2013-03-11  6:14 UTC (permalink / raw)
  To: Herbert Xu; +Cc: David Miller, netdev, eilong, jhs, Tom Herbert

On Sun, 2013-03-10 at 16:38 +0800, Herbert Xu wrote:
> On Wed, Mar 06, 2013 at 01:52:57PM -0800, Eric Dumazet wrote:
> >
> > - BQL (incurring more TX completion rounds and possibility to
> > block/unblock a qdisc)
> > - ticket spinlocks, and even with the guard of qdisc busylock
> > 
> > -> we can have a starvation problem.
> 
> This only happens in cases where we aren't using multiqueue or
> we're using it incorrectly, resulting in TX work from being split
> over CPUs.
> 

Or using qdisc like HTB ;)

> In that case it's not clear that it is starvation if we keep the
> TX processing on one CPU.

Thats not always the case. TX path has interesting features like XPS ...

We probably could add instrumentation and keep track of the maximum time
we are spending in this dark area.

^ permalink raw reply	[flat|nested] 10+ messages in thread

end of thread, other threads:[~2013-03-11  6:15 UTC | newest]

Thread overview: 10+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2013-03-06  1:57 [PATCH net-next 2/2] bnx2x: use the default NAPI weight Eric Dumazet
2013-03-06  2:09 ` David Miller
2013-03-06  3:28   ` Eric Dumazet
2013-03-06  4:37     ` David Miller
2013-03-06  7:03       ` Eric Dumazet
2013-03-06 19:59         ` David Miller
2013-03-06 21:52           ` Eric Dumazet
2013-03-10  8:38             ` Herbert Xu
2013-03-11  6:14               ` Eric Dumazet
2013-03-06  4:59 ` David Miller

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.