* Re: [PATCH v2 net] net: solve a NAPI race
2017-02-27 14:21 ` [PATCH v2 " Eric Dumazet
@ 2017-02-27 16:19 ` David Miller
2017-02-27 16:44 ` Eric Dumazet
2017-02-27 21:00 ` Alexander Duyck
2017-02-27 20:18 ` [PATCH v3 " Eric Dumazet
2017-02-28 17:20 ` [PATCH v2 " Alexander Duyck
2 siblings, 2 replies; 34+ messages in thread
From: David Miller @ 2017-02-27 16:19 UTC (permalink / raw)
To: eric.dumazet; +Cc: netdev, tariqt, saeedm
From: Eric Dumazet <eric.dumazet@gmail.com>
Date: Mon, 27 Feb 2017 06:21:38 -0800
> A NAPI driver normally arms the IRQ after the napi_complete_done(),
> after NAPI_STATE_SCHED is cleared, so that the hard irq handler can grab
> it.
>
> The problem is that if another point in the stack grabs the
> NAPI_STATE_SCHED bit while IRQs are not disabled, an IRQ might later
> fire and find this bit set, right before napi_complete_done() clears it.
>
> This can happen with busy polling users, or if gro_flush_timeout is
> used. But some other uses of napi_schedule() in drivers can cause this
> as well.
>
> This patch adds a new NAPI_STATE_MISSED bit, that napi_schedule_prep()
> can set if it could not grab NAPI_STATE_SCHED
Various rules were meant to protect these sequences, and make sure
nothing like this race could happen.
Can you show the specific sequence that fails?
One of the basic protections is that the device IRQ is not re-enabled
until napi_complete_done() is finished, most drivers do something like
this:
napi_complete_done();
- sets NAPI_STATE_SCHED
enable device IRQ
So I don't understand how it is possible that "an IRQ might later fire
and find this bit set, right before napi_complete_done() clears it".
While napi_complete_done() is running, the device's IRQ is still
disabled, so there cannot be an IRQ firing before napi_complete_done()
is finished.
^ permalink raw reply [flat|nested] 34+ messages in thread
* Re: [PATCH v2 net] net: solve a NAPI race
2017-02-27 16:19 ` David Miller
@ 2017-02-27 16:44 ` Eric Dumazet
2017-02-27 17:10 ` Eric Dumazet
2017-02-28 2:08 ` David Miller
2017-02-27 21:00 ` Alexander Duyck
1 sibling, 2 replies; 34+ messages in thread
From: Eric Dumazet @ 2017-02-27 16:44 UTC (permalink / raw)
To: David Miller; +Cc: netdev, tariqt, saeedm
On Mon, 2017-02-27 at 11:19 -0500, David Miller wrote:
> Various rules were meant to protect these sequences, and make sure
> nothing like this race could happen.
>
> Can you show the specific sequence that fails?
>
> One of the basic protections is that the device IRQ is not re-enabled
> until napi_complete_done() is finished, most drivers do something like
> this:
>
> napi_complete_done();
> - sets NAPI_STATE_SCHED
> enable device IRQ
>
> So I don't understand how it is possible that "an IRQ might later fire
> and find this bit set, right before napi_complete_done() clears it".
>
> While napi_complete_done() is running, the device's IRQ is still
> disabled, so there cannot be an IRQ firing before napi_complete_done()
> is finished.
Any point calling napi_schedule() outside of the device hard irq handler
is subject to this race on NICs using some kind of edge-triggered interrupts.
Since we do not provide an ndo to disable device interrupts,
the following can happen.
thread 1                                 thread 2 (could be on same cpu)

// busy polling or napi_watchdog()
napi_schedule();
...
napi->poll()

device polling:
read 2 packets from ring buffer
                                         Additional 3rd packet is available.
                                         device hard irq
                                         // does nothing because NAPI_STATE_SCHED bit
                                         // is owned by thread 1
                                         napi_schedule();

napi_complete_done(napi, 2);
rearm_irq();
Note that rearm_irq() will not force the device to send an additional IRQ
for the packet it already signaled (the 3rd packet in my example).
At least for mlx4, only the 4th packet will trigger the IRQ again.
In the old days, the race would not happen since napi->poll() was called
in direct response to a prior device IRQ:
edge-triggered hard irqs from the device for this queue were already disabled.
* Re: [PATCH v2 net] net: solve a NAPI race
2017-02-27 16:44 ` Eric Dumazet
@ 2017-02-27 17:10 ` Eric Dumazet
2017-02-28 2:08 ` David Miller
1 sibling, 0 replies; 34+ messages in thread
From: Eric Dumazet @ 2017-02-27 17:10 UTC (permalink / raw)
To: David Miller; +Cc: netdev, tariqt, saeedm
On Mon, 2017-02-27 at 08:44 -0800, Eric Dumazet wrote:
> // busy polling or napi_watchdog()
BTW, we can also add this to the beginning of busy_poll_stop():
clear_bit(NAPI_STATE_MISSED, &napi->state);
* Re: [PATCH v2 net] net: solve a NAPI race
2017-02-27 16:44 ` Eric Dumazet
2017-02-27 17:10 ` Eric Dumazet
@ 2017-02-28 2:08 ` David Miller
2017-03-01 0:22 ` Francois Romieu
1 sibling, 1 reply; 34+ messages in thread
From: David Miller @ 2017-02-28 2:08 UTC (permalink / raw)
To: eric.dumazet; +Cc: netdev, tariqt, saeedm
From: Eric Dumazet <eric.dumazet@gmail.com>
Date: Mon, 27 Feb 2017 08:44:14 -0800
> Any point calling napi_schedule() outside of the device hard irq
> handler is subject to this race on NICs using some kind of
> edge-triggered interrupts.
>
> Since we do not provide an ndo to disable device interrupts, the
> following can happen.
Ok, now I understand.
I think even without considering the race you are trying to solve,
this situation is really dangerous.
I am sure that every ->poll() handler out there was written by an
author who completely assumed that if they are executing then the
device's interrupts for that NAPI instance are disabled. And this is
with very few, if any, exceptions.
So if we saw a driver doing something like:
reg->irq_enable ^= value;
after napi_complete_done(), it would be quite understandable.
We really made a mistake taking the napi_schedule() call out of
the domain of the driver so that it could manage the interrupt
state properly.
I'm not against your missed-bit fix as a short-term cure for now; it's
just that somewhere down the road we need to manage the interrupts
properly.
* Re: [PATCH v2 net] net: solve a NAPI race
2017-02-28 2:08 ` David Miller
@ 2017-03-01 0:22 ` Francois Romieu
2017-03-01 1:04 ` Stephen Hemminger
0 siblings, 1 reply; 34+ messages in thread
From: Francois Romieu @ 2017-03-01 0:22 UTC (permalink / raw)
To: David Miller; +Cc: eric.dumazet, netdev, tariqt, saeedm
David Miller <davem@davemloft.net> :
> From: Eric Dumazet <eric.dumazet@gmail.com>
> Date: Mon, 27 Feb 2017 08:44:14 -0800
>
> > Any point calling napi_schedule() outside of the device hard irq
> > handler is subject to this race on NICs using some kind of
> > edge-triggered interrupts.
> >
> > Since we do not provide an ndo to disable device interrupts, the
> > following can happen.
>
> Ok, now I understand.
>
> I think even without considering the race you are trying to solve,
> this situation is really dangerous.
>
> I am sure that every ->poll() handler out there was written by an
> author who completely assumed that if they are executing then the
> device's interrupts for that NAPI instance are disabled. And this is
> with very few, if any, exceptions.
Shareable PCI irqs used to remind authors that such an assumption was
not always right. OTOH it was still manageable as long as only
level-triggered irqs were involved.
--
Ueimor
* Re: [PATCH v2 net] net: solve a NAPI race
2017-03-01 0:22 ` Francois Romieu
@ 2017-03-01 1:04 ` Stephen Hemminger
0 siblings, 0 replies; 34+ messages in thread
From: Stephen Hemminger @ 2017-03-01 1:04 UTC (permalink / raw)
To: Francois Romieu; +Cc: David Miller, eric.dumazet, netdev, tariqt, saeedm
On Wed, 1 Mar 2017 01:22:40 +0100
Francois Romieu <romieu@fr.zoreil.com> wrote:
> David Miller <davem@davemloft.net> :
> > From: Eric Dumazet <eric.dumazet@gmail.com>
> > Date: Mon, 27 Feb 2017 08:44:14 -0800
> >
> > > Any point calling napi_schedule() outside of the device hard irq
> > > handler is subject to this race on NICs using some kind of
> > > edge-triggered interrupts.
> > >
> > > Since we do not provide an ndo to disable device interrupts, the
> > > following can happen.
> >
> > Ok, now I understand.
> >
> > I think even without considering the race you are trying to solve,
> > this situation is really dangerous.
> >
> > I am sure that every ->poll() handler out there was written by an
> > author who completely assumed that if they are executing then the
> > device's interrupts for that NAPI instance are disabled. And this is
> > with very few, if any, exceptions.
>
> Shareable PCI irqs used to remind authors that such an assumption was
> not always right. OTOH it was still manageable as long as only
> level-triggered irqs were involved.
>
When I had to deal with that in sky2, the best way was to have a single
NAPI poll handler shared between both ports. Works well and avoids races
in interrupt handling and enabling.
* Re: [PATCH v2 net] net: solve a NAPI race
2017-02-27 16:19 ` David Miller
2017-02-27 16:44 ` Eric Dumazet
@ 2017-02-27 21:00 ` Alexander Duyck
1 sibling, 0 replies; 34+ messages in thread
From: Alexander Duyck @ 2017-02-27 21:00 UTC (permalink / raw)
To: David Miller; +Cc: Eric Dumazet, Netdev, Tariq Toukan, Saeed Mahameed
On Mon, Feb 27, 2017 at 8:19 AM, David Miller <davem@davemloft.net> wrote:
> From: Eric Dumazet <eric.dumazet@gmail.com>
> Date: Mon, 27 Feb 2017 06:21:38 -0800
>
>> A NAPI driver normally arms the IRQ after the napi_complete_done(),
>> after NAPI_STATE_SCHED is cleared, so that the hard irq handler can grab
>> it.
>>
>> The problem is that if another point in the stack grabs the
>> NAPI_STATE_SCHED bit while IRQs are not disabled, an IRQ might later
>> fire and find this bit set, right before napi_complete_done() clears it.
>>
>> This can happen with busy polling users, or if gro_flush_timeout is
>> used. But some other uses of napi_schedule() in drivers can cause this
>> as well.
>>
>> This patch adds a new NAPI_STATE_MISSED bit, that napi_schedule_prep()
>> can set if it could not grab NAPI_STATE_SCHED
>
> Various rules were meant to protect these sequences, and make sure
> nothing like this race could happen.
>
> Can you show the specific sequence that fails?
>
> One of the basic protections is that the device IRQ is not re-enabled
> until napi_complete_done() is finished, most drivers do something like
> this:
>
> napi_complete_done();
> - sets NAPI_STATE_SCHED
> enable device IRQ
>
> So I don't understand how it is possible that "an IRQ might later fire
> and find this bit set, right before napi_complete_done() clears it".
>
> While napi_complete_done() is running, the device's IRQ is still
> disabled, so there cannot be an IRQ firing before napi_complete_done()
> is finished.
So there are some drivers that will need to have interrupts enabled
when busy polling, and I assume that can cause this kind of issue.
Specifically, in the case of i40e the part will not flush completed
descriptors until either 4 completed descriptors are ready to be
written back, or an interrupt fires.
Our other drivers have code in them that will force the interrupt to
unmask and fire once every 2 seconds in the unlikely event that an
interrupt was lost, which can occur on some platforms.
- Alex
* [PATCH v3 net] net: solve a NAPI race
2017-02-27 14:21 ` [PATCH v2 " Eric Dumazet
2017-02-27 16:19 ` David Miller
@ 2017-02-27 20:18 ` Eric Dumazet
2017-02-27 22:14 ` Stephen Hemminger
` (2 more replies)
2017-02-28 17:20 ` [PATCH v2 " Alexander Duyck
2 siblings, 3 replies; 34+ messages in thread
From: Eric Dumazet @ 2017-02-27 20:18 UTC (permalink / raw)
To: David Miller; +Cc: netdev
From: Eric Dumazet <edumazet@google.com>
While playing with mlx4 hardware timestamping of RX packets, I found
that some packets were received by the TCP stack with a ~200 ms delay...
Since the timestamp was provided by the NIC, and my probe was added
in tcp_v4_rcv() while in the BH handler, I was confident it was not
a sender issue, or a drop in the network.
This would happen with a very low probability, but was hurting RPC
workloads.
A NAPI driver normally arms the IRQ after the napi_complete_done(),
after NAPI_STATE_SCHED is cleared, so that the hard irq handler can grab
it.
The problem is that if another point in the stack grabs the
NAPI_STATE_SCHED bit while IRQs are not disabled, an IRQ might later
fire and find this bit set, right before napi_complete_done() clears it.
This can happen with busy polling users, or if gro_flush_timeout is
used. But some other uses of napi_schedule() in drivers can cause this
as well.
thread 1                                 thread 2 (could be on same cpu)

// busy polling or napi_watchdog()
napi_schedule();
...
napi->poll()

device polling:
read 2 packets from ring buffer
                                         Additional 3rd packet is available.
                                         device hard irq
                                         // does nothing because NAPI_STATE_SCHED bit
                                         // is owned by thread 1
                                         napi_schedule();

napi_complete_done(napi, 2);
rearm_irq();
Note that rearm_irq() will not force the device to send an additional
IRQ for the packet it already signaled (the 3rd packet in my example).

This patch adds a new NAPI_STATE_MISSED bit that napi_schedule_prep()
can set if it could not grab NAPI_STATE_SCHED.
Then napi_complete_done() properly reschedules the napi to make sure
we do not miss something.
Since we manipulate multiple bits at once, use cmpxchg() like in
sk_busy_loop() to provide proper transactions.
In v2, I changed napi_watchdog() to use a relaxed variant of
napi_schedule_prep(): no need to set NAPI_STATE_MISSED from this point.
In v3, I added more details in the changelog and cleared
NAPI_STATE_MISSED in busy_poll_stop().
Signed-off-by: Eric Dumazet <edumazet@google.com>
---
include/linux/netdevice.h | 29 +++++-----------
net/core/dev.c | 63 +++++++++++++++++++++++++++++++++---
2 files changed, 68 insertions(+), 24 deletions(-)
diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
index f40f0ab3847a8caaf46bd4d5f224c65014f501cc..97456b2539e46d6232dda804f6a434db6fd7134f 100644
--- a/include/linux/netdevice.h
+++ b/include/linux/netdevice.h
@@ -330,6 +330,7 @@ struct napi_struct {
enum {
NAPI_STATE_SCHED, /* Poll is scheduled */
+ NAPI_STATE_MISSED, /* reschedule a napi */
NAPI_STATE_DISABLE, /* Disable pending */
NAPI_STATE_NPSVC, /* Netpoll - don't dequeue from poll_list */
NAPI_STATE_HASHED, /* In NAPI hash (busy polling possible) */
@@ -338,12 +339,13 @@ enum {
};
enum {
- NAPIF_STATE_SCHED = (1UL << NAPI_STATE_SCHED),
- NAPIF_STATE_DISABLE = (1UL << NAPI_STATE_DISABLE),
- NAPIF_STATE_NPSVC = (1UL << NAPI_STATE_NPSVC),
- NAPIF_STATE_HASHED = (1UL << NAPI_STATE_HASHED),
- NAPIF_STATE_NO_BUSY_POLL = (1UL << NAPI_STATE_NO_BUSY_POLL),
- NAPIF_STATE_IN_BUSY_POLL = (1UL << NAPI_STATE_IN_BUSY_POLL),
+ NAPIF_STATE_SCHED = BIT(NAPI_STATE_SCHED),
+ NAPIF_STATE_MISSED = BIT(NAPI_STATE_MISSED),
+ NAPIF_STATE_DISABLE = BIT(NAPI_STATE_DISABLE),
+ NAPIF_STATE_NPSVC = BIT(NAPI_STATE_NPSVC),
+ NAPIF_STATE_HASHED = BIT(NAPI_STATE_HASHED),
+ NAPIF_STATE_NO_BUSY_POLL = BIT(NAPI_STATE_NO_BUSY_POLL),
+ NAPIF_STATE_IN_BUSY_POLL = BIT(NAPI_STATE_IN_BUSY_POLL),
};
enum gro_result {
@@ -414,20 +416,7 @@ static inline bool napi_disable_pending(struct napi_struct *n)
return test_bit(NAPI_STATE_DISABLE, &n->state);
}
-/**
- * napi_schedule_prep - check if NAPI can be scheduled
- * @n: NAPI context
- *
- * Test if NAPI routine is already running, and if not mark
- * it as running. This is used as a condition variable to
- * insure only one NAPI poll instance runs. We also make
- * sure there is no pending NAPI disable.
- */
-static inline bool napi_schedule_prep(struct napi_struct *n)
-{
- return !napi_disable_pending(n) &&
- !test_and_set_bit(NAPI_STATE_SCHED, &n->state);
-}
+bool napi_schedule_prep(struct napi_struct *n);
/**
* napi_schedule - schedule NAPI poll
diff --git a/net/core/dev.c b/net/core/dev.c
index 304f2deae5f9897e60a79ed8b69d6ef208295ded..afcab3670aac18a9c193bf5a09e36e3dc9d0d63c 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -4884,6 +4884,32 @@ void __napi_schedule(struct napi_struct *n)
EXPORT_SYMBOL(__napi_schedule);
/**
+ * napi_schedule_prep - check if napi can be scheduled
+ * @n: napi context
+ *
+ * Test if NAPI routine is already running, and if not mark
+ * it as running. This is used as a condition variable to
+ * insure only one NAPI poll instance runs. We also make
+ * sure there is no pending NAPI disable.
+ */
+bool napi_schedule_prep(struct napi_struct *n)
+{
+ unsigned long val, new;
+
+ do {
+ val = READ_ONCE(n->state);
+ if (unlikely(val & NAPIF_STATE_DISABLE))
+ return false;
+ new = val | NAPIF_STATE_SCHED;
+ if (unlikely(val & NAPIF_STATE_SCHED))
+ new |= NAPIF_STATE_MISSED;
+ } while (cmpxchg(&n->state, val, new) != val);
+
+ return !(val & NAPIF_STATE_SCHED);
+}
+EXPORT_SYMBOL(napi_schedule_prep);
+
+/**
* __napi_schedule_irqoff - schedule for receive
* @n: entry to schedule
*
@@ -4897,7 +4923,7 @@ EXPORT_SYMBOL(__napi_schedule_irqoff);
bool napi_complete_done(struct napi_struct *n, int work_done)
{
- unsigned long flags;
+ unsigned long flags, val, new;
/*
* 1) Don't let napi dequeue from the cpu poll list
@@ -4927,7 +4953,21 @@ bool napi_complete_done(struct napi_struct *n, int work_done)
list_del_init(&n->poll_list);
local_irq_restore(flags);
}
- WARN_ON_ONCE(!test_and_clear_bit(NAPI_STATE_SCHED, &n->state));
+
+ do {
+ val = READ_ONCE(n->state);
+
+ WARN_ON_ONCE(!(val & NAPIF_STATE_SCHED));
+
+ new = val & ~(NAPIF_STATE_MISSED | NAPIF_STATE_SCHED);
+
+ if (unlikely(val & NAPIF_STATE_MISSED))
+ new |= NAPIF_STATE_SCHED;
+ } while (cmpxchg(&n->state, val, new) != val);
+
+ if (unlikely(val & NAPIF_STATE_MISSED))
+ __napi_schedule(n);
+
return true;
}
EXPORT_SYMBOL(napi_complete_done);
@@ -4953,6 +4993,16 @@ static void busy_poll_stop(struct napi_struct *napi, void *have_poll_lock)
{
int rc;
+ /* Busy polling means there is a high chance device driver hard irq
+ * could not grab NAPI_STATE_SCHED, and that NAPI_STATE_MISSED was
+ * set in napi_schedule_prep().
+ * Since we are about to call napi->poll() once more, we can safely
+ * clear NAPI_STATE_MISSED.
+ *
+ * Note: x86 could use a single "lock and ..." instruction
+ * to perform these two clear_bit()
+ */
+ clear_bit(NAPI_STATE_MISSED, &napi->state);
clear_bit(NAPI_STATE_IN_BUSY_POLL, &napi->state);
local_bh_disable();
@@ -5088,8 +5138,13 @@ static enum hrtimer_restart napi_watchdog(struct hrtimer *timer)
struct napi_struct *napi;
napi = container_of(timer, struct napi_struct, timer);
- if (napi->gro_list)
- napi_schedule_irqoff(napi);
+
+ /* Note : we use a relaxed variant of napi_schedule_prep() not setting
+ * NAPI_STATE_MISSED, since we do not react to a device IRQ.
+ */
+ if (napi->gro_list && !napi_disable_pending(napi) &&
+ !test_and_set_bit(NAPI_STATE_SCHED, &napi->state))
+ __napi_schedule_irqoff(napi);
return HRTIMER_NORESTART;
}
* Re: [PATCH v3 net] net: solve a NAPI race
2017-02-27 20:18 ` [PATCH v3 " Eric Dumazet
@ 2017-02-27 22:14 ` Stephen Hemminger
2017-02-27 22:35 ` Eric Dumazet
2017-02-28 16:17 ` Stephen Hemminger
2017-02-28 18:34 ` [PATCH v4 " Eric Dumazet
2 siblings, 1 reply; 34+ messages in thread
From: Stephen Hemminger @ 2017-02-27 22:14 UTC (permalink / raw)
To: Eric Dumazet; +Cc: David Miller, netdev
On Mon, 27 Feb 2017 12:18:31 -0800
Eric Dumazet <eric.dumazet@gmail.com> wrote:
> thread 1                                 thread 2 (could be on same cpu)
>
> // busy polling or napi_watchdog()
> napi_schedule();
> ...
> napi->poll()
>
> device polling:
> read 2 packets from ring buffer
>                                          Additional 3rd packet is available.
>                                          device hard irq
>
>                                          // does nothing because NAPI_STATE_SCHED
>                                          // bit is owned by thread 1
>                                          napi_schedule();
>
> napi_complete_done(napi, 2);
> rearm_irq();
The original design (as Davem mentioned) was that IRQs must be disabled
during device polling. If that were true, then the race above
would be impossible. Also, NAPI assumes interrupts are level-triggered.
* Re: [PATCH v3 net] net: solve a NAPI race
2017-02-27 22:14 ` Stephen Hemminger
@ 2017-02-27 22:35 ` Eric Dumazet
2017-02-27 22:44 ` Stephen Hemminger
2017-02-28 10:14 ` David Laight
0 siblings, 2 replies; 34+ messages in thread
From: Eric Dumazet @ 2017-02-27 22:35 UTC (permalink / raw)
To: Stephen Hemminger; +Cc: David Miller, netdev
On Mon, 2017-02-27 at 14:14 -0800, Stephen Hemminger wrote:
> The original design (as Davem mentioned) was that IRQ's must be disabled
> during device polling. If that was true, then the race above
> would be impossible.
I would love to see an alternative patch.
* Re: [PATCH v3 net] net: solve a NAPI race
2017-02-27 22:35 ` Eric Dumazet
@ 2017-02-27 22:44 ` Stephen Hemminger
2017-02-27 22:48 ` David Miller
2017-02-28 10:14 ` David Laight
1 sibling, 1 reply; 34+ messages in thread
From: Stephen Hemminger @ 2017-02-27 22:44 UTC (permalink / raw)
To: Eric Dumazet; +Cc: David Miller, netdev
On Mon, 27 Feb 2017 14:35:17 -0800
Eric Dumazet <eric.dumazet@gmail.com> wrote:
> On Mon, 2017-02-27 at 14:14 -0800, Stephen Hemminger wrote:
>
> > The original design (as Davem mentioned) was that IRQ's must be disabled
> > during device polling. If that was true, then the race above
> > would be impossible.
>
> I would love to see an alternative patch.
Turn off busy poll?
The poll stuff runs the risk of breaking more things.
* Re: [PATCH v3 net] net: solve a NAPI race
2017-02-27 22:44 ` Stephen Hemminger
@ 2017-02-27 22:48 ` David Miller
2017-02-27 23:23 ` Stephen Hemminger
0 siblings, 1 reply; 34+ messages in thread
From: David Miller @ 2017-02-27 22:48 UTC (permalink / raw)
To: stephen; +Cc: eric.dumazet, netdev
From: Stephen Hemminger <stephen@networkplumber.org>
Date: Mon, 27 Feb 2017 14:44:55 -0800
> On Mon, 27 Feb 2017 14:35:17 -0800
> Eric Dumazet <eric.dumazet@gmail.com> wrote:
>
>> On Mon, 2017-02-27 at 14:14 -0800, Stephen Hemminger wrote:
>>
>> > The original design (as Davem mentioned) was that IRQ's must be disabled
>> > during device polling. If that was true, then the race above
>> > would be impossible.
>>
>> I would love to see an alternative patch.
>
> Turn off busy poll?
> The poll stuff runs risk of breaking more things.
Eric is exactly trying to make busy poll even more prominent in
the stack, not less prominent.
It's an important component of some performance improvements he is
working on.
* Re: [PATCH v3 net] net: solve a NAPI race
2017-02-27 22:48 ` David Miller
@ 2017-02-27 23:23 ` Stephen Hemminger
0 siblings, 0 replies; 34+ messages in thread
From: Stephen Hemminger @ 2017-02-27 23:23 UTC (permalink / raw)
To: David Miller; +Cc: eric.dumazet, netdev
On Mon, 27 Feb 2017 17:48:54 -0500 (EST)
David Miller <davem@davemloft.net> wrote:
> From: Stephen Hemminger <stephen@networkplumber.org>
> Date: Mon, 27 Feb 2017 14:44:55 -0800
>
> > On Mon, 27 Feb 2017 14:35:17 -0800
> > Eric Dumazet <eric.dumazet@gmail.com> wrote:
> >
> >> On Mon, 2017-02-27 at 14:14 -0800, Stephen Hemminger wrote:
> >>
> >> > The original design (as Davem mentioned) was that IRQ's must be disabled
> >> > during device polling. If that was true, then the race above
> >> > would be impossible.
> >>
> >> I would love to see an alternative patch.
> >
> > Turn off busy poll?
> > The poll stuff runs risk of breaking more things.
>
> Eric is exactly trying to make busy poll even more prominent in
> the stack, not less prominent.
>
> It's an important component of some performance improvements he is
> working on.
Maybe making IRQ control part of the network device model
(instead of a side effect left to the device driver to handle) would
be less problematic.
Really, I'm just shooting in the dark because I don't have any of the
problematic hardware to play with.
* RE: [PATCH v3 net] net: solve a NAPI race
2017-02-27 22:35 ` Eric Dumazet
2017-02-27 22:44 ` Stephen Hemminger
@ 2017-02-28 10:14 ` David Laight
2017-02-28 13:04 ` Eric Dumazet
2017-02-28 13:20 ` Eric Dumazet
1 sibling, 2 replies; 34+ messages in thread
From: David Laight @ 2017-02-28 10:14 UTC (permalink / raw)
To: 'Eric Dumazet', Stephen Hemminger; +Cc: David Miller, netdev
From: Eric Dumazet
> Sent: 27 February 2017 22:35
> On Mon, 2017-02-27 at 14:14 -0800, Stephen Hemminger wrote:
>
> > The original design (as Davem mentioned) was that IRQ's must be disabled
> > during device polling. If that was true, then the race above
> > would be impossible.
>
> I would love to see an alternative patch.
Can you test for 'receive data available' after telling the NAPI
logic that you've finished?
You'd then need to force a reschedule.
I think your proposed patch will do a reschedule if any packet arrives
during the receive processing, not just when one arrives right at the end.
You might want to 'unset' the reschedule flag before each check of the
receive ring.
I also wonder about the cost of processing the MSI-X (I guess) interrupts
compared to the cost of posted PCIe writes to disable and/or mask the
interrupt generation.
Clearly you don't want to do PCIe reads.
David
* Re: [PATCH v3 net] net: solve a NAPI race
2017-02-28 10:14 ` David Laight
@ 2017-02-28 13:04 ` Eric Dumazet
2017-02-28 13:20 ` Eric Dumazet
1 sibling, 0 replies; 34+ messages in thread
From: Eric Dumazet @ 2017-02-28 13:04 UTC (permalink / raw)
To: David Laight; +Cc: Stephen Hemminger, David Miller, netdev
On Tue, 2017-02-28 at 10:14 +0000, David Laight wrote:
> From: Eric Dumazet
> > Sent: 27 February 2017 22:35
> > On Mon, 2017-02-27 at 14:14 -0800, Stephen Hemminger wrote:
> >
> > > The original design (as Davem mentioned) was that IRQ's must be disabled
> > > during device polling. If that was true, then the race above
> > > would be impossible.
> >
> > I would love to see an alternative patch.
>
> Can you test for 'receive data available' after telling the NAPI
> logic that you've finished?
> You'd then need to force a reschedule.
>
> I think your proposed patch will do a reschedule if any packet arrives
> during the receive processing, not just when one arrives right at the end.
> You might want to 'unset' the reschedule flag before each check of the
> receive ring.
>
> I also wonder about the cost of processing the MSI-X (I guess) interrupts
> compared to the cost of posted PCIe writes to disable and/or mask the
> interrupt generation.
> Clearly you don't want to do PCIe reads.
Have you seen the mlx4 patch I provided?
Besides, I did not want to review 100+ NAPI drivers and provide patches
for all of them.
This generic solution is basically free. Same number of atomic
operations.
Given it took more than 2 years to even spot the bug, I have no idea how
people on netdev expect me to review all drivers. This is crazy.
* Re: [PATCH v3 net] net: solve a NAPI race
2017-02-28 10:14 ` David Laight
2017-02-28 13:04 ` Eric Dumazet
@ 2017-02-28 13:20 ` Eric Dumazet
1 sibling, 0 replies; 34+ messages in thread
From: Eric Dumazet @ 2017-02-28 13:20 UTC (permalink / raw)
To: David Laight; +Cc: Stephen Hemminger, David Miller, netdev
On Tue, 2017-02-28 at 10:14 +0000, David Laight wrote:
> From: Eric Dumazet
> > Sent: 27 February 2017 22:35
> > On Mon, 2017-02-27 at 14:14 -0800, Stephen Hemminger wrote:
> >
> > > The original design (as Davem mentioned) was that IRQ's must be disabled
> > > during device polling. If that was true, then the race above
> > > would be impossible.
> >
> > I would love to see an alternative patch.
>
> Can you test for 'receive data available' after telling the NAPI
> logic that you've finished?
> You'd then need to force a reschedule.
>
You understand that a 'reschedule' only fires another invocation of
napi->poll() right away?
Hot cache lines, basically zero cost.
In my stress tests, this happens 0.01 % of the time. The bug is tiny.
(Otherwise we would have spotted it earlier.)
> I think your proposed patch will do a reschedule if any packet arrives
> during the receive processing, not just when one arrives right at the end.
> You might want to 'unset' the reschedule flag before each check of the
> receive ring.
Well, no:
the interrupt has been masked before the napi poll was scheduled.
As David and Stephen say, this condition must not happen, unless you
are really unlucky while doing busy polling, which is an opportunistic
call of napi->poll() while you have idle cycles on your CPU.
>
> I also wonder about the cost of processing the MSI-X (I guess) interrupts
> compared to the cost of posted PCIe writes to disable and/or mask the
> interrupt generation.
> Clearly you don't want to do PCIe reads.
Seems irrelevant to the bug we are discussing.
* Re: [PATCH v3 net] net: solve a NAPI race
2017-02-27 20:18 ` [PATCH v3 " Eric Dumazet
2017-02-27 22:14 ` Stephen Hemminger
@ 2017-02-28 16:17 ` Stephen Hemminger
2017-02-28 16:57 ` Eric Dumazet
2017-02-28 18:34 ` [PATCH v4 " Eric Dumazet
2 siblings, 1 reply; 34+ messages in thread
From: Stephen Hemminger @ 2017-02-28 16:17 UTC (permalink / raw)
To: Eric Dumazet; +Cc: David Miller, netdev
On Mon, 27 Feb 2017 12:18:31 -0800
Eric Dumazet <eric.dumazet@gmail.com> wrote:
> This can happen with busy polling users, or if gro_flush_timeout is
> used. But some other uses of napi_schedule() in drivers can cause this
> as well.
Where were IRQs re-enabled?
> thread 1                                 thread 2 (could be on same cpu)
>
> // busy polling or napi_watchdog()
> napi_schedule();
> ...
> napi->poll()
>
> device polling:
> read 2 packets from ring buffer
>                                          Additional 3rd packet is available.
>                                          device hard irq
>
>                                          // does nothing because NAPI_STATE_SCHED
>                                          // bit is owned by thread 1
>                                          napi_schedule();
>
> napi_complete_done(napi, 2);
> rearm_irq();
Maybe it is just as simple as using irqsave/irqrestore in the driver.
* [PATCH v4 net] net: solve a NAPI race
2017-02-27 20:18 ` [PATCH v3 " Eric Dumazet
2017-02-27 22:14 ` Stephen Hemminger
2017-02-28 16:17 ` Stephen Hemminger
@ 2017-02-28 18:34 ` Eric Dumazet
2017-03-01 17:53 ` David Miller
2 siblings, 1 reply; 34+ messages in thread
From: Eric Dumazet @ 2017-02-28 18:34 UTC (permalink / raw)
To: David Miller, Alexander Duyck; +Cc: netdev
From: Eric Dumazet <edumazet@google.com>
While playing with mlx4 hardware timestamping of RX packets, I found
that some packets were received by the TCP stack with a ~200 ms delay...
Since the timestamp was provided by the NIC, and my probe was added
in tcp_v4_rcv() while in the BH handler, I was confident it was not
a sender issue, or a drop in the network.
This would happen with a very low probability, but was hurting RPC
workloads.
A NAPI driver normally arms the IRQ after the napi_complete_done(),
after NAPI_STATE_SCHED is cleared, so that the hard irq handler can grab
it.
The problem is that if another point in the stack grabs the
NAPI_STATE_SCHED bit while IRQs are not disabled, an IRQ might later
fire and find this bit set, right before napi_complete_done() clears it.
This can happen with busy polling users, or if gro_flush_timeout is
used. But some other uses of napi_schedule() in drivers can cause this
as well.
thread 1                                 thread 2 (could be on same cpu, or not)

// busy polling or napi_watchdog()
napi_schedule();
...
napi->poll()

device polling:
read 2 packets from ring buffer
                                         Additional 3rd packet is available.
                                         device hard irq
                                         // does nothing because NAPI_STATE_SCHED bit
                                         // is owned by thread 1
                                         napi_schedule();

napi_complete_done(napi, 2);
rearm_irq();
Note that rearm_irq() will not force the device to send an additional
IRQ for the packet it already signaled (the 3rd packet in my example).

This patch adds a new NAPI_STATE_MISSED bit that napi_schedule_prep()
can set if it could not grab NAPI_STATE_SCHED.
Then napi_complete_done() properly reschedules the napi to make sure
we do not miss something.
Since we manipulate multiple bits at once, use cmpxchg() like in
sk_busy_loop() to provide proper transactions.
In v2, I changed napi_watchdog() to use a relaxed variant of
napi_schedule_prep() : No need to set NAPI_STATE_MISSED from this point.
In v3, I added more details in the changelog and cleared
NAPI_STATE_MISSED in busy_poll_stop().
In v4, I added the ideas given by Alexander Duyck in his v3 review.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Alexander Duyck <alexander.duyck@gmail.com>
---
include/linux/netdevice.h | 29 ++++---------
net/core/dev.c | 76 ++++++++++++++++++++++++++++++++++--
2 files changed, 81 insertions(+), 24 deletions(-)
diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
index f40f0ab3847a8caaf46bd4d5f224c65014f501cc..97456b2539e46d6232dda804f6a434db6fd7134f 100644
--- a/include/linux/netdevice.h
+++ b/include/linux/netdevice.h
@@ -330,6 +330,7 @@ struct napi_struct {
enum {
NAPI_STATE_SCHED, /* Poll is scheduled */
+ NAPI_STATE_MISSED, /* reschedule a napi */
NAPI_STATE_DISABLE, /* Disable pending */
NAPI_STATE_NPSVC, /* Netpoll - don't dequeue from poll_list */
NAPI_STATE_HASHED, /* In NAPI hash (busy polling possible) */
@@ -338,12 +339,13 @@ enum {
};
enum {
- NAPIF_STATE_SCHED = (1UL << NAPI_STATE_SCHED),
- NAPIF_STATE_DISABLE = (1UL << NAPI_STATE_DISABLE),
- NAPIF_STATE_NPSVC = (1UL << NAPI_STATE_NPSVC),
- NAPIF_STATE_HASHED = (1UL << NAPI_STATE_HASHED),
- NAPIF_STATE_NO_BUSY_POLL = (1UL << NAPI_STATE_NO_BUSY_POLL),
- NAPIF_STATE_IN_BUSY_POLL = (1UL << NAPI_STATE_IN_BUSY_POLL),
+ NAPIF_STATE_SCHED = BIT(NAPI_STATE_SCHED),
+ NAPIF_STATE_MISSED = BIT(NAPI_STATE_MISSED),
+ NAPIF_STATE_DISABLE = BIT(NAPI_STATE_DISABLE),
+ NAPIF_STATE_NPSVC = BIT(NAPI_STATE_NPSVC),
+ NAPIF_STATE_HASHED = BIT(NAPI_STATE_HASHED),
+ NAPIF_STATE_NO_BUSY_POLL = BIT(NAPI_STATE_NO_BUSY_POLL),
+ NAPIF_STATE_IN_BUSY_POLL = BIT(NAPI_STATE_IN_BUSY_POLL),
};
enum gro_result {
@@ -414,20 +416,7 @@ static inline bool napi_disable_pending(struct napi_struct *n)
return test_bit(NAPI_STATE_DISABLE, &n->state);
}
-/**
- * napi_schedule_prep - check if NAPI can be scheduled
- * @n: NAPI context
- *
- * Test if NAPI routine is already running, and if not mark
- * it as running. This is used as a condition variable to
- * insure only one NAPI poll instance runs. We also make
- * sure there is no pending NAPI disable.
- */
-static inline bool napi_schedule_prep(struct napi_struct *n)
-{
- return !napi_disable_pending(n) &&
- !test_and_set_bit(NAPI_STATE_SCHED, &n->state);
-}
+bool napi_schedule_prep(struct napi_struct *n);
/**
* napi_schedule - schedule NAPI poll
diff --git a/net/core/dev.c b/net/core/dev.c
index 304f2deae5f9897e60a79ed8b69d6ef208295ded..ab4fe5304834e43790fa023d99d7fd1844f8f728 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -4884,6 +4884,39 @@ void __napi_schedule(struct napi_struct *n)
EXPORT_SYMBOL(__napi_schedule);
/**
+ * napi_schedule_prep - check if napi can be scheduled
+ * @n: napi context
+ *
+ * Test if NAPI routine is already running, and if not mark
+ * it as running. This is used as a condition variable to
+ * insure only one NAPI poll instance runs. We also make
+ * sure there is no pending NAPI disable.
+ */
+bool napi_schedule_prep(struct napi_struct *n)
+{
+ unsigned long val, new;
+
+ do {
+ val = READ_ONCE(n->state);
+ if (unlikely(val & NAPIF_STATE_DISABLE))
+ return false;
+ new = val | NAPIF_STATE_SCHED;
+
+ /* Sets STATE_MISSED bit if STATE_SCHED was already set
+ * This was suggested by Alexander Duyck, as compiler
+ * emits better code than :
+ * if (val & NAPIF_STATE_SCHED)
+ * new |= NAPIF_STATE_MISSED;
+ */
+ new |= (val & NAPIF_STATE_SCHED) / NAPIF_STATE_SCHED *
+ NAPIF_STATE_MISSED;
+ } while (cmpxchg(&n->state, val, new) != val);
+
+ return !(val & NAPIF_STATE_SCHED);
+}
+EXPORT_SYMBOL(napi_schedule_prep);
+
+/**
* __napi_schedule_irqoff - schedule for receive
* @n: entry to schedule
*
@@ -4897,7 +4930,7 @@ EXPORT_SYMBOL(__napi_schedule_irqoff);
bool napi_complete_done(struct napi_struct *n, int work_done)
{
- unsigned long flags;
+ unsigned long flags, val, new;
/*
* 1) Don't let napi dequeue from the cpu poll list
@@ -4927,7 +4960,27 @@ bool napi_complete_done(struct napi_struct *n, int work_done)
list_del_init(&n->poll_list);
local_irq_restore(flags);
}
- WARN_ON_ONCE(!test_and_clear_bit(NAPI_STATE_SCHED, &n->state));
+
+ do {
+ val = READ_ONCE(n->state);
+
+ WARN_ON_ONCE(!(val & NAPIF_STATE_SCHED));
+
+ new = val & ~(NAPIF_STATE_MISSED | NAPIF_STATE_SCHED);
+
+ /* If STATE_MISSED was set, leave STATE_SCHED set,
+ * because we will call napi->poll() one more time.
+ * This C code was suggested by Alexander Duyck to help gcc.
+ */
+ new |= (val & NAPIF_STATE_MISSED) / NAPIF_STATE_MISSED *
+ NAPIF_STATE_SCHED;
+ } while (cmpxchg(&n->state, val, new) != val);
+
+ if (unlikely(val & NAPIF_STATE_MISSED)) {
+ __napi_schedule(n);
+ return false;
+ }
+
return true;
}
EXPORT_SYMBOL(napi_complete_done);
@@ -4953,6 +5006,16 @@ static void busy_poll_stop(struct napi_struct *napi, void *have_poll_lock)
{
int rc;
+ /* Busy polling means there is a high chance device driver hard irq
+ * could not grab NAPI_STATE_SCHED, and that NAPI_STATE_MISSED was
+ * set in napi_schedule_prep().
+ * Since we are about to call napi->poll() once more, we can safely
+ * clear NAPI_STATE_MISSED.
+ *
+ * Note: x86 could use a single "lock and ..." instruction
+ * to perform these two clear_bit()
+ */
+ clear_bit(NAPI_STATE_MISSED, &napi->state);
clear_bit(NAPI_STATE_IN_BUSY_POLL, &napi->state);
local_bh_disable();
@@ -5088,8 +5151,13 @@ static enum hrtimer_restart napi_watchdog(struct hrtimer *timer)
struct napi_struct *napi;
napi = container_of(timer, struct napi_struct, timer);
- if (napi->gro_list)
- napi_schedule_irqoff(napi);
+
+ /* Note : we use a relaxed variant of napi_schedule_prep() not setting
+ * NAPI_STATE_MISSED, since we do not react to a device IRQ.
+ */
+ if (napi->gro_list && !napi_disable_pending(napi) &&
+ !test_and_set_bit(NAPI_STATE_SCHED, &napi->state))
+ __napi_schedule_irqoff(napi);
return HRTIMER_NORESTART;
}
* Re: [PATCH v4 net] net: solve a NAPI race
2017-02-28 18:34 ` [PATCH v4 " Eric Dumazet
@ 2017-03-01 17:53 ` David Miller
0 siblings, 0 replies; 34+ messages in thread
From: David Miller @ 2017-03-01 17:53 UTC (permalink / raw)
To: eric.dumazet; +Cc: alexander.duyck, netdev
From: Eric Dumazet <eric.dumazet@gmail.com>
Date: Tue, 28 Feb 2017 10:34:50 -0800
> From: Eric Dumazet <edumazet@google.com>
>
> While playing with mlx4 hardware timestamping of RX packets, I found
> that some packets were received by TCP stack with a ~200 ms delay...
>
> Since the timestamp was provided by the NIC, and my probe was added
> in tcp_v4_rcv() while in BH handler, I was confident it was not
> a sender issue, or a drop in the network.
>
> This would happen with a very low probability, but hurting RPC
> workloads.
>
> A NAPI driver normally arms the IRQ after the napi_complete_done(),
> after NAPI_STATE_SCHED is cleared, so that the hard irq handler can grab
> it.
>
> Problem is that if another point in the stack grabs the NAPI_STATE_SCHED bit
> while IRQs are not disabled, we might later have an IRQ firing and
> finding this bit set, right before napi_complete_done() clears it.
>
> This can happen with busy polling users, or if gro_flush_timeout is
> used. But some other uses of napi_schedule() in drivers can cause this
> as well.
...
> Signed-off-by: Eric Dumazet <edumazet@google.com>
Applied, thanks Eric.
* Re: [PATCH v2 net] net: solve a NAPI race
2017-02-27 14:21 ` [PATCH v2 " Eric Dumazet
2017-02-27 16:19 ` David Miller
2017-02-27 20:18 ` [PATCH v3 " Eric Dumazet
@ 2017-02-28 17:20 ` Alexander Duyck
2017-02-28 17:47 ` Eric Dumazet
2017-03-01 10:41 ` David Laight
2 siblings, 2 replies; 34+ messages in thread
From: Alexander Duyck @ 2017-02-28 17:20 UTC (permalink / raw)
To: Eric Dumazet; +Cc: David Miller, netdev, Tariq Toukan, Saeed Mahameed
On Mon, Feb 27, 2017 at 6:21 AM, Eric Dumazet <eric.dumazet@gmail.com> wrote:
> From: Eric Dumazet <edumazet@google.com>
>
> While playing with mlx4 hardware timestamping of RX packets, I found
> that some packets were received by TCP stack with a ~200 ms delay...
>
> Since the timestamp was provided by the NIC, and my probe was added
> in tcp_v4_rcv() while in BH handler, I was confident it was not
> a sender issue, or a drop in the network.
>
> This would happen with a very low probability, but hurting RPC
> workloads.
>
> A NAPI driver normally arms the IRQ after the napi_complete_done(),
> after NAPI_STATE_SCHED is cleared, so that the hard irq handler can grab
> it.
>
> Problem is that if another point in the stack grabs the NAPI_STATE_SCHED bit
> while IRQs are not disabled, we might later have an IRQ firing and
> finding this bit set, right before napi_complete_done() clears it.
>
> This can happen with busy polling users, or if gro_flush_timeout is
> used. But some other uses of napi_schedule() in drivers can cause this
> as well.
>
> This patch adds a new NAPI_STATE_MISSED bit, that napi_schedule_prep()
> can set if it could not grab NAPI_STATE_SCHED
>
> Then napi_complete_done() properly reschedules the napi to make sure
> we do not miss something.
>
> Since we manipulate multiple bits at once, use cmpxchg() like in
> sk_busy_loop() to provide proper transactions.
>
> In v2, I changed napi_watchdog() to use a relaxed variant of
> napi_schedule_prep() : No need to set NAPI_STATE_MISSED from this point.
>
> Signed-off-by: Eric Dumazet <edumazet@google.com>
> ---
> include/linux/netdevice.h | 29 ++++++-------------
> net/core/dev.c | 53 +++++++++++++++++++++++++++++++++---
> 2 files changed, 58 insertions(+), 24 deletions(-)
>
> diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
> index f40f0ab3847a8caaf46bd4d5f224c65014f501cc..97456b2539e46d6232dda804f6a434db6fd7134f 100644
> --- a/include/linux/netdevice.h
> +++ b/include/linux/netdevice.h
> @@ -330,6 +330,7 @@ struct napi_struct {
>
> enum {
> NAPI_STATE_SCHED, /* Poll is scheduled */
> + NAPI_STATE_MISSED, /* reschedule a napi */
> NAPI_STATE_DISABLE, /* Disable pending */
> NAPI_STATE_NPSVC, /* Netpoll - don't dequeue from poll_list */
> NAPI_STATE_HASHED, /* In NAPI hash (busy polling possible) */
> @@ -338,12 +339,13 @@ enum {
> };
>
> enum {
> - NAPIF_STATE_SCHED = (1UL << NAPI_STATE_SCHED),
> - NAPIF_STATE_DISABLE = (1UL << NAPI_STATE_DISABLE),
> - NAPIF_STATE_NPSVC = (1UL << NAPI_STATE_NPSVC),
> - NAPIF_STATE_HASHED = (1UL << NAPI_STATE_HASHED),
> - NAPIF_STATE_NO_BUSY_POLL = (1UL << NAPI_STATE_NO_BUSY_POLL),
> - NAPIF_STATE_IN_BUSY_POLL = (1UL << NAPI_STATE_IN_BUSY_POLL),
> + NAPIF_STATE_SCHED = BIT(NAPI_STATE_SCHED),
> + NAPIF_STATE_MISSED = BIT(NAPI_STATE_MISSED),
> + NAPIF_STATE_DISABLE = BIT(NAPI_STATE_DISABLE),
> + NAPIF_STATE_NPSVC = BIT(NAPI_STATE_NPSVC),
> + NAPIF_STATE_HASHED = BIT(NAPI_STATE_HASHED),
> + NAPIF_STATE_NO_BUSY_POLL = BIT(NAPI_STATE_NO_BUSY_POLL),
> + NAPIF_STATE_IN_BUSY_POLL = BIT(NAPI_STATE_IN_BUSY_POLL),
> };
>
> enum gro_result {
> @@ -414,20 +416,7 @@ static inline bool napi_disable_pending(struct napi_struct *n)
> return test_bit(NAPI_STATE_DISABLE, &n->state);
> }
>
> -/**
> - * napi_schedule_prep - check if NAPI can be scheduled
> - * @n: NAPI context
> - *
> - * Test if NAPI routine is already running, and if not mark
> - * it as running. This is used as a condition variable to
> - * insure only one NAPI poll instance runs. We also make
> - * sure there is no pending NAPI disable.
> - */
> -static inline bool napi_schedule_prep(struct napi_struct *n)
> -{
> - return !napi_disable_pending(n) &&
> - !test_and_set_bit(NAPI_STATE_SCHED, &n->state);
> -}
> +bool napi_schedule_prep(struct napi_struct *n);
>
> /**
> * napi_schedule - schedule NAPI poll
> diff --git a/net/core/dev.c b/net/core/dev.c
> index 304f2deae5f9897e60a79ed8b69d6ef208295ded..edeb916487015f279036ecf7ff5d9096dff365d3 100644
> --- a/net/core/dev.c
> +++ b/net/core/dev.c
> @@ -4884,6 +4884,32 @@ void __napi_schedule(struct napi_struct *n)
> EXPORT_SYMBOL(__napi_schedule);
>
> /**
> + * napi_schedule_prep - check if napi can be scheduled
> + * @n: napi context
> + *
> + * Test if NAPI routine is already running, and if not mark
> + * it as running. This is used as a condition variable
> + * insure only one NAPI poll instance runs. We also make
> + * sure there is no pending NAPI disable.
> + */
> +bool napi_schedule_prep(struct napi_struct *n)
> +{
> + unsigned long val, new;
> +
> + do {
> + val = READ_ONCE(n->state);
> + if (unlikely(val & NAPIF_STATE_DISABLE))
> + return false;
> + new = val | NAPIF_STATE_SCHED;
> + if (unlikely(val & NAPIF_STATE_SCHED))
> + new |= NAPIF_STATE_MISSED;
You might want to consider just using a combination of AND, divide,
multiply, and OR to avoid adding any conditional branches in this
code path. Basically the logic would look like:
new = val | NAPIF_STATE_SCHED;
new |= (val & NAPIF_STATE_SCHED) / NAPIF_STATE_SCHED * NAPIF_STATE_MISSED;
In assembler that all ends up getting translated out to AND, SHL, OR.
You avoid the branching, or MOV/OR/TEST/CMOV type code you would end
up with otherwise.
> + } while (cmpxchg(&n->state, val, new) != val);
> +
> + return !(val & NAPIF_STATE_SCHED);
> +}
> +EXPORT_SYMBOL(napi_schedule_prep);
> +
> +/**
> * __napi_schedule_irqoff - schedule for receive
> * @n: entry to schedule
> *
> @@ -4897,7 +4923,7 @@ EXPORT_SYMBOL(__napi_schedule_irqoff);
>
> bool napi_complete_done(struct napi_struct *n, int work_done)
> {
> - unsigned long flags;
> + unsigned long flags, val, new;
>
> /*
> * 1) Don't let napi dequeue from the cpu poll list
> @@ -4927,7 +4953,21 @@ bool napi_complete_done(struct napi_struct *n, int work_done)
> list_del_init(&n->poll_list);
> local_irq_restore(flags);
> }
> - WARN_ON_ONCE(!test_and_clear_bit(NAPI_STATE_SCHED, &n->state));
> +
> + do {
> + val = READ_ONCE(n->state);
> +
> + WARN_ON_ONCE(!(val & NAPIF_STATE_SCHED));
> +
> + new = val & ~(NAPIF_STATE_MISSED | NAPIF_STATE_SCHED);
> +
> + if (unlikely(val & NAPIF_STATE_MISSED))
> + new |= NAPIF_STATE_SCHED;
Same kind of thing here. You could probably just get away with something like:
new = val & ~(NAPIF_STATE_MISSED | NAPIF_STATE_SCHED);
new |= (val & NAPIF_STATE_MISSED) / NAPIF_STATE_MISSED * NAPIF_STATE_SCHED;
> + } while (cmpxchg(&n->state, val, new) != val);
> +
> + if (unlikely(val & NAPIF_STATE_MISSED))
> + __napi_schedule(n);
> +
> return true;
> }
If you rescheduled the napi, should you really be returning true? Seems
like you should be returning "!(val & NAPIF_STATE_MISSED)" to try to
avoid letting this occur again.
> EXPORT_SYMBOL(napi_complete_done);
> @@ -5088,8 +5128,13 @@ static enum hrtimer_restart napi_watchdog(struct hrtimer *timer)
> struct napi_struct *napi;
>
> napi = container_of(timer, struct napi_struct, timer);
> - if (napi->gro_list)
> - napi_schedule_irqoff(napi);
> +
> + /* Note : we use a relaxed variant of napi_schedule_prep() not setting
> + * NAPI_STATE_MISSED, since we do not react to a device IRQ.
> + */
> + if (napi->gro_list && !napi_disable_pending(napi) &&
> + !test_and_set_bit(NAPI_STATE_SCHED, &napi->state))
> + __napi_schedule_irqoff(napi);
>
> return HRTIMER_NORESTART;
> }
>
>
* Re: [PATCH v2 net] net: solve a NAPI race
2017-02-28 17:20 ` [PATCH v2 " Alexander Duyck
@ 2017-02-28 17:47 ` Eric Dumazet
2017-03-01 10:41 ` David Laight
1 sibling, 0 replies; 34+ messages in thread
From: Eric Dumazet @ 2017-02-28 17:47 UTC (permalink / raw)
To: Alexander Duyck; +Cc: David Miller, netdev, Tariq Toukan, Saeed Mahameed
On Tue, 2017-02-28 at 09:20 -0800, Alexander Duyck wrote:
> On Mon, Feb 27, 2017 at 6:21 AM, Eric Dumazet <eric.dumazet@gmail.com> wrote:
> > +bool napi_schedule_prep(struct napi_struct *n)
> > +{
> > + unsigned long val, new;
> > +
> > + do {
> > + val = READ_ONCE(n->state);
> > + if (unlikely(val & NAPIF_STATE_DISABLE))
> > + return false;
> > + new = val | NAPIF_STATE_SCHED;
> > + if (unlikely(val & NAPIF_STATE_SCHED))
> > + new |= NAPIF_STATE_MISSED;
>
> You might want to consider just using a combination AND, divide,
> multiply, and OR to avoid having to have any conditional branches
> being added due to this code path. Basically the logic would look
> like:
> new = val | NAPIF_STATE_SCHED;
> new |= (val & NAPIF_STATE_SCHED) / NAPIF_STATE_SCHED * NAPIF_STATE_MISSED;
>
> In assembler that all ends up getting translated out to AND, SHL, OR.
> You avoid the branching, or MOV/OR/TEST/CMOV type code you would end
> up with otherwise.
Sure, I can try to optimize this a bit ;)
> > + } while (cmpxchg(&n->state, val, new) != val);
> > +
> > + if (unlikely(val & NAPIF_STATE_MISSED))
> > + __napi_schedule(n);
> > +
> > return true;
> > }
>
> If you rescheduled napi should you really be returning true? Seems
> like you should be returning "!(val & NAPIF_STATE_MISSED)" to try to
> avoid letting this occur again.
Good idea.
Hmm... you mean that many drivers test napi_complete_done() return
value ?
;)
Thanks !
* RE: [PATCH v2 net] net: solve a NAPI race
2017-02-28 17:20 ` [PATCH v2 " Alexander Duyck
2017-02-28 17:47 ` Eric Dumazet
@ 2017-03-01 10:41 ` David Laight
2017-03-01 16:14 ` Alexander Duyck
1 sibling, 1 reply; 34+ messages in thread
From: David Laight @ 2017-03-01 10:41 UTC (permalink / raw)
To: 'Alexander Duyck', Eric Dumazet
Cc: David Miller, netdev, Tariq Toukan, Saeed Mahameed
From: Alexander Duyck
> Sent: 28 February 2017 17:20
...
> You might want to consider just using a combination AND, divide,
> multiply, and OR to avoid having to have any conditional branches
> being added due to this code path. Basically the logic would look
> like:
> new = val | NAPIF_STATE_SCHED;
> new |= (val & NAPIF_STATE_SCHED) / NAPIF_STATE_SCHED * NAPIF_STATE_MISSED;
>
> In assembler that all ends up getting translated out to AND, SHL, OR.
> You avoid the branching, or MOV/OR/TEST/CMOV type code you would end
> up with otherwise.
It is a shame gcc doesn't contain that optimisation.
It also doesn't even do a good job of (a & b)/b * c since it
always does a shr and a sal (gcc 4.7.3 and 5.4).
Worthy of a #define or static inline.
Something like:
#define BIT_IF(v, a, b) ((b & (b-1)) ? (v & a)/a * b : a > b ? (v & a) / (a/b) : (v & a) * (b/a))
David
* Re: [PATCH v2 net] net: solve a NAPI race
2017-03-01 10:41 ` David Laight
@ 2017-03-01 16:14 ` Alexander Duyck
2017-03-01 17:32 ` Eric Dumazet
0 siblings, 1 reply; 34+ messages in thread
From: Alexander Duyck @ 2017-03-01 16:14 UTC (permalink / raw)
To: David Laight
Cc: Eric Dumazet, David Miller, netdev, Tariq Toukan, Saeed Mahameed
On Wed, Mar 1, 2017 at 2:41 AM, David Laight <David.Laight@aculab.com> wrote:
> From: Alexander Duyck
>> Sent: 28 February 2017 17:20
> ...
>> You might want to consider just using a combination AND, divide,
>> multiply, and OR to avoid having to have any conditional branches
>> being added due to this code path. Basically the logic would look
>> like:
>> new = val | NAPIF_STATE_SCHED;
>> new |= (val & NAPIF_STATE_SCHED) / NAPIF_STATE_SCHED * NAPIF_STATE_MISSED;
>>
>> In assembler that all ends up getting translated out to AND, SHL, OR.
>> You avoid the branching, or MOV/OR/TEST/CMOV type code you would end
>> up with otherwise.
>
> It is a shame gcc doesn't contain that optimisation.
> It also doesn't even make a good job of (a & b)/b * c since it
> always does a shr and a sal (gcc 4.7.3 and 5.4).
What build flags are you using? With -Os or -O2 I have seen it
convert the /b * c into a single shift.
> Worthy of a #define or static inline.
> Something like:
> #define BIT_IF(v, a, b) ((b & (b-1) ? (v & a)/a * b : a > b ? (v & a) / (a/b) : (v & a) * (b/a))
>
> David
Feel free to put together a patch. I use this kind of thing in the
Intel drivers in multiple spots to shift stuff from TX_FLAGS into
descriptor flags.
- Alex
* Re: [PATCH v2 net] net: solve a NAPI race
2017-03-01 16:14 ` Alexander Duyck
@ 2017-03-01 17:32 ` Eric Dumazet
2017-03-02 10:24 ` David Laight
0 siblings, 1 reply; 34+ messages in thread
From: Eric Dumazet @ 2017-03-01 17:32 UTC (permalink / raw)
To: Alexander Duyck
Cc: David Laight, David Miller, netdev, Tariq Toukan, Saeed Mahameed
On Wed, 2017-03-01 at 08:14 -0800, Alexander Duyck wrote:
> What build flags are you using? With -Os or -O2 I have seen it
> convert the /b * c into a single shift.
>
Because b & c are unsigned in our case.
I presume David tried signed integers, which is why gcc does that.
* RE: [PATCH v2 net] net: solve a NAPI race
2017-03-01 17:32 ` Eric Dumazet
@ 2017-03-02 10:24 ` David Laight
0 siblings, 0 replies; 34+ messages in thread
From: David Laight @ 2017-03-02 10:24 UTC (permalink / raw)
To: 'Eric Dumazet', Alexander Duyck
Cc: David Miller, netdev, Tariq Toukan, Saeed Mahameed
From: Eric Dumazet
> Sent: 01 March 2017 17:33
> On Wed, 2017-03-01 at 08:14 -0800, Alexander Duyck wrote:
>
> > What build flags are you using? With -Os or -O2 I have seen it
> > convert the /b * c into a single shift.
> >
>
>
> Because b & c are unsigned in our case.
>
> I presume David tried signed integers, this is why gcc does that.
I was using integer constants but an unsigned variable.
Changing the constants to unsigned makes no difference (they would get
promoted anyway).
After some experiments I can get gcc to generate a single shift if the
variable being tested is 32bits, the constants are 64bits and a 64bit
result is required.
In every other case it does either and-shr-shl or shr-and-shl (all the
shr are logical).
David