[Intel-wired-lan] Question about ixgbe RESET due to lost link


* [Intel-wired-lan] Question about ixgbe RESET due to lost link
@ 2016-12-02  0:42 Ruslan Nikolaev
  2016-12-02  2:13 ` Alexander Duyck
  0 siblings, 1 reply; 7+ messages in thread
From: Ruslan Nikolaev @ 2016-12-02  0:42 UTC (permalink / raw)
  To: intel-wired-lan

While working with ixgbe (PF) and dpdk (VF), I have noticed that sometimes we get "Reset adapter" message "due to lost link with pending Tx work".

The problem is that when handling the VF reset message that arrives through a mailbox (in the corresponding dpdk handler), the link may already be down. Therefore, we are unable to properly reset the device. While looking at the ixgbe code, I have noticed that IXGBE_FLAG2_RESET_REQUESTED (in this case, set in ixgbe_watchdog_flush_tx) is checked in ixgbe_reset_subtask. The latter will only do anything if the link is not already down.
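
A minimal sketch of the flow described above, written as a standalone model rather than the actual ixgbe source. All names prefixed with model_ are hypothetical, and the "already down" check is reduced to a single boolean as an assumption; the real driver's state tracking is more involved.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for IXGBE_FLAG2_RESET_REQUESTED. */
enum { MODEL_FLAG_RESET_REQUESTED = 1u << 0 };

/* Simplified adapter state; not the kernel's struct ixgbe_adapter. */
struct model_adapter {
    unsigned int flags;
    bool link_up;      /* last link state seen by the watchdog */
    bool device_down;  /* assumption: one flag models "already down" */
    int pending_tx;    /* descriptors still queued on the Tx rings */
    int resets_done;   /* number of resets actually performed */
};

/* Models the watchdog path: request a reset when the link is lost
 * while Tx work is still pending. */
static void model_watchdog_flush_tx(struct model_adapter *a)
{
    assert(a != NULL);
    if (!a->link_up && a->pending_tx > 0)
        a->flags |= MODEL_FLAG_RESET_REQUESTED;
}

/* Models the reset subtask: consume the request, bail out if the
 * device is already down, otherwise perform the reset.
 * Returns true only when a reset was performed. */
static bool model_reset_subtask(struct model_adapter *a)
{
    assert(a != NULL);
    if (!(a->flags & MODEL_FLAG_RESET_REQUESTED))
        return false;
    a->flags &= ~MODEL_FLAG_RESET_REQUESTED;

    if (a->device_down)
        return false;

    a->pending_tx = 0;  /* the reset discards the queued Tx work */
    a->resets_done++;
    return true;
}
```

In this model the request flag is cleared whether or not the reset runs, so a request raised while the device is down is simply dropped.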

I guess, my question is why we are setting it when detecting that the link is down. It is going to be down anyway. Can the actual reset take place when the link is up again?

Thank you!


* [Intel-wired-lan] Question about ixgbe RESET due to lost link
  2016-12-02  0:42 [Intel-wired-lan] Question about ixgbe RESET due to lost link Ruslan Nikolaev
@ 2016-12-02  2:13 ` Alexander Duyck
  2016-12-02  2:31   ` Ruslan Nikolaev
  0 siblings, 1 reply; 7+ messages in thread
From: Alexander Duyck @ 2016-12-02  2:13 UTC (permalink / raw)
  To: intel-wired-lan

On Thu, Dec 1, 2016 at 4:42 PM, Ruslan Nikolaev <ruslan@purestorage.com> wrote:
> While working with ixgbe (PF) and dpdk (VF), I have noticed that sometimes
> we get "Reset adapter" message "due to lost link with pending Tx work".
>
> The problem is that when handling the VF reset message that arrives through
> a mailbox (in the corresponding dpdk handler), the link may already be down.
> Therefore, we are unable to properly reset the device. While looking at the
> ixgbe code, I have noticed that IXGBE_FLAG2_RESET_REQUESTED (in this case,
> set in ixgbe_watchdog_flush_tx) is checked in ixgbe_reset_subtask. The
> latter will only do anything if the link is not already down.

Why can't you properly reset the device?  The PF should have already
taken care of resetting the queues when it did the reset itself.  All
that should be left to do is for the VF to reinitialize the queues so
that they are re-enabled after the reset.

> I guess, my question is why we are setting it when detecting that the link
> is down. It is going to be down anyway. Can the actual reset take place when
> the link is up again?
>
> Thank you!

The short answer to this is "no".

What it all comes down to is that we have to flush the Tx queues when
the link goes down to get rid of stale data.  We need to go through
and clean out the Tx rings so that the Tx and Rx FIFOs are cleared and
ready to go when the link comes back up.  We can't reset the part
after link up because by that point the link has already come back up
and the stale data is likely already moving through queues.
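
To illustrate the point about stale data, here is a small standalone model of a Tx ring. It is a sketch under simplifying assumptions (a fixed-size circular buffer of packet ids, hypothetical model_ names), not the ixgbe ring implementation. It shows that anything left in the ring at link-down is what gets transmitted at link-up unless it is flushed first.

```c
#include <assert.h>
#include <stddef.h>

#define MODEL_RING_SIZE 8

/* Hypothetical circular Tx ring holding packet ids. */
struct model_tx_ring {
    int slots[MODEL_RING_SIZE];
    size_t head;  /* next entry to transmit */
    size_t tail;  /* next free slot */
};

/* Number of entries queued but not yet transmitted. */
static size_t model_tx_pending(const struct model_tx_ring *r)
{
    assert(r != NULL);
    return (r->tail + MODEL_RING_SIZE - r->head) % MODEL_RING_SIZE;
}

/* Queue one packet; returns 0 on success, -1 if the ring is full. */
static int model_tx_enqueue(struct model_tx_ring *r, int pkt_id)
{
    if (model_tx_pending(r) == MODEL_RING_SIZE - 1)
        return -1;
    r->slots[r->tail] = pkt_id;
    r->tail = (r->tail + 1) % MODEL_RING_SIZE;
    return 0;
}

/* Link-down handling: drop everything still queued.
 * Returns the number of entries discarded. */
static size_t model_tx_flush(struct model_tx_ring *r)
{
    size_t dropped = model_tx_pending(r);
    r->head = r->tail;
    return dropped;
}

/* Link-up handling: "transmit" whatever is queued.
 * Returns the number of entries sent. */
static size_t model_tx_drain_on_link_up(struct model_tx_ring *r)
{
    size_t sent = 0;
    while (r->head != r->tail) {
        r->head = (r->head + 1) % MODEL_RING_SIZE;
        sent++;
    }
    return sent;
}
```

If model_tx_flush() runs at link-down, model_tx_drain_on_link_up() finds nothing to send; if it does not, the old entries go out as soon as the link returns.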


* [Intel-wired-lan] Question about ixgbe RESET due to lost link
  2016-12-02  2:13 ` Alexander Duyck
@ 2016-12-02  2:31   ` Ruslan Nikolaev
  2016-12-02 21:48     ` Alexander Duyck
  0 siblings, 1 reply; 7+ messages in thread
From: Ruslan Nikolaev @ 2016-12-02  2:31 UTC (permalink / raw)
  To: intel-wired-lan

Thank you for your response! I still have questions below.

> 
>> While working with ixgbe (PF) and dpdk (VF), I have noticed that sometimes
>> we get "Reset adapter" message "due to lost link with pending Tx work".
>> 
>> The problem is that when handling the VF reset message that arrives through
>> a mailbox (in the corresponding dpdk handler), the link may already be down.
>> Therefore, we are unable to properly reset the device. While looking at the
>> ixgbe code, I have noticed that IXGBE_FLAG2_RESET_REQUESTED (in this case,
>> set in ixgbe_watchdog_flush_tx) is checked in ixgbe_reset_subtask. The
>> latter will only do anything if the link is not already down.
> 
> Why can't you properly reset the device?  The PF should have already
> taken care of resetting the queues when it did the reset itself.  All
> that should be left to do is for the VF to reinitialize the queues so
> that they are re-enabled after the reset.
> 

I guess, the problem is that we can stop a device but we are unable to start the device properly (because some registers are unavailable?) Particularly, in the patch for DPDK:
http://dpdk.org/dev/patchwork/patch/14009/

I see error messages "Failed to update link." If I understand correctly, this patch introduced a delay (1000 ms) to make sure that the link is up again. It also checks one register in a busy-wait loop (see comment: "When the PF link is down... VF cannot operate its registers"). But the problem here is that there might be completely arbitrary time between link going down and up again (minutes, hours, etc), so I cannot be sitting in a busy-wait loop like this.


>> I guess, my question is why we are setting it when detecting that the link
>> is down. It is going to be down anyway. Can the actual reset take place when
>> the link is up again?
>> 
>> Thank you!
> 
> The short answer to this is "no".
> 
> What it all comes down to is that we have to flush the Tx queues when
> the link goes down to get rid of stale data.  We need to go through
> and clean out the Tx rings so that the Tx and Rx FIFOs are cleared and
> ready to go when the link comes back up.  We can't reset the part
> after link up because by that point the link has already come back up
> and the stale data is likely already moving through queues.

Ok, I see. What happens if the stale data (Tx) moves through queues and is actually sent? Is that a problem? Why do we need to reset queues? (Sorry if it is a silly question but just trying to understand why we are doing it in the first place.)




* [Intel-wired-lan] Question about ixgbe RESET due to lost link
  2016-12-02  2:31   ` Ruslan Nikolaev
@ 2016-12-02 21:48     ` Alexander Duyck
  2016-12-02 22:15       ` Ruslan Nikolaev
  0 siblings, 1 reply; 7+ messages in thread
From: Alexander Duyck @ 2016-12-02 21:48 UTC (permalink / raw)
  To: intel-wired-lan

On Thu, Dec 1, 2016 at 6:31 PM, Ruslan Nikolaev <ruslan@purestorage.com> wrote:
> Thank you for your response! I still have questions below.
>
>>
>>> While working with ixgbe (PF) and dpdk (VF), I have noticed that sometimes
>>> we get "Reset adapter" message "due to lost link with pending Tx work".
>>>
>>> The problem is that when handling the VF reset message that arrives through
>>> a mailbox (in the corresponding dpdk handler), the link may already be down.
>>> Therefore, we are unable to properly reset the device. While looking at the
>>> ixgbe code, I have noticed that IXGBE_FLAG2_RESET_REQUESTED (in this case,
>>> set in ixgbe_watchdog_flush_tx) is checked in ixgbe_reset_subtask. The
>>> latter will only do anything if the link is not already down.
>>
>> Why can't you properly reset the device?  The PF should have already
>> taken care of resetting the queues when it did the reset itself.  All
>> that should be left to do is for the VF to reinitialize the queues so
>> that they are re-enabled after the reset.
>>
>
> I guess, the problem is that we can stop a device but we are unable to start the device properly (because some registers are unavailable?) Particularly, in the patch for DPDK:
> http://dpdk.org/dev/patchwork/patch/14009/
>
> I see error messages "Failed to update link." If I understand correctly, this patch introduced a delay (1000 ms) to make sure that the link is up again. It also checks one register in a busy-wait loop (see comment: "When the PF link is down... VF cannot operate its registers"). But the problem here is that there might be completely arbitrary time between link going down and up again (minutes, hours, etc), so I cannot be sitting in a busy-wait loop like this.

This doesn't sound right to me.  So DPDK is expecting the link to
always be up?  That isn't always going to be the case.  It seems like
DPDK should figure out a way to enable interrupts and wait for the
mailbox notification that the link has come back up.

>>> I guess, my question is why we are setting it when detecting that the link
>>> is down. It is going to be down anyway. Can the actual reset take place when
>>> the link is up again?
>>>
>>> Thank you!
>>
>> The short answer to this is "no".
>>
>> What it all comes down to is that we have to flush the Tx queues when
>> the link goes down to get rid of stale data.  We need to go through
>> and clean out the Tx rings so that the Tx and Rx FIFOs are cleared and
>> ready to go when the link comes back up.  We can't reset the part
>> after link up because by that point the link has already come back up
>> and the stale data is likely already moving through queues.
>
> Ok, I see. What happens if the stale data (Tx) moves through queues and is actually sent? Is that a problem? Why do we need to reset queues? (Sorry if it is a silly question but just trying to understand why we are doing it in the first place.)
>

There ends up being a few different things that could happen depending
on the hardware.  In some cases it can get as bad as Tx hangs or data
corruptions.  Generally you don't want the driver sitting on the
memory in the Tx rings.  You want it to flush the memory and just wait
until the link has come back before we start queuing packets again.


* [Intel-wired-lan] Question about ixgbe RESET due to lost link
  2016-12-02 21:48     ` Alexander Duyck
@ 2016-12-02 22:15       ` Ruslan Nikolaev
  2016-12-02 22:32         ` Alexander Duyck
  0 siblings, 1 reply; 7+ messages in thread
From: Ruslan Nikolaev @ 2016-12-02 22:15 UTC (permalink / raw)
  To: intel-wired-lan

Thanks! I have one more question below.

> 
> 
>> Thank you for your response! I still have questions below.
>> 
>>> 
>>>> While working with ixgbe (PF) and dpdk (VF), I have noticed that sometimes
>>>> we get "Reset adapter" message "due to lost link with pending Tx work".
>>>> 
>>>> The problem is that when handling the VF reset message that arrives through
>>>> a mailbox (in the corresponding dpdk handler), the link may already be down.
>>>> Therefore, we are unable to properly reset the device. While looking at the
>>>> ixgbe code, I have noticed that IXGBE_FLAG2_RESET_REQUESTED (in this case,
>>>> set in ixgbe_watchdog_flush_tx) is checked in ixgbe_reset_subtask. The
>>>> latter will only do anything if the link is not already down.
>>> 
>>> Why can't you properly reset the device?  The PF should have already
>>> taken care of resetting the queues when it did the reset itself.  All
>>> that should be left to do is for the VF to reinitialize the queues so
>>> that they are re-enabled after the reset.
>>> 
>> 
>> I guess, the problem is that we can stop a device but we are unable to start the device properly (because some registers are unavailable?) Particularly, in the patch for DPDK:
>> http://dpdk.org/dev/patchwork/patch/14009/
>> 
>> I see error messages "Failed to update link." If I understand correctly, this patch introduced a delay (1000 ms) to make sure that the link is up again. It also checks one register in a busy-wait loop (see comment: "When the PF link is down... VF cannot operate its registers"). But the problem here is that there might be completely arbitrary time between link going down and up again (minutes, hours, etc), so I cannot be sitting in a busy-wait loop like this.
> 
> This doesn't sound right to me.  So DPDK is expecting the link to
> always be up?  That isn't always going to be the case.  It seems like
> DPDK should figure out a way to enable interrupts and wait for the
> mailbox notification that the link has come back up.
> 

Are you proposing to split reset logic (from the patch) into 2 parts?
1. Always stop the device on the reset adapter notification
2. Start the device on the link up notification or if it is already up
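
The two-part split asked about above could be sketched as a small event handler. This is only an illustration of the idea under stated assumptions: the event names, the model_vf struct, and the restart_pending flag are hypothetical and do not correspond to DPDK or ixgbevf APIs.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical events a VF driver might receive via the mailbox. */
enum model_vf_event {
    EV_RESET_NOTIFY,  /* PF announced that it reset the adapter */
    EV_LINK_DOWN,     /* PF reported link loss */
    EV_LINK_UP        /* PF reported that the link is back */
};

/* Simplified VF state for the sketch. */
struct model_vf {
    bool started;          /* device queues are running */
    bool link_up;          /* last link state reported by the PF */
    bool restart_pending;  /* stopped by a reset, waiting for link up */
};

static void model_vf_handle(struct model_vf *vf, enum model_vf_event ev)
{
    assert(vf != NULL);
    switch (ev) {
    case EV_RESET_NOTIFY:
        /* Part 1: always stop the device on a reset notification. */
        vf->started = false;
        /* Part 2: start right away only if the link is already up;
         * otherwise remember that a start is owed. */
        if (vf->link_up)
            vf->started = true;
        else
            vf->restart_pending = true;
        break;
    case EV_LINK_DOWN:
        /* Only record the state; no register access is attempted. */
        vf->link_up = false;
        break;
    case EV_LINK_UP:
        vf->link_up = true;
        if (vf->restart_pending) {
            vf->started = true;
            vf->restart_pending = false;
        }
        break;
    }
}
```

Because the start is triggered by the link-up event, nothing has to poll or sleep while the link is down, however long that lasts.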


>>>> I guess, my question is why we are setting it when detecting that the link
>>>> is down. It is going to be down anyway. Can the actual reset take place when
>>>> the link is up again?
>>>> 
>>>> Thank you!
>>> 
>>> The short answer to this is "no".
>>> 
>>> What it all comes down to is that we have to flush the Tx queues when
>>> the link goes down to get rid of stale data.  We need to go through
>>> and clean out the Tx rings so that the Tx and Rx FIFOs are cleared and
>>> ready to go when the link comes back up.  We can't reset the part
>>> after link up because by that point the link has already come back up
>>> and the stale data is likely already moving through queues.
>> 
>> Ok, I see. What happens if the stale data (Tx) moves through queues and is actually sent? Is that a problem? Why do we need to reset queues? (Sorry if it is a silly question but just trying to understand why we are doing it in the first place.)
>> 
> 
> There ends up being a few different things that could happen depending
> on the hardware.  In some cases it can get as bad as Tx hangs or data
> corruptions.  Generally you don't want the driver sitting on the
> memory in the Tx rings.  You want it to flush the memory and just wait
> until the link has come back before we start queuing packets again.



* [Intel-wired-lan] Question about ixgbe RESET due to lost link
  2016-12-02 22:15       ` Ruslan Nikolaev
@ 2016-12-02 22:32         ` Alexander Duyck
  2016-12-03  2:03           ` Ruslan Nikolaev
  0 siblings, 1 reply; 7+ messages in thread
From: Alexander Duyck @ 2016-12-02 22:32 UTC (permalink / raw)
  To: intel-wired-lan

On Fri, Dec 2, 2016 at 2:15 PM, Ruslan Nikolaev <ruslan@purestorage.com> wrote:
> Thanks! I have one more question below.
>
>>
>>
>>> Thank you for your response! I still have questions below.
>>>
>>>>
>>>>> While working with ixgbe (PF) and dpdk (VF), I have noticed that sometimes
>>>>> we get "Reset adapter" message "due to lost link with pending Tx work".
>>>>>
>>>>> The problem is that when handling the VF reset message that arrives through
>>>>> a mailbox (in the corresponding dpdk handler), the link may already be down.
>>>>> Therefore, we are unable to properly reset the device. While looking at the
>>>>> ixgbe code, I have noticed that IXGBE_FLAG2_RESET_REQUESTED (in this case,
>>>>> set in ixgbe_watchdog_flush_tx) is checked in ixgbe_reset_subtask. The
>>>>> latter will only do anything if the link is not already down.
>>>>
>>>> Why can't you properly reset the device?  The PF should have already
>>>> taken care of resetting the queues when it did the reset itself.  All
>>>> that should be left to do is for the VF to reinitialize the queues so
>>>> that they are re-enabled after the reset.
>>>>
>>>
>>> I guess, the problem is that we can stop a device but we are unable to start the device properly (because some registers are unavailable?) Particularly, in the patch for DPDK:
>>> http://dpdk.org/dev/patchwork/patch/14009/
>>>
>>> I see error messages "Failed to update link." If I understand correctly, this patch introduced a delay (1000 ms) to make sure that the link is up again. It also checks one register in a busy-wait loop (see comment: "When the PF link is down... VF cannot operate its registers"). But the problem here is that there might be completely arbitrary time between link going down and up again (minutes, hours, etc), so I cannot be sitting in a busy-wait loop like this.
>>
>> This doesn't sound right to me.  So DPDK is expecting the link to
>> always be up?  That isn't always going to be the case.  It seems like
>> DPDK should figure out a way to enable interrupts and wait for the
>> mailbox notification that the link has come back up.
>>
>
> Are you proposing to split reset logic (from the patch) into 2 parts?
> 1. Always stop the device on the reset adapter notification
> 2. Start the device on the link up notification or if it is already up

I can't say for certain as I haven't worked in DPDK all that much.
However all of our kernel based VF drivers basically work on that kind
of logic.  They will receive the reset notification but they will wait
until the link up notification is received before they decide to
reset.  The trick is we do the reset notification based on the call to
ixgbevf_check_mac_link_vf() instead of just resetting if the link goes
down.  The PF takes care of the reset on its end if it is resetting
things due to the link going down, then the VF should reset when the
link comes back up.  When the link goes down though we need to make
certain to turn off all the watchdogs and such.
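
As a rough illustration of the deferral described above, the following standalone model records the reset notification and only acts on it from the periodic link check once the link is reported up. The struct and function names are hypothetical, and this is a sketch of the idea, not the ixgbevf code or the behaviour of ixgbevf_check_mac_link_vf() itself.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Simplified VF state for the sketch. */
struct model_kvf {
    bool reset_noted;     /* a PF reset notification was received */
    bool watchdog_armed;  /* Tx watchdogs and similar timers are active */
    int resets_done;      /* number of VF resets actually performed */
};

/* Mailbox handler: only remember that the PF reset happened. */
static void model_kvf_on_reset_msg(struct model_kvf *vf)
{
    assert(vf != NULL);
    vf->reset_noted = true;
}

/* Periodic link check. While the link is down the watchdogs are turned
 * off and no reset is attempted; the deferred reset runs on the first
 * check that sees the link up. Returns true if a reset was performed. */
static bool model_kvf_link_check(struct model_kvf *vf, bool link_up)
{
    assert(vf != NULL);
    if (!link_up) {
        vf->watchdog_armed = false;
        return false;
    }

    vf->watchdog_armed = true;
    if (!vf->reset_noted)
        return false;

    vf->reset_noted = false;
    vf->resets_done++;
    return true;
}
```

The difference from resetting directly on link-down is that the VF never touches the device while the link is down; it only records what it owes and settles it later.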

- Alex


* [Intel-wired-lan] Question about ixgbe RESET due to lost link
  2016-12-02 22:32         ` Alexander Duyck
@ 2016-12-03  2:03           ` Ruslan Nikolaev
  0 siblings, 0 replies; 7+ messages in thread
From: Ruslan Nikolaev @ 2016-12-03  2:03 UTC (permalink / raw)
  To: intel-wired-lan

Thanks! Do you rely on the mailbox reset notification at all in the ixgbevf? Or is it just resetting it when the link seems to be down? 

Also, do you mean that the reset procedure is completely postponed for VFs (including stopping the corresponding device per DPDK terminology) until after the link comes back up?


>> Thanks! I have one more question below.
>> 
>>> 
>>> 
>>>> Thank you for your response! I still have questions below.
>>>> 
>>>>> 
>>>>>> While working with ixgbe (PF) and dpdk (VF), I have noticed that sometimes
>>>>>> we get "Reset adapter" message "due to lost link with pending Tx work".
>>>>>> 
>>>>>> The problem is that when handling the VF reset message that arrives through
>>>>>> a mailbox (in the corresponding dpdk handler), the link may already be down.
>>>>>> Therefore, we are unable to properly reset the device. While looking at the
>>>>>> ixgbe code, I have noticed that IXGBE_FLAG2_RESET_REQUESTED (in this case,
>>>>>> set in ixgbe_watchdog_flush_tx) is checked in ixgbe_reset_subtask. The
>>>>>> latter will only do anything if the link is not already down.
>>>>> 
>>>>> Why can't you properly reset the device?  The PF should have already
>>>>> taken care of resetting the queues when it did the reset itself.  All
>>>>> that should be left to do is for the VF to reinitialize the queues so
>>>>> that they are re-enabled after the reset.
>>>>> 
>>>> 
>>>> I guess, the problem is that we can stop a device but we are unable to start the device properly (because some registers are unavailable?) Particularly, in the patch for DPDK:
>>>> http://dpdk.org/dev/patchwork/patch/14009/
>>>> 
>>>> I see error messages "Failed to update link." If I understand correctly, this patch introduced a delay (1000 ms) to make sure that the link is up again. It also checks one register in a busy-wait loop (see comment: "When the PF link is down... VF cannot operate its registers"). But the problem here is that there might be completely arbitrary time between link going down and up again (minutes, hours, etc), so I cannot be sitting in a busy-wait loop like this.
>>> 
>>> This doesn't sound right to me.  So DPDK is expecting the link to
>>> always be up?  That isn't always going to be the case.  It seems like
>>> DPDK should figure out a way to enable interrupts and wait for the
>>> mailbox notification that the link has come back up.
>>> 
>> 
>> Are you proposing to split reset logic (from the patch) into 2 parts?
>> 1. Always stop the device on the reset adapter notification
>> 2. Start the device on the link up notification or if it is already up
> 
> I can't say for certain as I haven't worked in DPDK all that much.
> However all of our kernel based VF drivers basically work on that kind
> of logic.  They will receive the reset notification but they will wait
> until the link up notification is received before they decide to
> reset.  The trick is we do the reset notification based on the call to
> ixgbevf_check_mac_link_vf() instead of just resetting if the link goes
> down.  The PF takes care of the reset on its end if it is resetting
> things due to the link going down, then the VF should reset when the
> link comes back up.  When the link goes down though we need to make
> certain to turn off all the watchdogs and such.
> 
> - Alex


