netdev.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* [PATCH V2 net] ibmvnic: Continue with reset if set link down failed
@ 2021-04-20 21:35 Dany Madden
  2021-04-20 21:42 ` Lijun Pan
  2021-04-22  5:30 ` Lijun Pan
  0 siblings, 2 replies; 16+ messages in thread
From: Dany Madden @ 2021-04-20 21:35 UTC (permalink / raw)
  To: davem, kuba
  Cc: drt, sukadev, tlfalcon, mpe, benh, paulus, netdev, linuxppc-dev

When ibmvnic gets a FATAL error message from the vnicserver, it marks
the Command Respond Queue (CRQ) inactive and resets the adapter. If this
FATAL reset fails and a transmission timeout reset follows, the CRQ is
still inactive, ibmvnic's attempt to set link down will also fail. If
ibmvnic abandons the reset because of this failed set link down and this
is the last reset in the workqueue, then this adapter will be left in an
inoperable state.

Instead, make the driver ignore this link down failure and continue to
free and re-register CRQ so that the adapter has an opportunity to
recover.

Fixes: ed651a10875f ("ibmvnic: Updated reset handling")
Signed-off-by: Dany Madden <drt@linux.ibm.com>
Reviewed-by: Rick Lindsley <ricklind@linux.ibm.com>
Reviewed-by: Sukadev Bhattiprolu <sukadev@linux.ibm.com>
---
Changes in V2:
- Update description to clarify background for the patch
- Include Reviewed-by tags
---
 drivers/net/ethernet/ibm/ibmvnic.c | 6 ++++--
 1 file changed, 4 insertions(+), 2 deletions(-)

diff --git a/drivers/net/ethernet/ibm/ibmvnic.c b/drivers/net/ethernet/ibm/ibmvnic.c
index ffb2a91750c7..4bd8c5d1a275 100644
--- a/drivers/net/ethernet/ibm/ibmvnic.c
+++ b/drivers/net/ethernet/ibm/ibmvnic.c
@@ -1970,8 +1970,10 @@ static int do_reset(struct ibmvnic_adapter *adapter,
 			rtnl_unlock();
 			rc = set_link_state(adapter, IBMVNIC_LOGICAL_LNK_DN);
 			rtnl_lock();
-			if (rc)
-				goto out;
+			if (rc) {
+				netdev_dbg(netdev,
+					   "Setting link down failed rc=%d. Continue anyway\n", rc);
+			}
 
 			if (adapter->state == VNIC_OPEN) {
 				/* When we dropped rtnl, ibmvnic_open() got
-- 
2.26.2


^ permalink raw reply related	[flat|nested] 16+ messages in thread

* Re: [PATCH V2 net] ibmvnic: Continue with reset if set link down failed
  2021-04-20 21:35 [PATCH V2 net] ibmvnic: Continue with reset if set link down failed Dany Madden
@ 2021-04-20 21:42 ` Lijun Pan
  2021-04-21  6:45   ` Sukadev Bhattiprolu
  2021-04-21  7:54   ` Rick Lindsley
  2021-04-22  5:30 ` Lijun Pan
  1 sibling, 2 replies; 16+ messages in thread
From: Lijun Pan @ 2021-04-20 21:42 UTC (permalink / raw)
  To: Dany Madden
  Cc: David Miller, Jakub Kicinski, Tom Falcon, netdev, paulus,
	Sukadev Bhattiprolu, linuxppc-dev



> On Apr 20, 2021, at 4:35 PM, Dany Madden <drt@linux.ibm.com> wrote:
> 
> When ibmvnic gets a FATAL error message from the vnicserver, it marks
> the Command Respond Queue (CRQ) inactive and resets the adapter. If this
> FATAL reset fails and a transmission timeout reset follows, the CRQ is
> still inactive, ibmvnic's attempt to set link down will also fail. If
> ibmvnic abandons the reset because of this failed set link down and this
> is the last reset in the workqueue, then this adapter will be left in an
> inoperable state.
> 
> Instead, make the driver ignore this link down failure and continue to
> free and re-register CRQ so that the adapter has an opportunity to
> recover.

This v2 does not adddress the concerns mentioned in v1.
And I think it is better to exit with error from do_reset, and schedule a thorough
do_hard_reset if the the adapter is already in unstable state.

> 
> Fixes: ed651a10875f ("ibmvnic: Updated reset handling")
> Signed-off-by: Dany Madden <drt@linux.ibm.com>
> Reviewed-by: Rick Lindsley <ricklind@linux.ibm.com>
> Reviewed-by: Sukadev Bhattiprolu <sukadev@linux.ibm.com>
> ---
> Changes in V2:
> - Update description to clarify background for the patch
> - Include Reviewed-by tags
> ---
> drivers/net/ethernet/ibm/ibmvnic.c | 6 ++++--
> 1 file changed, 4 insertions(+), 2 deletions(-)
> 
> diff --git a/drivers/net/ethernet/ibm/ibmvnic.c b/drivers/net/ethernet/ibm/ibmvnic.c
> index ffb2a91750c7..4bd8c5d1a275 100644
> --- a/drivers/net/ethernet/ibm/ibmvnic.c
> +++ b/drivers/net/ethernet/ibm/ibmvnic.c
> @@ -1970,8 +1970,10 @@ static int do_reset(struct ibmvnic_adapter *adapter,
> 			rtnl_unlock();
> 			rc = set_link_state(adapter, IBMVNIC_LOGICAL_LNK_DN);
> 			rtnl_lock();
> -			if (rc)
> -				goto out;
> +			if (rc) {
> +				netdev_dbg(netdev,
> +					   "Setting link down failed rc=%d. Continue anyway\n", rc);
> +			}
> 
> 			if (adapter->state == VNIC_OPEN) {
> 				/* When we dropped rtnl, ibmvnic_open() got
> -- 
> 2.26.2
> 


^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH V2 net] ibmvnic: Continue with reset if set link down failed
  2021-04-20 21:42 ` Lijun Pan
@ 2021-04-21  6:45   ` Sukadev Bhattiprolu
  2021-04-22  5:06     ` Lijun Pan
  2021-04-21  7:54   ` Rick Lindsley
  1 sibling, 1 reply; 16+ messages in thread
From: Sukadev Bhattiprolu @ 2021-04-21  6:45 UTC (permalink / raw)
  To: Lijun Pan
  Cc: Dany Madden, David Miller, Jakub Kicinski, Tom Falcon, netdev,
	paulus, linuxppc-dev

Lijun Pan [ljp@linux.vnet.ibm.com] wrote:
> 
> 
> > On Apr 20, 2021, at 4:35 PM, Dany Madden <drt@linux.ibm.com> wrote:
> > 
> > When ibmvnic gets a FATAL error message from the vnicserver, it marks
> > the Command Respond Queue (CRQ) inactive and resets the adapter. If this
> > FATAL reset fails and a transmission timeout reset follows, the CRQ is
> > still inactive, ibmvnic's attempt to set link down will also fail. If
> > ibmvnic abandons the reset because of this failed set link down and this
> > is the last reset in the workqueue, then this adapter will be left in an
> > inoperable state.
> > 
> > Instead, make the driver ignore this link down failure and continue to
> > free and re-register CRQ so that the adapter has an opportunity to
> > recover.
> 
> This v2 does not adddress the concerns mentioned in v1.
> And I think it is better to exit with error from do_reset, and schedule a thorough
> do_hard_reset if the the adapter is already in unstable state.

We had a FATAL error and when handling it, we failed to send a 
link-down message to the VIOS. So what we need to try next is to 
reset the connection with the VIOS. For this we must talk to the 
firmware using the H_FREE_CRQ and H_REG_CRQ hcalls. do_reset()
does just that in ibmvnic_reset_crq().

Now, sure we can attempt a "thorough hard reset" which also does
the same hcalls to reestablish the connection. Is there any
other magic in do_hard_reset()? But in addition, it also frees lot
more Linux kernel buffers and reallocates them for instance.

If we are having a communication problem with the VIOS, what is
the point of freeing and reallocating Linux kernel buffers? Beside
being inefficient, it would expose us to even more errors during
reset under heavy workloads?

From what I understand so far, do_reset() is complicated because
it is attempting some optimizations.  If we are going to fall back
to hard reset for every error we might as well drop the do_reset()
and just do the "thorough hard reset" every time right?

The protocol spec is ambiguous and so far I did not get a clear
answer on whether the link-down is even needed. If it is needed,
then should we add it to do_hard_reset() also? If not, we should
remove it (like you mentioned your earlier) completely but am
waiting for confirmation on that. git history has not been helpful.

While there are other rough edges around do_reset() that we are
working on fixing separately (eg: ignore the error return from 
__ibmvnic_close() right above this change) I see a benefit to
the customer with this patch.

I am not convinced we should perform a hard reset just because
the link down failed.

Sukadev

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH V2 net] ibmvnic: Continue with reset if set link down failed
  2021-04-20 21:42 ` Lijun Pan
  2021-04-21  6:45   ` Sukadev Bhattiprolu
@ 2021-04-21  7:54   ` Rick Lindsley
  2021-04-22  5:12     ` Lijun Pan
  1 sibling, 1 reply; 16+ messages in thread
From: Rick Lindsley @ 2021-04-21  7:54 UTC (permalink / raw)
  To: Lijun Pan, Dany Madden
  Cc: David Miller, Jakub Kicinski, Tom Falcon, netdev, paulus,
	Sukadev Bhattiprolu, linuxppc-dev

On 4/20/21 2:42 PM, Lijun Pan wrote:
> 
> This v2 does not adddress the concerns mentioned in v1.
> And I think it is better to exit with error from do_reset, and schedule a thorough
> do_hard_reset if the the adapter is already in unstable state.

But the point is that the testing and analysis has indicated that doing a full
hard reset is not necessary. We are about to take the very action which will fix
this situation, but currently do not.

Please describe the advantage in deferring it further by routing it through
do_hard_reset().  I don't see one.

Rick

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH V2 net] ibmvnic: Continue with reset if set link down failed
  2021-04-21  6:45   ` Sukadev Bhattiprolu
@ 2021-04-22  5:06     ` Lijun Pan
  2021-04-22  6:58       ` Sukadev Bhattiprolu
                         ` (2 more replies)
  0 siblings, 3 replies; 16+ messages in thread
From: Lijun Pan @ 2021-04-22  5:06 UTC (permalink / raw)
  To: Sukadev Bhattiprolu
  Cc: Lijun Pan, Dany Madden, David Miller, Jakub Kicinski, Tom Falcon,
	netdev, Paul Mackerras, linuxppc-dev

On Wed, Apr 21, 2021 at 2:25 AM Sukadev Bhattiprolu
<sukadev@linux.ibm.com> wrote:
>
> Lijun Pan [ljp@linux.vnet.ibm.com] wrote:
> >
> >
> > > On Apr 20, 2021, at 4:35 PM, Dany Madden <drt@linux.ibm.com> wrote:
> > >
> > > When ibmvnic gets a FATAL error message from the vnicserver, it marks
> > > the Command Respond Queue (CRQ) inactive and resets the adapter. If this
> > > FATAL reset fails and a transmission timeout reset follows, the CRQ is
> > > still inactive, ibmvnic's attempt to set link down will also fail. If
> > > ibmvnic abandons the reset because of this failed set link down and this
> > > is the last reset in the workqueue, then this adapter will be left in an
> > > inoperable state.
> > >
> > > Instead, make the driver ignore this link down failure and continue to
> > > free and re-register CRQ so that the adapter has an opportunity to
> > > recover.
> >
> > This v2 does not adddress the concerns mentioned in v1.
> > And I think it is better to exit with error from do_reset, and schedule a thorough
> > do_hard_reset if the the adapter is already in unstable state.
>
> We had a FATAL error and when handling it, we failed to send a
> link-down message to the VIOS. So what we need to try next is to
> reset the connection with the VIOS. For this we must talk to the
> firmware using the H_FREE_CRQ and H_REG_CRQ hcalls. do_reset()
> does just that in ibmvnic_reset_crq().
>
> Now, sure we can attempt a "thorough hard reset" which also does
> the same hcalls to reestablish the connection. Is there any
> other magic in do_hard_reset()? But in addition, it also frees lot
> more Linux kernel buffers and reallocates them for instance.

Working around everything in do_reset will make the code very difficult
to manage. Ultimately do_reset can do anything I am afraid, and do_hard_reset
can be removed completely or merged into do_reset.

>
> If we are having a communication problem with the VIOS, what is
> the point of freeing and reallocating Linux kernel buffers? Beside
> being inefficient, it would expose us to even more errors during
> reset under heavy workloads?

No real customer runs the system under that heavy load created by
HTX stress test, which can tear down any working system.

>
> From what I understand so far, do_reset() is complicated because
> it is attempting some optimizations.  If we are going to fall back
> to hard reset for every error we might as well drop the do_reset()
> and just do the "thorough hard reset" every time right?

I think such optimizations are catered for passing HTX tests. Whether
the optimization benefits the adapter, say making the adapter more stable,
I doubt it. I think there should be a trade off between optimization
and stability.

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH V2 net] ibmvnic: Continue with reset if set link down failed
  2021-04-21  7:54   ` Rick Lindsley
@ 2021-04-22  5:12     ` Lijun Pan
  2021-04-22  7:05       ` Rick Lindsley
  0 siblings, 1 reply; 16+ messages in thread
From: Lijun Pan @ 2021-04-22  5:12 UTC (permalink / raw)
  To: Rick Lindsley
  Cc: Lijun Pan, Dany Madden, David Miller, Jakub Kicinski, Tom Falcon,
	netdev, Paul Mackerras, Sukadev Bhattiprolu, linuxppc-dev

On Wed, Apr 21, 2021 at 3:06 AM Rick Lindsley
<ricklind@linux.vnet.ibm.com> wrote:
>
> On 4/20/21 2:42 PM, Lijun Pan wrote:
> >
> > This v2 does not adddress the concerns mentioned in v1.
> > And I think it is better to exit with error from do_reset, and schedule a thorough
> > do_hard_reset if the the adapter is already in unstable state.
>
> But the point is that the testing and analysis has indicated that doing a full
> hard reset is not necessary. We are about to take the very action which will fix
> this situation, but currently do not.

The testing was done on this patch. It was not performed on a full hard reset.
So I don't think you could even compare the two results.

>
> Please describe the advantage in deferring it further by routing it through
> do_hard_reset().  I don't see one.

It is not deferred. It exits with error and calls do_hard_reset.
See my reply to Suka's.

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH V2 net] ibmvnic: Continue with reset if set link down failed
  2021-04-20 21:35 [PATCH V2 net] ibmvnic: Continue with reset if set link down failed Dany Madden
  2021-04-20 21:42 ` Lijun Pan
@ 2021-04-22  5:30 ` Lijun Pan
  2021-04-22  7:07   ` Rick Lindsley
  1 sibling, 1 reply; 16+ messages in thread
From: Lijun Pan @ 2021-04-22  5:30 UTC (permalink / raw)
  To: Dany Madden
  Cc: David S. Miller, Jakub Kicinski, Sukadev Bhattiprolu,
	Thomas Falcon, Michael Ellerman, Benjamin Herrenschmidt,
	Paul Mackerras, netdev, linuxppc-dev

On Tue, Apr 20, 2021 at 4:37 PM Dany Madden <drt@linux.ibm.com> wrote:
>
> When ibmvnic gets a FATAL error message from the vnicserver, it marks
> the Command Respond Queue (CRQ) inactive and resets the adapter. If this
> FATAL reset fails and a transmission timeout reset follows, the CRQ is
> still inactive, ibmvnic's attempt to set link down will also fail. If
> ibmvnic abandons the reset because of this failed set link down and this
> is the last reset in the workqueue, then this adapter will be left in an
> inoperable state.
>
> Instead, make the driver ignore this link down failure and continue to
> free and re-register CRQ so that the adapter has an opportunity to
> recover.
>
> Fixes: ed651a10875f ("ibmvnic: Updated reset handling")
> Signed-off-by: Dany Madden <drt@linux.ibm.com>
> Reviewed-by: Rick Lindsley <ricklind@linux.ibm.com>
> Reviewed-by: Sukadev Bhattiprolu <sukadev@linux.ibm.com>

One thing I would like to point out as already pointed out by Nathan Lynch is
that those review-by tags given by the same groups of people from the same
company loses credibility over time if you never critique or ask
questions on the list.

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH V2 net] ibmvnic: Continue with reset if set link down failed
  2021-04-22  5:06     ` Lijun Pan
@ 2021-04-22  6:58       ` Sukadev Bhattiprolu
  2021-04-22  7:05       ` Rick Lindsley
  2021-04-22 17:21       ` Michal Suchánek
  2 siblings, 0 replies; 16+ messages in thread
From: Sukadev Bhattiprolu @ 2021-04-22  6:58 UTC (permalink / raw)
  To: Lijun Pan
  Cc: Lijun Pan, Dany Madden, David Miller, Jakub Kicinski, Tom Falcon,
	netdev, Paul Mackerras, linuxppc-dev

Lijun Pan [lijunp213@gmail.com] wrote:
> > Now, sure we can attempt a "thorough hard reset" which also does
> > the same hcalls to reestablish the connection. Is there any
> > other magic in do_hard_reset()? But in addition, it also frees lot
> > more Linux kernel buffers and reallocates them for instance.
> 
> Working around everything in do_reset will make the code very difficult

We are not working around everything. We are doing in do_reset()
exactly what we would do in hard reset for this error (ignore the
set link down error and try to reestablish the connection with the
VIOS).

What we are avoiding is unnecessary work on the Linux side for a
communication problem on the VIOS side.

> to manage. Ultimately do_reset can do anything I am afraid, and do_hard_reset
> can be removed completely or merged into do_reset.
> 
> >
> > If we are having a communication problem with the VIOS, what is
> > the point of freeing and reallocating Linux kernel buffers? Beside
> > being inefficient, it would expose us to even more errors during
> > reset under heavy workloads?
> 
> No real customer runs the system under that heavy load created by
> HTX stress test, which can tear down any working system.

We need to talk to capacity planning and test architects about that,
but all I want to know is what hard reset would do differently to
fix this communication error with VIOS.

Sukadev

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH V2 net] ibmvnic: Continue with reset if set link down failed
  2021-04-22  5:12     ` Lijun Pan
@ 2021-04-22  7:05       ` Rick Lindsley
  0 siblings, 0 replies; 16+ messages in thread
From: Rick Lindsley @ 2021-04-22  7:05 UTC (permalink / raw)
  To: Lijun Pan
  Cc: Lijun Pan, Dany Madden, David Miller, Jakub Kicinski, Tom Falcon,
	netdev, Paul Mackerras, Sukadev Bhattiprolu, linuxppc-dev

On Wed, Apr 21, 2021 at 3:06 AM Rick Lindsley
<ricklind@linux.vnet.ibm.com> wrote:

>> Please describe the advantage in deferring it further by routing it through
>> do_hard_reset().  I don't see one.

On 4/21/21 10:12 PM, Lijun Pan replied:

> It is not deferred. It exits with error and calls do_hard_reset.
> See my reply to Suka's.

I saw your reply, but it does not answer the question I asked.  The patch
would have us reinitialize and restart the communication queues.  Your
suggestion would do more work than that.  Please describe the advantage
in deferring the reinitialization - and yes, defer is the right word -
by routing it through do_hard_reset() and executing that extra code.  I
see that route as doing more work than necessary and so introducing
additional risk, for no clear advantage.  So I would find it helpful
if you would describe the advantage.

> The testing was done on this patch. It was not performed on a full hard reset.
> So I don't think you could even compare the two results.

A problem has been noted, a patch has been proposed, and the reasoning that
the patch is correct has been given.  Testing with this patch has
demonstrated the problem has not returned.  So far, that sounds like a
pretty reasonable solution.

Your comment is curious - why would testing for this patch be done on a full
hard reset when this patch does not invoke a full hard reset?  If you have
data to consider then let's have it. I'm willing to be convinced, but so far
this just sounds like "I wouldn't do it that way myself, and I have a bad
feeling about doing it any other way."

Rick

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH V2 net] ibmvnic: Continue with reset if set link down failed
  2021-04-22  5:06     ` Lijun Pan
  2021-04-22  6:58       ` Sukadev Bhattiprolu
@ 2021-04-22  7:05       ` Rick Lindsley
  2021-04-22 17:21       ` Michal Suchánek
  2 siblings, 0 replies; 16+ messages in thread
From: Rick Lindsley @ 2021-04-22  7:05 UTC (permalink / raw)
  To: Lijun Pan, Sukadev Bhattiprolu
  Cc: Lijun Pan, Dany Madden, David Miller, Jakub Kicinski, Tom Falcon,
	netdev, Paul Mackerras, linuxppc-dev

On 4/21/21 10:06 PM, Lijun Pan wrote:
> No real customer runs the system under that heavy load created by
> HTX stress test, which can tear down any working system.

So, are you saying the bugs that HTX uncovers are not worth fixing?

There's a fair number of individuals and teams out there that
utilize HTX in the absence of such a workload, and not just for vnic.
What workload would you suggest to better serve "real customers"?

> I think such optimizations are catered for passing HTX tests.

Well, yes.  If the bugs are found with HTX, the fixes are verified
against HTX.  If they were found with a "real customer workload" they'd
be verified against a "real customer workload."

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH V2 net] ibmvnic: Continue with reset if set link down failed
  2021-04-22  5:30 ` Lijun Pan
@ 2021-04-22  7:07   ` Rick Lindsley
  2021-04-22 17:01     ` Lijun Pan
  0 siblings, 1 reply; 16+ messages in thread
From: Rick Lindsley @ 2021-04-22  7:07 UTC (permalink / raw)
  To: Lijun Pan, Dany Madden
  Cc: David S. Miller, Jakub Kicinski, Sukadev Bhattiprolu,
	Thomas Falcon, Michael Ellerman, Benjamin Herrenschmidt,
	Paul Mackerras, netdev, linuxppc-dev

On 4/21/21 10:30 PM, Lijun Pan wrote:
>> Fixes: ed651a10875f ("ibmvnic: Updated reset handling")
>> Signed-off-by: Dany Madden <drt@linux.ibm.com>
>> Reviewed-by: Rick Lindsley <ricklind@linux.ibm.com>
>> Reviewed-by: Sukadev Bhattiprolu <sukadev@linux.ibm.com>
> 
> One thing I would like to point out as already pointed out by Nathan Lynch is
> that those review-by tags given by the same groups of people from the same
> company loses credibility over time if you never critique or ask
> questions on the list.
> 

Well, so far you aren't addressing either my critiques or questions.

I have been asking questions but all I have from you are the above
attempts to discredit the reputation of myself and other people, and
non-technical statements like

     will make the code very difficult to manage
     I think there should be a trade off between optimization and stability.
     So I don't think you could even compare the two results

On the other hand, from the original submission I see some very specific
details:

     If ibmvnic abandons the reset because of this failed set link
     down and this is the last reset in the workqueue, then this
     adapter will be left in an inoperable state.

and from a followup discussion:

     We had a FATAL error and when handling it, we failed to
     send a link-down message to the VIOS. So what we need
     to try next is to reset the connection with the VIOS. For
     this we must ...

These are great technical points that could be argued or discussed.
Problem is, I agree with them.

I will ask again:  can you please supply some technical reasons for
your objections.  Otherwise, your objections are meritless and at worst
simply an ad hominem attack.

Rick

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH V2 net] ibmvnic: Continue with reset if set link down failed
  2021-04-22  7:07   ` Rick Lindsley
@ 2021-04-22 17:01     ` Lijun Pan
  0 siblings, 0 replies; 16+ messages in thread
From: Lijun Pan @ 2021-04-22 17:01 UTC (permalink / raw)
  To: Rick Lindsley
  Cc: Dany Madden, David S. Miller, Jakub Kicinski,
	Sukadev Bhattiprolu, Thomas Falcon, Michael Ellerman,
	Benjamin Herrenschmidt, Paul Mackerras, netdev, linuxppc-dev

On Thu, Apr 22, 2021 at 2:07 AM Rick Lindsley
<ricklind@linux.vnet.ibm.com> wrote:
>
> On 4/21/21 10:30 PM, Lijun Pan wrote:
> >> Fixes: ed651a10875f ("ibmvnic: Updated reset handling")
> >> Signed-off-by: Dany Madden <drt@linux.ibm.com>
> >> Reviewed-by: Rick Lindsley <ricklind@linux.ibm.com>
> >> Reviewed-by: Sukadev Bhattiprolu <sukadev@linux.ibm.com>
> >
> > One thing I would like to point out as already pointed out by Nathan Lynch is
> > that those review-by tags given by the same groups of people from the same
> > company loses credibility over time if you never critique or ask
> > questions on the list.
> >
>
> Well, so far you aren't addressing either my critiques or questions.
>
> I have been asking questions but all I have from you are the above
> attempts to discredit the reputation of myself and other people, and
> non-technical statements like
>
>      will make the code very difficult to manage
>      I think there should be a trade off between optimization and stability.
>      So I don't think you could even compare the two results
>
> On the other hand, from the original submission I see some very specific
> details:
>
>      If ibmvnic abandons the reset because of this failed set link
>      down and this is the last reset in the workqueue, then this
>      adapter will be left in an inoperable state.
>
> and from a followup discussion:
>
>      We had a FATAL error and when handling it, we failed to
>      send a link-down message to the VIOS. So what we need
>      to try next is to reset the connection with the VIOS. For
>      this we must ...
>
> These are great technical points that could be argued or discussed.
> Problem is, I agree with them.
>
> I will ask again:  can you please supply some technical reasons for
> your objections.  Otherwise, your objections are meritless and at worst
> simply an ad hominem attack.

Well, from the beginning of v1, I started to provide technical inputs.
Then I was not
allowed to post anything in the community about this patch and VNIC
via ljp@linux.ibm.com except giving an ack-by/reviewed-by.

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH V2 net] ibmvnic: Continue with reset if set link down failed
  2021-04-22  5:06     ` Lijun Pan
  2021-04-22  6:58       ` Sukadev Bhattiprolu
  2021-04-22  7:05       ` Rick Lindsley
@ 2021-04-22 17:21       ` Michal Suchánek
  2021-04-22 17:38         ` Lijun Pan
  2021-04-23  2:26         ` Rick Lindsley
  2 siblings, 2 replies; 16+ messages in thread
From: Michal Suchánek @ 2021-04-22 17:21 UTC (permalink / raw)
  To: Lijun Pan
  Cc: Sukadev Bhattiprolu, netdev, Lijun Pan, Tom Falcon,
	Paul Mackerras, Dany Madden, Jakub Kicinski, linuxppc-dev,
	David Miller

Hello,

On Thu, Apr 22, 2021 at 12:06:45AM -0500, Lijun Pan wrote:
> On Wed, Apr 21, 2021 at 2:25 AM Sukadev Bhattiprolu
> <sukadev@linux.ibm.com> wrote:
> >
> > Lijun Pan [ljp@linux.vnet.ibm.com] wrote:
> > >
> > >
> > > > On Apr 20, 2021, at 4:35 PM, Dany Madden <drt@linux.ibm.com> wrote:
> > > >
> > > > When ibmvnic gets a FATAL error message from the vnicserver, it marks
> > > > the Command Respond Queue (CRQ) inactive and resets the adapter. If this
> > > > FATAL reset fails and a transmission timeout reset follows, the CRQ is
> > > > still inactive, ibmvnic's attempt to set link down will also fail. If
> > > > ibmvnic abandons the reset because of this failed set link down and this
> > > > is the last reset in the workqueue, then this adapter will be left in an
> > > > inoperable state.
> > > >
> > > > Instead, make the driver ignore this link down failure and continue to
> > > > free and re-register CRQ so that the adapter has an opportunity to
> > > > recover.
> > >
> > > This v2 does not adddress the concerns mentioned in v1.
> > > And I think it is better to exit with error from do_reset, and schedule a thorough
> > > do_hard_reset if the the adapter is already in unstable state.
> >
> > We had a FATAL error and when handling it, we failed to send a
> > link-down message to the VIOS. So what we need to try next is to
> > reset the connection with the VIOS. For this we must talk to the
> > firmware using the H_FREE_CRQ and H_REG_CRQ hcalls. do_reset()
> > does just that in ibmvnic_reset_crq().
> >
> > Now, sure we can attempt a "thorough hard reset" which also does
> > the same hcalls to reestablish the connection. Is there any
> > other magic in do_hard_reset()? But in addition, it also frees lot
> > more Linux kernel buffers and reallocates them for instance.
> 
> Working around everything in do_reset will make the code very difficult
> to manage. Ultimately do_reset can do anything I am afraid, and do_hard_reset
> can be removed completely or merged into do_reset.

This debate is not very constructive.

In the context of driver that has separate do_reset and do_hard_reset
this fix picks the correct one unless you can refute the arguments
provided.

Merging do_reset and do_hard_reset might be a good code cleanup which is
out of the scope of this fix.



Given that vast majority of fixes to the vnic driver are related to the
reset handling it would improve stability and testability if every
reset took the same code path.

In the context of merging do_hard_reset and do_reset the question is
what is the intended distinction and performance gain by having
'lightweight' reset.

I don't have a vnic protocol manual at hand and I suspect I would not
get one even if I searched for one.

From reading through the fixes in the past my understanding is that the
full reset is required when the backend changes which then potentially
requires different size/number of buffers.

What is the expected situation when reset is required without changing
the backend?

Is this so common that it warrants a separate 'lightweight' optimized
function?

Thanks

Michal

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH V2 net] ibmvnic: Continue with reset if set link down failed
  2021-04-22 17:21       ` Michal Suchánek
@ 2021-04-22 17:38         ` Lijun Pan
  2021-04-23  2:26         ` Rick Lindsley
  1 sibling, 0 replies; 16+ messages in thread
From: Lijun Pan @ 2021-04-22 17:38 UTC (permalink / raw)
  To: Michal Suchánek
  Cc: Sukadev Bhattiprolu, netdev, Lijun Pan, Tom Falcon,
	Paul Mackerras, Dany Madden, Jakub Kicinski, linuxppc-dev,
	David Miller

On Thu, Apr 22, 2021 at 12:21 PM Michal Suchánek <msuchanek@suse.de> wrote:
>
> Hello,
>
> On Thu, Apr 22, 2021 at 12:06:45AM -0500, Lijun Pan wrote:
> > On Wed, Apr 21, 2021 at 2:25 AM Sukadev Bhattiprolu
> > <sukadev@linux.ibm.com> wrote:
> > >
> > > Lijun Pan [ljp@linux.vnet.ibm.com] wrote:
> > > >
> > > >
> > > > > On Apr 20, 2021, at 4:35 PM, Dany Madden <drt@linux.ibm.com> wrote:
> > > > >
> > > > > When ibmvnic gets a FATAL error message from the vnicserver, it marks
> > > > > the Command Respond Queue (CRQ) inactive and resets the adapter. If this
> > > > > FATAL reset fails and a transmission timeout reset follows, the CRQ is
> > > > > still inactive, ibmvnic's attempt to set link down will also fail. If
> > > > > ibmvnic abandons the reset because of this failed set link down and this
> > > > > is the last reset in the workqueue, then this adapter will be left in an
> > > > > inoperable state.
> > > > >
> > > > > Instead, make the driver ignore this link down failure and continue to
> > > > > free and re-register CRQ so that the adapter has an opportunity to
> > > > > recover.
> > > >
> > > > This v2 does not adddress the concerns mentioned in v1.
> > > > And I think it is better to exit with error from do_reset, and schedule a thorough
> > > > do_hard_reset if the the adapter is already in unstable state.
> > >
> > > We had a FATAL error and when handling it, we failed to send a
> > > link-down message to the VIOS. So what we need to try next is to
> > > reset the connection with the VIOS. For this we must talk to the
> > > firmware using the H_FREE_CRQ and H_REG_CRQ hcalls. do_reset()
> > > does just that in ibmvnic_reset_crq().
> > >
> > > Now, sure we can attempt a "thorough hard reset" which also does
> > > the same hcalls to reestablish the connection. Is there any
> > > other magic in do_hard_reset()? But in addition, it also frees lot
> > > more Linux kernel buffers and reallocates them for instance.
> >
> > Working around everything in do_reset will make the code very difficult
> > to manage. Ultimately do_reset can do anything I am afraid, and do_hard_reset
> > can be removed completely or merged into do_reset.

Michal,

I should have given more details about the above statement. Thanks for
your detailed info in the below.

>
> This debate is not very constructive.
>
> In the context of driver that has separate do_reset and do_hard_reset
> this fix picks the correct one unless you can refute the arguments
> provided.
>
> Merging do_reset and do_hard_reset might be a good code cleanup which is
> out of the scope of this fix.

Right.

>
>
>
> Given that vast majority of fixes to the vnic driver are related to the
> reset handling it would improve stability and testability if every
> reset took the same code path.

I agree.

>
> In the context of merging do_hard_reset and do_reset the question is
> what is the intended distinction and performance gain by having
> 'lightweight' reset.

Right.

>
> I don't have a vnic protocol manual at hand and I suspect I would not
> get one even if I searched for one.
>
> From reading through the fixes in the past my understanding is that the
> full reset is required when the backend changes which then potentially
> requires different size/number of buffers.

Yes, full reset is better when the backend changes.

>
> What is the expected situation when reset is required without changing
> the backend?
>
> Is this so common that it warrants a separate 'lightweight' optimized
> function?
>

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH V2 net] ibmvnic: Continue with reset if set link down failed
  2021-04-22 17:21       ` Michal Suchánek
  2021-04-22 17:38         ` Lijun Pan
@ 2021-04-23  2:26         ` Rick Lindsley
  2021-05-04 19:05           ` Dany Madden
  1 sibling, 1 reply; 16+ messages in thread
From: Rick Lindsley @ 2021-04-23  2:26 UTC (permalink / raw)
  To: Michal Suchánek, Lijun Pan
  Cc: Sukadev Bhattiprolu, netdev, Lijun Pan, Tom Falcon,
	Paul Mackerras, Dany Madden, Jakub Kicinski, linuxppc-dev,
	David Miller

On 4/22/21 10:21 AM, Michal Suchánek wrote:

> Merging do_reset and do_hard_reset might be a good code cleanup which is
> out of the scope of this fix.

I agree, on both points. Accepting the patch, and improving the reset
path are not in conflict with each other.

My position is that improving the reset path is certainly on the table,
but not with this bug or this fix.  Within the context of this discovered
problem, the issue is well understood, the fix is small and addresses the
immediate problem, and it has not generated new bugs under subsequent
testing.  For those reasons, the submitted patch has my support.

Rick


^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH V2 net] ibmvnic: Continue with reset if set link down failed
  2021-04-23  2:26         ` Rick Lindsley
@ 2021-05-04 19:05           ` Dany Madden
  0 siblings, 0 replies; 16+ messages in thread
From: Dany Madden @ 2021-05-04 19:05 UTC (permalink / raw)
  To: Rick Lindsley
  Cc: Michal Suchánek, Lijun Pan, Sukadev Bhattiprolu, netdev,
	Lijun Pan, Tom Falcon, Paul Mackerras, Jakub Kicinski,
	linuxppc-dev, David Miller

On 2021-04-22 19:26, Rick Lindsley wrote:
> On 4/22/21 10:21 AM, Michal Suchánek wrote:
> 
>> Merging do_reset and do_hard_reset might be a good code cleanup which 
>> is
>> out of the scope of this fix.
> 
> I agree, on both points. Accepting the patch, and improving the reset
> path are not in conflict with each other.
> 
> My position is that improving the reset path is certainly on the table,
> but not with this bug or this fix.  Within the context of this 
> discovered
> problem, the issue is well understood, the fix is small and addresses 
> the
> immediate problem, and it has not generated new bugs under subsequent
> testing.  For those reasons, the submitted patch has my support.
> 
> Rick

Thanks for all the feedback on the patch. Refactoring the ibmvnic reset 
functions is something we plan to evaluate, as this would potentially 
simplify the reset code. In the mean time, the proposed patch is simple, 
and has been validated in our test environment to solve an issue 
resulting in vnic interfaces getting stuck in an offline state following 
a vnic timeout reset. Given that, I suggest we merge this patch while we 
look at further optimizations in future. I will submit a V3 patch 
shortly.

Dany

^ permalink raw reply	[flat|nested] 16+ messages in thread

end of thread, other threads:[~2021-05-04 19:06 UTC | newest]

Thread overview: 16+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2021-04-20 21:35 [PATCH V2 net] ibmvnic: Continue with reset if set link down failed Dany Madden
2021-04-20 21:42 ` Lijun Pan
2021-04-21  6:45   ` Sukadev Bhattiprolu
2021-04-22  5:06     ` Lijun Pan
2021-04-22  6:58       ` Sukadev Bhattiprolu
2021-04-22  7:05       ` Rick Lindsley
2021-04-22 17:21       ` Michal Suchánek
2021-04-22 17:38         ` Lijun Pan
2021-04-23  2:26         ` Rick Lindsley
2021-05-04 19:05           ` Dany Madden
2021-04-21  7:54   ` Rick Lindsley
2021-04-22  5:12     ` Lijun Pan
2021-04-22  7:05       ` Rick Lindsley
2021-04-22  5:30 ` Lijun Pan
2021-04-22  7:07   ` Rick Lindsley
2021-04-22 17:01     ` Lijun Pan

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).