* [PATCH V2 net] ibmvnic: Continue with reset if set link down failed @ 2021-04-20 21:35 ` Dany Madden 0 siblings, 0 replies; 32+ messages in thread From: Dany Madden @ 2021-04-20 21:35 UTC (permalink / raw) To: davem, kuba Cc: drt, sukadev, tlfalcon, mpe, benh, paulus, netdev, linuxppc-dev When ibmvnic gets a FATAL error message from the vnicserver, it marks the Command Respond Queue (CRQ) inactive and resets the adapter. If this FATAL reset fails and a transmission timeout reset follows, the CRQ is still inactive, ibmvnic's attempt to set link down will also fail. If ibmvnic abandons the reset because of this failed set link down and this is the last reset in the workqueue, then this adapter will be left in an inoperable state. Instead, make the driver ignore this link down failure and continue to free and re-register CRQ so that the adapter has an opportunity to recover. Fixes: ed651a10875f ("ibmvnic: Updated reset handling") Signed-off-by: Dany Madden <drt@linux.ibm.com> Reviewed-by: Rick Lindsley <ricklind@linux.ibm.com> Reviewed-by: Sukadev Bhattiprolu <sukadev@linux.ibm.com> --- Changes in V2: - Update description to clarify background for the patch - Include Reviewed-by tags --- drivers/net/ethernet/ibm/ibmvnic.c | 6 ++++-- 1 file changed, 4 insertions(+), 2 deletions(-) diff --git a/drivers/net/ethernet/ibm/ibmvnic.c b/drivers/net/ethernet/ibm/ibmvnic.c index ffb2a91750c7..4bd8c5d1a275 100644 --- a/drivers/net/ethernet/ibm/ibmvnic.c +++ b/drivers/net/ethernet/ibm/ibmvnic.c @@ -1970,8 +1970,10 @@ static int do_reset(struct ibmvnic_adapter *adapter, rtnl_unlock(); rc = set_link_state(adapter, IBMVNIC_LOGICAL_LNK_DN); rtnl_lock(); - if (rc) - goto out; + if (rc) { + netdev_dbg(netdev, + "Setting link down failed rc=%d. Continue anyway\n", rc); + } if (adapter->state == VNIC_OPEN) { /* When we dropped rtnl, ibmvnic_open() got -- 2.26.2 ^ permalink raw reply related [flat|nested] 32+ messages in thread
* [PATCH V2 net] ibmvnic: Continue with reset if set link down failed @ 2021-04-20 21:35 ` Dany Madden 0 siblings, 0 replies; 32+ messages in thread From: Dany Madden @ 2021-04-20 21:35 UTC (permalink / raw) To: davem, kuba; +Cc: tlfalcon, netdev, paulus, drt, sukadev, linuxppc-dev When ibmvnic gets a FATAL error message from the vnicserver, it marks the Command Respond Queue (CRQ) inactive and resets the adapter. If this FATAL reset fails and a transmission timeout reset follows, the CRQ is still inactive, ibmvnic's attempt to set link down will also fail. If ibmvnic abandons the reset because of this failed set link down and this is the last reset in the workqueue, then this adapter will be left in an inoperable state. Instead, make the driver ignore this link down failure and continue to free and re-register CRQ so that the adapter has an opportunity to recover. Fixes: ed651a10875f ("ibmvnic: Updated reset handling") Signed-off-by: Dany Madden <drt@linux.ibm.com> Reviewed-by: Rick Lindsley <ricklind@linux.ibm.com> Reviewed-by: Sukadev Bhattiprolu <sukadev@linux.ibm.com> --- Changes in V2: - Update description to clarify background for the patch - Include Reviewed-by tags --- drivers/net/ethernet/ibm/ibmvnic.c | 6 ++++-- 1 file changed, 4 insertions(+), 2 deletions(-) diff --git a/drivers/net/ethernet/ibm/ibmvnic.c b/drivers/net/ethernet/ibm/ibmvnic.c index ffb2a91750c7..4bd8c5d1a275 100644 --- a/drivers/net/ethernet/ibm/ibmvnic.c +++ b/drivers/net/ethernet/ibm/ibmvnic.c @@ -1970,8 +1970,10 @@ static int do_reset(struct ibmvnic_adapter *adapter, rtnl_unlock(); rc = set_link_state(adapter, IBMVNIC_LOGICAL_LNK_DN); rtnl_lock(); - if (rc) - goto out; + if (rc) { + netdev_dbg(netdev, + "Setting link down failed rc=%d. Continue anyway\n", rc); + } if (adapter->state == VNIC_OPEN) { /* When we dropped rtnl, ibmvnic_open() got -- 2.26.2 ^ permalink raw reply related [flat|nested] 32+ messages in thread
* Re: [PATCH V2 net] ibmvnic: Continue with reset if set link down failed 2021-04-20 21:35 ` Dany Madden @ 2021-04-20 21:42 ` Lijun Pan -1 siblings, 0 replies; 32+ messages in thread From: Lijun Pan @ 2021-04-20 21:42 UTC (permalink / raw) To: Dany Madden Cc: David Miller, Jakub Kicinski, Tom Falcon, netdev, paulus, Sukadev Bhattiprolu, linuxppc-dev > On Apr 20, 2021, at 4:35 PM, Dany Madden <drt@linux.ibm.com> wrote: > > When ibmvnic gets a FATAL error message from the vnicserver, it marks > the Command Respond Queue (CRQ) inactive and resets the adapter. If this > FATAL reset fails and a transmission timeout reset follows, the CRQ is > still inactive, ibmvnic's attempt to set link down will also fail. If > ibmvnic abandons the reset because of this failed set link down and this > is the last reset in the workqueue, then this adapter will be left in an > inoperable state. > > Instead, make the driver ignore this link down failure and continue to > free and re-register CRQ so that the adapter has an opportunity to > recover. This v2 does not adddress the concerns mentioned in v1. And I think it is better to exit with error from do_reset, and schedule a thorough do_hard_reset if the the adapter is already in unstable state. > > Fixes: ed651a10875f ("ibmvnic: Updated reset handling") > Signed-off-by: Dany Madden <drt@linux.ibm.com> > Reviewed-by: Rick Lindsley <ricklind@linux.ibm.com> > Reviewed-by: Sukadev Bhattiprolu <sukadev@linux.ibm.com> > --- > Changes in V2: > - Update description to clarify background for the patch > - Include Reviewed-by tags > --- > drivers/net/ethernet/ibm/ibmvnic.c | 6 ++++-- > 1 file changed, 4 insertions(+), 2 deletions(-) > > diff --git a/drivers/net/ethernet/ibm/ibmvnic.c b/drivers/net/ethernet/ibm/ibmvnic.c > index ffb2a91750c7..4bd8c5d1a275 100644 > --- a/drivers/net/ethernet/ibm/ibmvnic.c > +++ b/drivers/net/ethernet/ibm/ibmvnic.c > @@ -1970,8 +1970,10 @@ static int do_reset(struct ibmvnic_adapter *adapter, > rtnl_unlock(); > rc = set_link_state(adapter, IBMVNIC_LOGICAL_LNK_DN); > rtnl_lock(); > - if (rc) > - goto out; > + if (rc) { > + netdev_dbg(netdev, > + "Setting link down failed rc=%d. Continue anyway\n", rc); > + } > > if (adapter->state == VNIC_OPEN) { > /* When we dropped rtnl, ibmvnic_open() got > -- > 2.26.2 > ^ permalink raw reply [flat|nested] 32+ messages in thread
* Re: [PATCH V2 net] ibmvnic: Continue with reset if set link down failed @ 2021-04-20 21:42 ` Lijun Pan 0 siblings, 0 replies; 32+ messages in thread From: Lijun Pan @ 2021-04-20 21:42 UTC (permalink / raw) To: Dany Madden Cc: netdev, Tom Falcon, paulus, Jakub Kicinski, Sukadev Bhattiprolu, linuxppc-dev, David Miller > On Apr 20, 2021, at 4:35 PM, Dany Madden <drt@linux.ibm.com> wrote: > > When ibmvnic gets a FATAL error message from the vnicserver, it marks > the Command Respond Queue (CRQ) inactive and resets the adapter. If this > FATAL reset fails and a transmission timeout reset follows, the CRQ is > still inactive, ibmvnic's attempt to set link down will also fail. If > ibmvnic abandons the reset because of this failed set link down and this > is the last reset in the workqueue, then this adapter will be left in an > inoperable state. > > Instead, make the driver ignore this link down failure and continue to > free and re-register CRQ so that the adapter has an opportunity to > recover. This v2 does not adddress the concerns mentioned in v1. And I think it is better to exit with error from do_reset, and schedule a thorough do_hard_reset if the the adapter is already in unstable state. > > Fixes: ed651a10875f ("ibmvnic: Updated reset handling") > Signed-off-by: Dany Madden <drt@linux.ibm.com> > Reviewed-by: Rick Lindsley <ricklind@linux.ibm.com> > Reviewed-by: Sukadev Bhattiprolu <sukadev@linux.ibm.com> > --- > Changes in V2: > - Update description to clarify background for the patch > - Include Reviewed-by tags > --- > drivers/net/ethernet/ibm/ibmvnic.c | 6 ++++-- > 1 file changed, 4 insertions(+), 2 deletions(-) > > diff --git a/drivers/net/ethernet/ibm/ibmvnic.c b/drivers/net/ethernet/ibm/ibmvnic.c > index ffb2a91750c7..4bd8c5d1a275 100644 > --- a/drivers/net/ethernet/ibm/ibmvnic.c > +++ b/drivers/net/ethernet/ibm/ibmvnic.c > @@ -1970,8 +1970,10 @@ static int do_reset(struct ibmvnic_adapter *adapter, > rtnl_unlock(); > rc = set_link_state(adapter, IBMVNIC_LOGICAL_LNK_DN); > rtnl_lock(); > - if (rc) > - goto out; > + if (rc) { > + netdev_dbg(netdev, > + "Setting link down failed rc=%d. Continue anyway\n", rc); > + } > > if (adapter->state == VNIC_OPEN) { > /* When we dropped rtnl, ibmvnic_open() got > -- > 2.26.2 > ^ permalink raw reply [flat|nested] 32+ messages in thread
* Re: [PATCH V2 net] ibmvnic: Continue with reset if set link down failed 2021-04-20 21:42 ` Lijun Pan @ 2021-04-21 6:45 ` Sukadev Bhattiprolu -1 siblings, 0 replies; 32+ messages in thread From: Sukadev Bhattiprolu @ 2021-04-21 6:45 UTC (permalink / raw) To: Lijun Pan Cc: Dany Madden, David Miller, Jakub Kicinski, Tom Falcon, netdev, paulus, linuxppc-dev Lijun Pan [ljp@linux.vnet.ibm.com] wrote: > > > > On Apr 20, 2021, at 4:35 PM, Dany Madden <drt@linux.ibm.com> wrote: > > > > When ibmvnic gets a FATAL error message from the vnicserver, it marks > > the Command Respond Queue (CRQ) inactive and resets the adapter. If this > > FATAL reset fails and a transmission timeout reset follows, the CRQ is > > still inactive, ibmvnic's attempt to set link down will also fail. If > > ibmvnic abandons the reset because of this failed set link down and this > > is the last reset in the workqueue, then this adapter will be left in an > > inoperable state. > > > > Instead, make the driver ignore this link down failure and continue to > > free and re-register CRQ so that the adapter has an opportunity to > > recover. > > This v2 does not adddress the concerns mentioned in v1. > And I think it is better to exit with error from do_reset, and schedule a thorough > do_hard_reset if the the adapter is already in unstable state. We had a FATAL error and when handling it, we failed to send a link-down message to the VIOS. So what we need to try next is to reset the connection with the VIOS. For this we must talk to the firmware using the H_FREE_CRQ and H_REG_CRQ hcalls. do_reset() does just that in ibmvnic_reset_crq(). Now, sure we can attempt a "thorough hard reset" which also does the same hcalls to reestablish the connection. Is there any other magic in do_hard_reset()? But in addition, it also frees lot more Linux kernel buffers and reallocates them for instance. If we are having a communication problem with the VIOS, what is the point of freeing and reallocating Linux kernel buffers? Beside being inefficient, it would expose us to even more errors during reset under heavy workloads? From what I understand so far, do_reset() is complicated because it is attempting some optimizations. If we are going to fall back to hard reset for every error we might as well drop the do_reset() and just do the "thorough hard reset" every time right? The protocol spec is ambiguous and so far I did not get a clear answer on whether the link-down is even needed. If it is needed, then should we add it to do_hard_reset() also? If not, we should remove it (like you mentioned your earlier) completely but am waiting for confirmation on that. git history has not been helpful. While there are other rough edges around do_reset() that we are working on fixing separately (eg: ignore the error return from __ibmvnic_close() right above this change) I see a benefit to the customer with this patch. I am not convinced we should perform a hard reset just because the link down failed. Sukadev ^ permalink raw reply [flat|nested] 32+ messages in thread
* Re: [PATCH V2 net] ibmvnic: Continue with reset if set link down failed @ 2021-04-21 6:45 ` Sukadev Bhattiprolu 0 siblings, 0 replies; 32+ messages in thread From: Sukadev Bhattiprolu @ 2021-04-21 6:45 UTC (permalink / raw) To: Lijun Pan Cc: netdev, Tom Falcon, paulus, Dany Madden, Jakub Kicinski, linuxppc-dev, David Miller Lijun Pan [ljp@linux.vnet.ibm.com] wrote: > > > > On Apr 20, 2021, at 4:35 PM, Dany Madden <drt@linux.ibm.com> wrote: > > > > When ibmvnic gets a FATAL error message from the vnicserver, it marks > > the Command Respond Queue (CRQ) inactive and resets the adapter. If this > > FATAL reset fails and a transmission timeout reset follows, the CRQ is > > still inactive, ibmvnic's attempt to set link down will also fail. If > > ibmvnic abandons the reset because of this failed set link down and this > > is the last reset in the workqueue, then this adapter will be left in an > > inoperable state. > > > > Instead, make the driver ignore this link down failure and continue to > > free and re-register CRQ so that the adapter has an opportunity to > > recover. > > This v2 does not adddress the concerns mentioned in v1. > And I think it is better to exit with error from do_reset, and schedule a thorough > do_hard_reset if the the adapter is already in unstable state. We had a FATAL error and when handling it, we failed to send a link-down message to the VIOS. So what we need to try next is to reset the connection with the VIOS. For this we must talk to the firmware using the H_FREE_CRQ and H_REG_CRQ hcalls. do_reset() does just that in ibmvnic_reset_crq(). Now, sure we can attempt a "thorough hard reset" which also does the same hcalls to reestablish the connection. Is there any other magic in do_hard_reset()? But in addition, it also frees lot more Linux kernel buffers and reallocates them for instance. If we are having a communication problem with the VIOS, what is the point of freeing and reallocating Linux kernel buffers? Beside being inefficient, it would expose us to even more errors during reset under heavy workloads? From what I understand so far, do_reset() is complicated because it is attempting some optimizations. If we are going to fall back to hard reset for every error we might as well drop the do_reset() and just do the "thorough hard reset" every time right? The protocol spec is ambiguous and so far I did not get a clear answer on whether the link-down is even needed. If it is needed, then should we add it to do_hard_reset() also? If not, we should remove it (like you mentioned your earlier) completely but am waiting for confirmation on that. git history has not been helpful. While there are other rough edges around do_reset() that we are working on fixing separately (eg: ignore the error return from __ibmvnic_close() right above this change) I see a benefit to the customer with this patch. I am not convinced we should perform a hard reset just because the link down failed. Sukadev ^ permalink raw reply [flat|nested] 32+ messages in thread
* Re: [PATCH V2 net] ibmvnic: Continue with reset if set link down failed 2021-04-21 6:45 ` Sukadev Bhattiprolu @ 2021-04-22 5:06 ` Lijun Pan -1 siblings, 0 replies; 32+ messages in thread From: Lijun Pan @ 2021-04-22 5:06 UTC (permalink / raw) To: Sukadev Bhattiprolu Cc: Lijun Pan, Dany Madden, David Miller, Jakub Kicinski, Tom Falcon, netdev, Paul Mackerras, linuxppc-dev On Wed, Apr 21, 2021 at 2:25 AM Sukadev Bhattiprolu <sukadev@linux.ibm.com> wrote: > > Lijun Pan [ljp@linux.vnet.ibm.com] wrote: > > > > > > > On Apr 20, 2021, at 4:35 PM, Dany Madden <drt@linux.ibm.com> wrote: > > > > > > When ibmvnic gets a FATAL error message from the vnicserver, it marks > > > the Command Respond Queue (CRQ) inactive and resets the adapter. If this > > > FATAL reset fails and a transmission timeout reset follows, the CRQ is > > > still inactive, ibmvnic's attempt to set link down will also fail. If > > > ibmvnic abandons the reset because of this failed set link down and this > > > is the last reset in the workqueue, then this adapter will be left in an > > > inoperable state. > > > > > > Instead, make the driver ignore this link down failure and continue to > > > free and re-register CRQ so that the adapter has an opportunity to > > > recover. > > > > This v2 does not adddress the concerns mentioned in v1. > > And I think it is better to exit with error from do_reset, and schedule a thorough > > do_hard_reset if the the adapter is already in unstable state. > > We had a FATAL error and when handling it, we failed to send a > link-down message to the VIOS. So what we need to try next is to > reset the connection with the VIOS. For this we must talk to the > firmware using the H_FREE_CRQ and H_REG_CRQ hcalls. do_reset() > does just that in ibmvnic_reset_crq(). > > Now, sure we can attempt a "thorough hard reset" which also does > the same hcalls to reestablish the connection. Is there any > other magic in do_hard_reset()? But in addition, it also frees lot > more Linux kernel buffers and reallocates them for instance. Working around everything in do_reset will make the code very difficult to manage. Ultimately do_reset can do anything I am afraid, and do_hard_reset can be removed completely or merged into do_reset. > > If we are having a communication problem with the VIOS, what is > the point of freeing and reallocating Linux kernel buffers? Beside > being inefficient, it would expose us to even more errors during > reset under heavy workloads? No real customer runs the system under that heavy load created by HTX stress test, which can tear down any working system. > > From what I understand so far, do_reset() is complicated because > it is attempting some optimizations. If we are going to fall back > to hard reset for every error we might as well drop the do_reset() > and just do the "thorough hard reset" every time right? I think such optimizations are catered for passing HTX tests. Whether the optimization benefits the adapter, say making the adapter more stable, I doubt it. I think there should be a trade off between optimization and stability. ^ permalink raw reply [flat|nested] 32+ messages in thread
* Re: [PATCH V2 net] ibmvnic: Continue with reset if set link down failed @ 2021-04-22 5:06 ` Lijun Pan 0 siblings, 0 replies; 32+ messages in thread From: Lijun Pan @ 2021-04-22 5:06 UTC (permalink / raw) To: Sukadev Bhattiprolu Cc: netdev, Lijun Pan, Tom Falcon, Paul Mackerras, Dany Madden, Jakub Kicinski, linuxppc-dev, David Miller On Wed, Apr 21, 2021 at 2:25 AM Sukadev Bhattiprolu <sukadev@linux.ibm.com> wrote: > > Lijun Pan [ljp@linux.vnet.ibm.com] wrote: > > > > > > > On Apr 20, 2021, at 4:35 PM, Dany Madden <drt@linux.ibm.com> wrote: > > > > > > When ibmvnic gets a FATAL error message from the vnicserver, it marks > > > the Command Respond Queue (CRQ) inactive and resets the adapter. If this > > > FATAL reset fails and a transmission timeout reset follows, the CRQ is > > > still inactive, ibmvnic's attempt to set link down will also fail. If > > > ibmvnic abandons the reset because of this failed set link down and this > > > is the last reset in the workqueue, then this adapter will be left in an > > > inoperable state. > > > > > > Instead, make the driver ignore this link down failure and continue to > > > free and re-register CRQ so that the adapter has an opportunity to > > > recover. > > > > This v2 does not adddress the concerns mentioned in v1. > > And I think it is better to exit with error from do_reset, and schedule a thorough > > do_hard_reset if the the adapter is already in unstable state. > > We had a FATAL error and when handling it, we failed to send a > link-down message to the VIOS. So what we need to try next is to > reset the connection with the VIOS. For this we must talk to the > firmware using the H_FREE_CRQ and H_REG_CRQ hcalls. do_reset() > does just that in ibmvnic_reset_crq(). > > Now, sure we can attempt a "thorough hard reset" which also does > the same hcalls to reestablish the connection. Is there any > other magic in do_hard_reset()? But in addition, it also frees lot > more Linux kernel buffers and reallocates them for instance. Working around everything in do_reset will make the code very difficult to manage. Ultimately do_reset can do anything I am afraid, and do_hard_reset can be removed completely or merged into do_reset. > > If we are having a communication problem with the VIOS, what is > the point of freeing and reallocating Linux kernel buffers? Beside > being inefficient, it would expose us to even more errors during > reset under heavy workloads? No real customer runs the system under that heavy load created by HTX stress test, which can tear down any working system. > > From what I understand so far, do_reset() is complicated because > it is attempting some optimizations. If we are going to fall back > to hard reset for every error we might as well drop the do_reset() > and just do the "thorough hard reset" every time right? I think such optimizations are catered for passing HTX tests. Whether the optimization benefits the adapter, say making the adapter more stable, I doubt it. I think there should be a trade off between optimization and stability. ^ permalink raw reply [flat|nested] 32+ messages in thread
* Re: [PATCH V2 net] ibmvnic: Continue with reset if set link down failed 2021-04-22 5:06 ` Lijun Pan @ 2021-04-22 6:58 ` Sukadev Bhattiprolu -1 siblings, 0 replies; 32+ messages in thread From: Sukadev Bhattiprolu @ 2021-04-22 6:58 UTC (permalink / raw) To: Lijun Pan Cc: Lijun Pan, Dany Madden, David Miller, Jakub Kicinski, Tom Falcon, netdev, Paul Mackerras, linuxppc-dev Lijun Pan [lijunp213@gmail.com] wrote: > > Now, sure we can attempt a "thorough hard reset" which also does > > the same hcalls to reestablish the connection. Is there any > > other magic in do_hard_reset()? But in addition, it also frees lot > > more Linux kernel buffers and reallocates them for instance. > > Working around everything in do_reset will make the code very difficult We are not working around everything. We are doing in do_reset() exactly what we would do in hard reset for this error (ignore the set link down error and try to reestablish the connection with the VIOS). What we are avoiding is unnecessary work on the Linux side for a communication problem on the VIOS side. > to manage. Ultimately do_reset can do anything I am afraid, and do_hard_reset > can be removed completely or merged into do_reset. > > > > > If we are having a communication problem with the VIOS, what is > > the point of freeing and reallocating Linux kernel buffers? Beside > > being inefficient, it would expose us to even more errors during > > reset under heavy workloads? > > No real customer runs the system under that heavy load created by > HTX stress test, which can tear down any working system. We need to talk to capacity planning and test architects about that, but all I want to know is what hard reset would do differently to fix this communication error with VIOS. Sukadev ^ permalink raw reply [flat|nested] 32+ messages in thread
* Re: [PATCH V2 net] ibmvnic: Continue with reset if set link down failed @ 2021-04-22 6:58 ` Sukadev Bhattiprolu 0 siblings, 0 replies; 32+ messages in thread From: Sukadev Bhattiprolu @ 2021-04-22 6:58 UTC (permalink / raw) To: Lijun Pan Cc: netdev, Lijun Pan, Tom Falcon, Paul Mackerras, Dany Madden, Jakub Kicinski, linuxppc-dev, David Miller Lijun Pan [lijunp213@gmail.com] wrote: > > Now, sure we can attempt a "thorough hard reset" which also does > > the same hcalls to reestablish the connection. Is there any > > other magic in do_hard_reset()? But in addition, it also frees lot > > more Linux kernel buffers and reallocates them for instance. > > Working around everything in do_reset will make the code very difficult We are not working around everything. We are doing in do_reset() exactly what we would do in hard reset for this error (ignore the set link down error and try to reestablish the connection with the VIOS). What we are avoiding is unnecessary work on the Linux side for a communication problem on the VIOS side. > to manage. Ultimately do_reset can do anything I am afraid, and do_hard_reset > can be removed completely or merged into do_reset. > > > > > If we are having a communication problem with the VIOS, what is > > the point of freeing and reallocating Linux kernel buffers? Beside > > being inefficient, it would expose us to even more errors during > > reset under heavy workloads? > > No real customer runs the system under that heavy load created by > HTX stress test, which can tear down any working system. We need to talk to capacity planning and test architects about that, but all I want to know is what hard reset would do differently to fix this communication error with VIOS. Sukadev ^ permalink raw reply [flat|nested] 32+ messages in thread
* Re: [PATCH V2 net] ibmvnic: Continue with reset if set link down failed 2021-04-22 5:06 ` Lijun Pan @ 2021-04-22 7:05 ` Rick Lindsley -1 siblings, 0 replies; 32+ messages in thread From: Rick Lindsley @ 2021-04-22 7:05 UTC (permalink / raw) To: Lijun Pan, Sukadev Bhattiprolu Cc: Lijun Pan, Dany Madden, David Miller, Jakub Kicinski, Tom Falcon, netdev, Paul Mackerras, linuxppc-dev On 4/21/21 10:06 PM, Lijun Pan wrote: > No real customer runs the system under that heavy load created by > HTX stress test, which can tear down any working system. So, are you saying the bugs that HTX uncovers are not worth fixing? There's a fair number of individuals and teams out there that utilize HTX in the absence of such a workload, and not just for vnic. What workload would you suggest to better serve "real customers"? > I think such optimizations are catered for passing HTX tests. Well, yes. If the bugs are found with HTX, the fixes are verified against HTX. If they were found with a "real customer workload" they'd be verified against a "real customer workload." ^ permalink raw reply [flat|nested] 32+ messages in thread
* Re: [PATCH V2 net] ibmvnic: Continue with reset if set link down failed @ 2021-04-22 7:05 ` Rick Lindsley 0 siblings, 0 replies; 32+ messages in thread From: Rick Lindsley @ 2021-04-22 7:05 UTC (permalink / raw) To: Lijun Pan, Sukadev Bhattiprolu Cc: netdev, Lijun Pan, Tom Falcon, Paul Mackerras, Dany Madden, Jakub Kicinski, linuxppc-dev, David Miller On 4/21/21 10:06 PM, Lijun Pan wrote: > No real customer runs the system under that heavy load created by > HTX stress test, which can tear down any working system. So, are you saying the bugs that HTX uncovers are not worth fixing? There's a fair number of individuals and teams out there that utilize HTX in the absence of such a workload, and not just for vnic. What workload would you suggest to better serve "real customers"? > I think such optimizations are catered for passing HTX tests. Well, yes. If the bugs are found with HTX, the fixes are verified against HTX. If they were found with a "real customer workload" they'd be verified against a "real customer workload." ^ permalink raw reply [flat|nested] 32+ messages in thread
* Re: [PATCH V2 net] ibmvnic: Continue with reset if set link down failed 2021-04-22 5:06 ` Lijun Pan @ 2021-04-22 17:21 ` Michal Suchánek -1 siblings, 0 replies; 32+ messages in thread From: Michal Suchánek @ 2021-04-22 17:21 UTC (permalink / raw) To: Lijun Pan Cc: Sukadev Bhattiprolu, netdev, Lijun Pan, Tom Falcon, Paul Mackerras, Dany Madden, Jakub Kicinski, linuxppc-dev, David Miller Hello, On Thu, Apr 22, 2021 at 12:06:45AM -0500, Lijun Pan wrote: > On Wed, Apr 21, 2021 at 2:25 AM Sukadev Bhattiprolu > <sukadev@linux.ibm.com> wrote: > > > > Lijun Pan [ljp@linux.vnet.ibm.com] wrote: > > > > > > > > > > On Apr 20, 2021, at 4:35 PM, Dany Madden <drt@linux.ibm.com> wrote: > > > > > > > > When ibmvnic gets a FATAL error message from the vnicserver, it marks > > > > the Command Respond Queue (CRQ) inactive and resets the adapter. If this > > > > FATAL reset fails and a transmission timeout reset follows, the CRQ is > > > > still inactive, ibmvnic's attempt to set link down will also fail. If > > > > ibmvnic abandons the reset because of this failed set link down and this > > > > is the last reset in the workqueue, then this adapter will be left in an > > > > inoperable state. > > > > > > > > Instead, make the driver ignore this link down failure and continue to > > > > free and re-register CRQ so that the adapter has an opportunity to > > > > recover. > > > > > > This v2 does not adddress the concerns mentioned in v1. > > > And I think it is better to exit with error from do_reset, and schedule a thorough > > > do_hard_reset if the the adapter is already in unstable state. > > > > We had a FATAL error and when handling it, we failed to send a > > link-down message to the VIOS. So what we need to try next is to > > reset the connection with the VIOS. For this we must talk to the > > firmware using the H_FREE_CRQ and H_REG_CRQ hcalls. do_reset() > > does just that in ibmvnic_reset_crq(). > > > > Now, sure we can attempt a "thorough hard reset" which also does > > the same hcalls to reestablish the connection. Is there any > > other magic in do_hard_reset()? But in addition, it also frees lot > > more Linux kernel buffers and reallocates them for instance. > > Working around everything in do_reset will make the code very difficult > to manage. Ultimately do_reset can do anything I am afraid, and do_hard_reset > can be removed completely or merged into do_reset. This debate is not very constructive. In the context of driver that has separate do_reset and do_hard_reset this fix picks the correct one unless you can refute the arguments provided. Merging do_reset and do_hard_reset might be a good code cleanup which is out of the scope of this fix. Given that vast majority of fixes to the vnic driver are related to the reset handling it would improve stability and testability if every reset took the same code path. In the context of merging do_hard_reset and do_reset the question is what is the intended distinction and performance gain by having 'lightweight' reset. I don't have a vnic protocol manual at hand and I suspect I would not get one even if I searched for one. From reading through the fixes in the past my understanding is that the full reset is required when the backend changes which then potentially requires different size/number of buffers. What is the expected situation when reset is required without changing the backend? Is this so common that it warrants a separate 'lightweight' optimized function? Thanks Michal ^ permalink raw reply [flat|nested] 32+ messages in thread
* Re: [PATCH V2 net] ibmvnic: Continue with reset if set link down failed @ 2021-04-22 17:21 ` Michal Suchánek 0 siblings, 0 replies; 32+ messages in thread From: Michal Suchánek @ 2021-04-22 17:21 UTC (permalink / raw) To: Lijun Pan Cc: netdev, Lijun Pan, Tom Falcon, Paul Mackerras, Dany Madden, Jakub Kicinski, Sukadev Bhattiprolu, linuxppc-dev, David Miller Hello, On Thu, Apr 22, 2021 at 12:06:45AM -0500, Lijun Pan wrote: > On Wed, Apr 21, 2021 at 2:25 AM Sukadev Bhattiprolu > <sukadev@linux.ibm.com> wrote: > > > > Lijun Pan [ljp@linux.vnet.ibm.com] wrote: > > > > > > > > > > On Apr 20, 2021, at 4:35 PM, Dany Madden <drt@linux.ibm.com> wrote: > > > > > > > > When ibmvnic gets a FATAL error message from the vnicserver, it marks > > > > the Command Respond Queue (CRQ) inactive and resets the adapter. If this > > > > FATAL reset fails and a transmission timeout reset follows, the CRQ is > > > > still inactive, ibmvnic's attempt to set link down will also fail. If > > > > ibmvnic abandons the reset because of this failed set link down and this > > > > is the last reset in the workqueue, then this adapter will be left in an > > > > inoperable state. > > > > > > > > Instead, make the driver ignore this link down failure and continue to > > > > free and re-register CRQ so that the adapter has an opportunity to > > > > recover. > > > > > > This v2 does not adddress the concerns mentioned in v1. > > > And I think it is better to exit with error from do_reset, and schedule a thorough > > > do_hard_reset if the the adapter is already in unstable state. > > > > We had a FATAL error and when handling it, we failed to send a > > link-down message to the VIOS. So what we need to try next is to > > reset the connection with the VIOS. For this we must talk to the > > firmware using the H_FREE_CRQ and H_REG_CRQ hcalls. do_reset() > > does just that in ibmvnic_reset_crq(). > > > > Now, sure we can attempt a "thorough hard reset" which also does > > the same hcalls to reestablish the connection. Is there any > > other magic in do_hard_reset()? But in addition, it also frees lot > > more Linux kernel buffers and reallocates them for instance. > > Working around everything in do_reset will make the code very difficult > to manage. Ultimately do_reset can do anything I am afraid, and do_hard_reset > can be removed completely or merged into do_reset. This debate is not very constructive. In the context of driver that has separate do_reset and do_hard_reset this fix picks the correct one unless you can refute the arguments provided. Merging do_reset and do_hard_reset might be a good code cleanup which is out of the scope of this fix. Given that vast majority of fixes to the vnic driver are related to the reset handling it would improve stability and testability if every reset took the same code path. In the context of merging do_hard_reset and do_reset the question is what is the intended distinction and performance gain by having 'lightweight' reset. I don't have a vnic protocol manual at hand and I suspect I would not get one even if I searched for one. From reading through the fixes in the past my understanding is that the full reset is required when the backend changes which then potentially requires different size/number of buffers. What is the expected situation when reset is required without changing the backend? Is this so common that it warrants a separate 'lightweight' optimized function? Thanks Michal ^ permalink raw reply [flat|nested] 32+ messages in thread
* Re: [PATCH V2 net] ibmvnic: Continue with reset if set link down failed 2021-04-22 17:21 ` Michal Suchánek @ 2021-04-22 17:38 ` Lijun Pan -1 siblings, 0 replies; 32+ messages in thread From: Lijun Pan @ 2021-04-22 17:38 UTC (permalink / raw) To: Michal Suchánek Cc: Sukadev Bhattiprolu, netdev, Lijun Pan, Tom Falcon, Paul Mackerras, Dany Madden, Jakub Kicinski, linuxppc-dev, David Miller On Thu, Apr 22, 2021 at 12:21 PM Michal Suchánek <msuchanek@suse.de> wrote: > > Hello, > > On Thu, Apr 22, 2021 at 12:06:45AM -0500, Lijun Pan wrote: > > On Wed, Apr 21, 2021 at 2:25 AM Sukadev Bhattiprolu > > <sukadev@linux.ibm.com> wrote: > > > > > > Lijun Pan [ljp@linux.vnet.ibm.com] wrote: > > > > > > > > > > > > > On Apr 20, 2021, at 4:35 PM, Dany Madden <drt@linux.ibm.com> wrote: > > > > > > > > > > When ibmvnic gets a FATAL error message from the vnicserver, it marks > > > > > the Command Respond Queue (CRQ) inactive and resets the adapter. If this > > > > > FATAL reset fails and a transmission timeout reset follows, the CRQ is > > > > > still inactive, ibmvnic's attempt to set link down will also fail. If > > > > > ibmvnic abandons the reset because of this failed set link down and this > > > > > is the last reset in the workqueue, then this adapter will be left in an > > > > > inoperable state. > > > > > > > > > > Instead, make the driver ignore this link down failure and continue to > > > > > free and re-register CRQ so that the adapter has an opportunity to > > > > > recover. > > > > > > > > This v2 does not adddress the concerns mentioned in v1. > > > > And I think it is better to exit with error from do_reset, and schedule a thorough > > > > do_hard_reset if the the adapter is already in unstable state. > > > > > > We had a FATAL error and when handling it, we failed to send a > > > link-down message to the VIOS. So what we need to try next is to > > > reset the connection with the VIOS. For this we must talk to the > > > firmware using the H_FREE_CRQ and H_REG_CRQ hcalls. do_reset() > > > does just that in ibmvnic_reset_crq(). > > > > > > Now, sure we can attempt a "thorough hard reset" which also does > > > the same hcalls to reestablish the connection. Is there any > > > other magic in do_hard_reset()? But in addition, it also frees lot > > > more Linux kernel buffers and reallocates them for instance. > > > > Working around everything in do_reset will make the code very difficult > > to manage. Ultimately do_reset can do anything I am afraid, and do_hard_reset > > can be removed completely or merged into do_reset. Michal, I should have given more details about the above statement. Thanks for your detailed info in the below. > > This debate is not very constructive. > > In the context of driver that has separate do_reset and do_hard_reset > this fix picks the correct one unless you can refute the arguments > provided. > > Merging do_reset and do_hard_reset might be a good code cleanup which is > out of the scope of this fix. Right. > > > > Given that vast majority of fixes to the vnic driver are related to the > reset handling it would improve stability and testability if every > reset took the same code path. I agree. > > In the context of merging do_hard_reset and do_reset the question is > what is the intended distinction and performance gain by having > 'lightweight' reset. Right. > > I don't have a vnic protocol manual at hand and I suspect I would not > get one even if I searched for one. > > From reading through the fixes in the past my understanding is that the > full reset is required when the backend changes which then potentially > requires different size/number of buffers. Yes, full reset is better when the backend changes. > > What is the expected situation when reset is required without changing > the backend? > > Is this so common that it warrants a separate 'lightweight' optimized > function? > ^ permalink raw reply [flat|nested] 32+ messages in thread
* Re: [PATCH V2 net] ibmvnic: Continue with reset if set link down failed @ 2021-04-22 17:38 ` Lijun Pan 0 siblings, 0 replies; 32+ messages in thread From: Lijun Pan @ 2021-04-22 17:38 UTC (permalink / raw) To: Michal Suchánek Cc: netdev, Lijun Pan, Tom Falcon, Paul Mackerras, Dany Madden, Jakub Kicinski, Sukadev Bhattiprolu, linuxppc-dev, David Miller On Thu, Apr 22, 2021 at 12:21 PM Michal Suchánek <msuchanek@suse.de> wrote: > > Hello, > > On Thu, Apr 22, 2021 at 12:06:45AM -0500, Lijun Pan wrote: > > On Wed, Apr 21, 2021 at 2:25 AM Sukadev Bhattiprolu > > <sukadev@linux.ibm.com> wrote: > > > > > > Lijun Pan [ljp@linux.vnet.ibm.com] wrote: > > > > > > > > > > > > > On Apr 20, 2021, at 4:35 PM, Dany Madden <drt@linux.ibm.com> wrote: > > > > > > > > > > When ibmvnic gets a FATAL error message from the vnicserver, it marks > > > > > the Command Respond Queue (CRQ) inactive and resets the adapter. If this > > > > > FATAL reset fails and a transmission timeout reset follows, the CRQ is > > > > > still inactive, ibmvnic's attempt to set link down will also fail. If > > > > > ibmvnic abandons the reset because of this failed set link down and this > > > > > is the last reset in the workqueue, then this adapter will be left in an > > > > > inoperable state. > > > > > > > > > > Instead, make the driver ignore this link down failure and continue to > > > > > free and re-register CRQ so that the adapter has an opportunity to > > > > > recover. > > > > > > > > This v2 does not adddress the concerns mentioned in v1. > > > > And I think it is better to exit with error from do_reset, and schedule a thorough > > > > do_hard_reset if the the adapter is already in unstable state. > > > > > > We had a FATAL error and when handling it, we failed to send a > > > link-down message to the VIOS. So what we need to try next is to > > > reset the connection with the VIOS. For this we must talk to the > > > firmware using the H_FREE_CRQ and H_REG_CRQ hcalls. do_reset() > > > does just that in ibmvnic_reset_crq(). > > > > > > Now, sure we can attempt a "thorough hard reset" which also does > > > the same hcalls to reestablish the connection. Is there any > > > other magic in do_hard_reset()? But in addition, it also frees lot > > > more Linux kernel buffers and reallocates them for instance. > > > > Working around everything in do_reset will make the code very difficult > > to manage. Ultimately do_reset can do anything I am afraid, and do_hard_reset > > can be removed completely or merged into do_reset. Michal, I should have given more details about the above statement. Thanks for your detailed info in the below. > > This debate is not very constructive. > > In the context of driver that has separate do_reset and do_hard_reset > this fix picks the correct one unless you can refute the arguments > provided. > > Merging do_reset and do_hard_reset might be a good code cleanup which is > out of the scope of this fix. Right. > > > > Given that vast majority of fixes to the vnic driver are related to the > reset handling it would improve stability and testability if every > reset took the same code path. I agree. > > In the context of merging do_hard_reset and do_reset the question is > what is the intended distinction and performance gain by having > 'lightweight' reset. Right. > > I don't have a vnic protocol manual at hand and I suspect I would not > get one even if I searched for one. > > From reading through the fixes in the past my understanding is that the > full reset is required when the backend changes which then potentially > requires different size/number of buffers. Yes, full reset is better when the backend changes. > > What is the expected situation when reset is required without changing > the backend? > > Is this so common that it warrants a separate 'lightweight' optimized > function? > ^ permalink raw reply [flat|nested] 32+ messages in thread
* Re: [PATCH V2 net] ibmvnic: Continue with reset if set link down failed 2021-04-22 17:21 ` Michal Suchánek @ 2021-04-23 2:26 ` Rick Lindsley -1 siblings, 0 replies; 32+ messages in thread From: Rick Lindsley @ 2021-04-23 2:26 UTC (permalink / raw) To: Michal Suchánek, Lijun Pan Cc: Sukadev Bhattiprolu, netdev, Lijun Pan, Tom Falcon, Paul Mackerras, Dany Madden, Jakub Kicinski, linuxppc-dev, David Miller On 4/22/21 10:21 AM, Michal Suchánek wrote: > Merging do_reset and do_hard_reset might be a good code cleanup which is > out of the scope of this fix. I agree, on both points. Accepting the patch, and improving the reset path are not in conflict with each other. My position is that improving the reset path is certainly on the table, but not with this bug or this fix. Within the context of this discovered problem, the issue is well understood, the fix is small and addresses the immediate problem, and it has not generated new bugs under subsequent testing. For those reasons, the submitted patch has my support. Rick ^ permalink raw reply [flat|nested] 32+ messages in thread
* Re: [PATCH V2 net] ibmvnic: Continue with reset if set link down failed @ 2021-04-23 2:26 ` Rick Lindsley 0 siblings, 0 replies; 32+ messages in thread From: Rick Lindsley @ 2021-04-23 2:26 UTC (permalink / raw) To: Michal Suchánek, Lijun Pan Cc: netdev, Lijun Pan, Tom Falcon, Paul Mackerras, Dany Madden, Jakub Kicinski, Sukadev Bhattiprolu, linuxppc-dev, David Miller On 4/22/21 10:21 AM, Michal Suchánek wrote: > Merging do_reset and do_hard_reset might be a good code cleanup which is > out of the scope of this fix. I agree, on both points. Accepting the patch, and improving the reset path are not in conflict with each other. My position is that improving the reset path is certainly on the table, but not with this bug or this fix. Within the context of this discovered problem, the issue is well understood, the fix is small and addresses the immediate problem, and it has not generated new bugs under subsequent testing. For those reasons, the submitted patch has my support. Rick ^ permalink raw reply [flat|nested] 32+ messages in thread
* Re: [PATCH V2 net] ibmvnic: Continue with reset if set link down failed 2021-04-23 2:26 ` Rick Lindsley @ 2021-05-04 19:05 ` Dany Madden -1 siblings, 0 replies; 32+ messages in thread From: Dany Madden @ 2021-05-04 19:05 UTC (permalink / raw) To: Rick Lindsley Cc: Michal Suchánek, Lijun Pan, Sukadev Bhattiprolu, netdev, Lijun Pan, Tom Falcon, Paul Mackerras, Jakub Kicinski, linuxppc-dev, David Miller On 2021-04-22 19:26, Rick Lindsley wrote: > On 4/22/21 10:21 AM, Michal Suchánek wrote: > >> Merging do_reset and do_hard_reset might be a good code cleanup which >> is >> out of the scope of this fix. > > I agree, on both points. Accepting the patch, and improving the reset > path are not in conflict with each other. > > My position is that improving the reset path is certainly on the table, > but not with this bug or this fix. Within the context of this > discovered > problem, the issue is well understood, the fix is small and addresses > the > immediate problem, and it has not generated new bugs under subsequent > testing. For those reasons, the submitted patch has my support. > > Rick Thanks for all the feedback on the patch. Refactoring the ibmvnic reset functions is something we plan to evaluate, as this would potentially simplify the reset code. In the mean time, the proposed patch is simple, and has been validated in our test environment to solve an issue resulting in vnic interfaces getting stuck in an offline state following a vnic timeout reset. Given that, I suggest we merge this patch while we look at further optimizations in future. I will submit a V3 patch shortly. Dany ^ permalink raw reply [flat|nested] 32+ messages in thread
* Re: [PATCH V2 net] ibmvnic: Continue with reset if set link down failed @ 2021-05-04 19:05 ` Dany Madden 0 siblings, 0 replies; 32+ messages in thread From: Dany Madden @ 2021-05-04 19:05 UTC (permalink / raw) To: Rick Lindsley Cc: Michal Suchánek, Lijun Pan, netdev, Lijun Pan, Tom Falcon, Paul Mackerras, Jakub Kicinski, Sukadev Bhattiprolu, linuxppc-dev, David Miller On 2021-04-22 19:26, Rick Lindsley wrote: > On 4/22/21 10:21 AM, Michal Suchánek wrote: > >> Merging do_reset and do_hard_reset might be a good code cleanup which >> is >> out of the scope of this fix. > > I agree, on both points. Accepting the patch, and improving the reset > path are not in conflict with each other. > > My position is that improving the reset path is certainly on the table, > but not with this bug or this fix. Within the context of this > discovered > problem, the issue is well understood, the fix is small and addresses > the > immediate problem, and it has not generated new bugs under subsequent > testing. For those reasons, the submitted patch has my support. > > Rick Thanks for all the feedback on the patch. Refactoring the ibmvnic reset functions is something we plan to evaluate, as this would potentially simplify the reset code. In the mean time, the proposed patch is simple, and has been validated in our test environment to solve an issue resulting in vnic interfaces getting stuck in an offline state following a vnic timeout reset. Given that, I suggest we merge this patch while we look at further optimizations in future. I will submit a V3 patch shortly. Dany ^ permalink raw reply [flat|nested] 32+ messages in thread
* Re: [PATCH V2 net] ibmvnic: Continue with reset if set link down failed 2021-04-20 21:42 ` Lijun Pan @ 2021-04-21 7:54 ` Rick Lindsley -1 siblings, 0 replies; 32+ messages in thread From: Rick Lindsley @ 2021-04-21 7:54 UTC (permalink / raw) To: Lijun Pan, Dany Madden Cc: David Miller, Jakub Kicinski, Tom Falcon, netdev, paulus, Sukadev Bhattiprolu, linuxppc-dev On 4/20/21 2:42 PM, Lijun Pan wrote: > > This v2 does not adddress the concerns mentioned in v1. > And I think it is better to exit with error from do_reset, and schedule a thorough > do_hard_reset if the the adapter is already in unstable state. But the point is that the testing and analysis has indicated that doing a full hard reset is not necessary. We are about to take the very action which will fix this situation, but currently do not. Please describe the advantage in deferring it further by routing it through do_hard_reset(). I don't see one. Rick ^ permalink raw reply [flat|nested] 32+ messages in thread
* Re: [PATCH V2 net] ibmvnic: Continue with reset if set link down failed @ 2021-04-21 7:54 ` Rick Lindsley 0 siblings, 0 replies; 32+ messages in thread From: Rick Lindsley @ 2021-04-21 7:54 UTC (permalink / raw) To: Lijun Pan, Dany Madden Cc: netdev, Tom Falcon, paulus, Jakub Kicinski, Sukadev Bhattiprolu, linuxppc-dev, David Miller On 4/20/21 2:42 PM, Lijun Pan wrote: > > This v2 does not adddress the concerns mentioned in v1. > And I think it is better to exit with error from do_reset, and schedule a thorough > do_hard_reset if the the adapter is already in unstable state. But the point is that the testing and analysis has indicated that doing a full hard reset is not necessary. We are about to take the very action which will fix this situation, but currently do not. Please describe the advantage in deferring it further by routing it through do_hard_reset(). I don't see one. Rick ^ permalink raw reply [flat|nested] 32+ messages in thread
* Re: [PATCH V2 net] ibmvnic: Continue with reset if set link down failed 2021-04-21 7:54 ` Rick Lindsley @ 2021-04-22 5:12 ` Lijun Pan -1 siblings, 0 replies; 32+ messages in thread From: Lijun Pan @ 2021-04-22 5:12 UTC (permalink / raw) To: Rick Lindsley Cc: Lijun Pan, Dany Madden, David Miller, Jakub Kicinski, Tom Falcon, netdev, Paul Mackerras, Sukadev Bhattiprolu, linuxppc-dev On Wed, Apr 21, 2021 at 3:06 AM Rick Lindsley <ricklind@linux.vnet.ibm.com> wrote: > > On 4/20/21 2:42 PM, Lijun Pan wrote: > > > > This v2 does not adddress the concerns mentioned in v1. > > And I think it is better to exit with error from do_reset, and schedule a thorough > > do_hard_reset if the the adapter is already in unstable state. > > But the point is that the testing and analysis has indicated that doing a full > hard reset is not necessary. We are about to take the very action which will fix > this situation, but currently do not. The testing was done on this patch. It was not performed on a full hard reset. So I don't think you could even compare the two results. > > Please describe the advantage in deferring it further by routing it through > do_hard_reset(). I don't see one. It is not deferred. It exits with error and calls do_hard_reset. See my reply to Suka's. ^ permalink raw reply [flat|nested] 32+ messages in thread
* Re: [PATCH V2 net] ibmvnic: Continue with reset if set link down failed @ 2021-04-22 5:12 ` Lijun Pan 0 siblings, 0 replies; 32+ messages in thread From: Lijun Pan @ 2021-04-22 5:12 UTC (permalink / raw) To: Rick Lindsley Cc: netdev, Lijun Pan, Tom Falcon, Paul Mackerras, Dany Madden, Jakub Kicinski, Sukadev Bhattiprolu, linuxppc-dev, David Miller On Wed, Apr 21, 2021 at 3:06 AM Rick Lindsley <ricklind@linux.vnet.ibm.com> wrote: > > On 4/20/21 2:42 PM, Lijun Pan wrote: > > > > This v2 does not adddress the concerns mentioned in v1. > > And I think it is better to exit with error from do_reset, and schedule a thorough > > do_hard_reset if the the adapter is already in unstable state. > > But the point is that the testing and analysis has indicated that doing a full > hard reset is not necessary. We are about to take the very action which will fix > this situation, but currently do not. The testing was done on this patch. It was not performed on a full hard reset. So I don't think you could even compare the two results. > > Please describe the advantage in deferring it further by routing it through > do_hard_reset(). I don't see one. It is not deferred. It exits with error and calls do_hard_reset. See my reply to Suka's. ^ permalink raw reply [flat|nested] 32+ messages in thread
* Re: [PATCH V2 net] ibmvnic: Continue with reset if set link down failed 2021-04-22 5:12 ` Lijun Pan @ 2021-04-22 7:05 ` Rick Lindsley -1 siblings, 0 replies; 32+ messages in thread From: Rick Lindsley @ 2021-04-22 7:05 UTC (permalink / raw) To: Lijun Pan Cc: Lijun Pan, Dany Madden, David Miller, Jakub Kicinski, Tom Falcon, netdev, Paul Mackerras, Sukadev Bhattiprolu, linuxppc-dev On Wed, Apr 21, 2021 at 3:06 AM Rick Lindsley <ricklind@linux.vnet.ibm.com> wrote: >> Please describe the advantage in deferring it further by routing it through >> do_hard_reset(). I don't see one. On 4/21/21 10:12 PM, Lijun Pan replied: > It is not deferred. It exits with error and calls do_hard_reset. > See my reply to Suka's. I saw your reply, but it does not answer the question I asked. The patch would have us reinitialize and restart the communication queues. Your suggestion would do more work than that. Please describe the advantage in deferring the reinitialization - and yes, defer is the right word - by routing it through do_hard_reset() and executing that extra code. I see that route as doing more work than necessary and so introducing additional risk, for no clear advantage. So I would find it helpful if you would describe the advantage. > The testing was done on this patch. It was not performed on a full hard reset. > So I don't think you could even compare the two results. A problem has been noted, a patch has been proposed, and the reasoning that the patch is correct has been given. Testing with this patch has demonstrated the problem has not returned. So far, that sounds like a pretty reasonable solution. Your comment is curious - why would testing for this patch be done on a full hard reset when this patch does not invoke a full hard reset? If you have data to consider then let's have it. I'm willing to be convinced, but so far this just sounds like "I wouldn't do it that way myself, and I have a bad feeling about doing it any other way." Rick ^ permalink raw reply [flat|nested] 32+ messages in thread
* Re: [PATCH V2 net] ibmvnic: Continue with reset if set link down failed @ 2021-04-22 7:05 ` Rick Lindsley 0 siblings, 0 replies; 32+ messages in thread From: Rick Lindsley @ 2021-04-22 7:05 UTC (permalink / raw) To: Lijun Pan Cc: netdev, Lijun Pan, Tom Falcon, Paul Mackerras, Dany Madden, Jakub Kicinski, Sukadev Bhattiprolu, linuxppc-dev, David Miller On Wed, Apr 21, 2021 at 3:06 AM Rick Lindsley <ricklind@linux.vnet.ibm.com> wrote: >> Please describe the advantage in deferring it further by routing it through >> do_hard_reset(). I don't see one. On 4/21/21 10:12 PM, Lijun Pan replied: > It is not deferred. It exits with error and calls do_hard_reset. > See my reply to Suka's. I saw your reply, but it does not answer the question I asked. The patch would have us reinitialize and restart the communication queues. Your suggestion would do more work than that. Please describe the advantage in deferring the reinitialization - and yes, defer is the right word - by routing it through do_hard_reset() and executing that extra code. I see that route as doing more work than necessary and so introducing additional risk, for no clear advantage. So I would find it helpful if you would describe the advantage. > The testing was done on this patch. It was not performed on a full hard reset. > So I don't think you could even compare the two results. A problem has been noted, a patch has been proposed, and the reasoning that the patch is correct has been given. Testing with this patch has demonstrated the problem has not returned. So far, that sounds like a pretty reasonable solution. Your comment is curious - why would testing for this patch be done on a full hard reset when this patch does not invoke a full hard reset? If you have data to consider then let's have it. I'm willing to be convinced, but so far this just sounds like "I wouldn't do it that way myself, and I have a bad feeling about doing it any other way." Rick ^ permalink raw reply [flat|nested] 32+ messages in thread
* Re: [PATCH V2 net] ibmvnic: Continue with reset if set link down failed 2021-04-20 21:35 ` Dany Madden @ 2021-04-22 5:30 ` Lijun Pan -1 siblings, 0 replies; 32+ messages in thread From: Lijun Pan @ 2021-04-22 5:30 UTC (permalink / raw) To: Dany Madden Cc: David S. Miller, Jakub Kicinski, Sukadev Bhattiprolu, Thomas Falcon, Michael Ellerman, Benjamin Herrenschmidt, Paul Mackerras, netdev, linuxppc-dev On Tue, Apr 20, 2021 at 4:37 PM Dany Madden <drt@linux.ibm.com> wrote: > > When ibmvnic gets a FATAL error message from the vnicserver, it marks > the Command Respond Queue (CRQ) inactive and resets the adapter. If this > FATAL reset fails and a transmission timeout reset follows, the CRQ is > still inactive, ibmvnic's attempt to set link down will also fail. If > ibmvnic abandons the reset because of this failed set link down and this > is the last reset in the workqueue, then this adapter will be left in an > inoperable state. > > Instead, make the driver ignore this link down failure and continue to > free and re-register CRQ so that the adapter has an opportunity to > recover. > > Fixes: ed651a10875f ("ibmvnic: Updated reset handling") > Signed-off-by: Dany Madden <drt@linux.ibm.com> > Reviewed-by: Rick Lindsley <ricklind@linux.ibm.com> > Reviewed-by: Sukadev Bhattiprolu <sukadev@linux.ibm.com> One thing I would like to point out as already pointed out by Nathan Lynch is that those review-by tags given by the same groups of people from the same company loses credibility over time if you never critique or ask questions on the list. ^ permalink raw reply [flat|nested] 32+ messages in thread
* Re: [PATCH V2 net] ibmvnic: Continue with reset if set link down failed @ 2021-04-22 5:30 ` Lijun Pan 0 siblings, 0 replies; 32+ messages in thread From: Lijun Pan @ 2021-04-22 5:30 UTC (permalink / raw) To: Dany Madden Cc: Thomas Falcon, netdev, Paul Mackerras, Jakub Kicinski, Sukadev Bhattiprolu, linuxppc-dev, David S. Miller On Tue, Apr 20, 2021 at 4:37 PM Dany Madden <drt@linux.ibm.com> wrote: > > When ibmvnic gets a FATAL error message from the vnicserver, it marks > the Command Respond Queue (CRQ) inactive and resets the adapter. If this > FATAL reset fails and a transmission timeout reset follows, the CRQ is > still inactive, ibmvnic's attempt to set link down will also fail. If > ibmvnic abandons the reset because of this failed set link down and this > is the last reset in the workqueue, then this adapter will be left in an > inoperable state. > > Instead, make the driver ignore this link down failure and continue to > free and re-register CRQ so that the adapter has an opportunity to > recover. > > Fixes: ed651a10875f ("ibmvnic: Updated reset handling") > Signed-off-by: Dany Madden <drt@linux.ibm.com> > Reviewed-by: Rick Lindsley <ricklind@linux.ibm.com> > Reviewed-by: Sukadev Bhattiprolu <sukadev@linux.ibm.com> One thing I would like to point out as already pointed out by Nathan Lynch is that those review-by tags given by the same groups of people from the same company loses credibility over time if you never critique or ask questions on the list. ^ permalink raw reply [flat|nested] 32+ messages in thread
* Re: [PATCH V2 net] ibmvnic: Continue with reset if set link down failed 2021-04-22 5:30 ` Lijun Pan @ 2021-04-22 7:07 ` Rick Lindsley -1 siblings, 0 replies; 32+ messages in thread From: Rick Lindsley @ 2021-04-22 7:07 UTC (permalink / raw) To: Lijun Pan, Dany Madden Cc: David S. Miller, Jakub Kicinski, Sukadev Bhattiprolu, Thomas Falcon, Michael Ellerman, Benjamin Herrenschmidt, Paul Mackerras, netdev, linuxppc-dev On 4/21/21 10:30 PM, Lijun Pan wrote: >> Fixes: ed651a10875f ("ibmvnic: Updated reset handling") >> Signed-off-by: Dany Madden <drt@linux.ibm.com> >> Reviewed-by: Rick Lindsley <ricklind@linux.ibm.com> >> Reviewed-by: Sukadev Bhattiprolu <sukadev@linux.ibm.com> > > One thing I would like to point out as already pointed out by Nathan Lynch is > that those review-by tags given by the same groups of people from the same > company loses credibility over time if you never critique or ask > questions on the list. > Well, so far you aren't addressing either my critiques or questions. I have been asking questions but all I have from you are the above attempts to discredit the reputation of myself and other people, and non-technical statements like will make the code very difficult to manage I think there should be a trade off between optimization and stability. So I don't think you could even compare the two results On the other hand, from the original submission I see some very specific details: If ibmvnic abandons the reset because of this failed set link down and this is the last reset in the workqueue, then this adapter will be left in an inoperable state. and from a followup discussion: We had a FATAL error and when handling it, we failed to send a link-down message to the VIOS. So what we need to try next is to reset the connection with the VIOS. For this we must ... These are great technical points that could be argued or discussed. Problem is, I agree with them. I will ask again: can you please supply some technical reasons for your objections. Otherwise, your objections are meritless and at worst simply an ad hominem attack. Rick ^ permalink raw reply [flat|nested] 32+ messages in thread
* Re: [PATCH V2 net] ibmvnic: Continue with reset if set link down failed @ 2021-04-22 7:07 ` Rick Lindsley 0 siblings, 0 replies; 32+ messages in thread From: Rick Lindsley @ 2021-04-22 7:07 UTC (permalink / raw) To: Lijun Pan, Dany Madden Cc: Thomas Falcon, netdev, Paul Mackerras, Jakub Kicinski, Sukadev Bhattiprolu, linuxppc-dev, David S. Miller On 4/21/21 10:30 PM, Lijun Pan wrote: >> Fixes: ed651a10875f ("ibmvnic: Updated reset handling") >> Signed-off-by: Dany Madden <drt@linux.ibm.com> >> Reviewed-by: Rick Lindsley <ricklind@linux.ibm.com> >> Reviewed-by: Sukadev Bhattiprolu <sukadev@linux.ibm.com> > > One thing I would like to point out as already pointed out by Nathan Lynch is > that those review-by tags given by the same groups of people from the same > company loses credibility over time if you never critique or ask > questions on the list. > Well, so far you aren't addressing either my critiques or questions. I have been asking questions but all I have from you are the above attempts to discredit the reputation of myself and other people, and non-technical statements like will make the code very difficult to manage I think there should be a trade off between optimization and stability. So I don't think you could even compare the two results On the other hand, from the original submission I see some very specific details: If ibmvnic abandons the reset because of this failed set link down and this is the last reset in the workqueue, then this adapter will be left in an inoperable state. and from a followup discussion: We had a FATAL error and when handling it, we failed to send a link-down message to the VIOS. So what we need to try next is to reset the connection with the VIOS. For this we must ... These are great technical points that could be argued or discussed. Problem is, I agree with them. I will ask again: can you please supply some technical reasons for your objections. Otherwise, your objections are meritless and at worst simply an ad hominem attack. Rick ^ permalink raw reply [flat|nested] 32+ messages in thread
* Re: [PATCH V2 net] ibmvnic: Continue with reset if set link down failed 2021-04-22 7:07 ` Rick Lindsley @ 2021-04-22 17:01 ` Lijun Pan -1 siblings, 0 replies; 32+ messages in thread From: Lijun Pan @ 2021-04-22 17:01 UTC (permalink / raw) To: Rick Lindsley Cc: Dany Madden, David S. Miller, Jakub Kicinski, Sukadev Bhattiprolu, Thomas Falcon, Michael Ellerman, Benjamin Herrenschmidt, Paul Mackerras, netdev, linuxppc-dev On Thu, Apr 22, 2021 at 2:07 AM Rick Lindsley <ricklind@linux.vnet.ibm.com> wrote: > > On 4/21/21 10:30 PM, Lijun Pan wrote: > >> Fixes: ed651a10875f ("ibmvnic: Updated reset handling") > >> Signed-off-by: Dany Madden <drt@linux.ibm.com> > >> Reviewed-by: Rick Lindsley <ricklind@linux.ibm.com> > >> Reviewed-by: Sukadev Bhattiprolu <sukadev@linux.ibm.com> > > > > One thing I would like to point out as already pointed out by Nathan Lynch is > > that those review-by tags given by the same groups of people from the same > > company loses credibility over time if you never critique or ask > > questions on the list. > > > > Well, so far you aren't addressing either my critiques or questions. > > I have been asking questions but all I have from you are the above > attempts to discredit the reputation of myself and other people, and > non-technical statements like > > will make the code very difficult to manage > I think there should be a trade off between optimization and stability. > So I don't think you could even compare the two results > > On the other hand, from the original submission I see some very specific > details: > > If ibmvnic abandons the reset because of this failed set link > down and this is the last reset in the workqueue, then this > adapter will be left in an inoperable state. > > and from a followup discussion: > > We had a FATAL error and when handling it, we failed to > send a link-down message to the VIOS. So what we need > to try next is to reset the connection with the VIOS. For > this we must ... > > These are great technical points that could be argued or discussed. > Problem is, I agree with them. > > I will ask again: can you please supply some technical reasons for > your objections. Otherwise, your objections are meritless and at worst > simply an ad hominem attack. Well, from the beginning of v1, I started to provide technical inputs. Then I was not allowed to post anything in the community about this patch and VNIC via ljp@linux.ibm.com except giving an ack-by/reviewed-by. ^ permalink raw reply [flat|nested] 32+ messages in thread
* Re: [PATCH V2 net] ibmvnic: Continue with reset if set link down failed @ 2021-04-22 17:01 ` Lijun Pan 0 siblings, 0 replies; 32+ messages in thread From: Lijun Pan @ 2021-04-22 17:01 UTC (permalink / raw) To: Rick Lindsley Cc: Thomas Falcon, netdev, Paul Mackerras, Dany Madden, Jakub Kicinski, Sukadev Bhattiprolu, linuxppc-dev, David S. Miller On Thu, Apr 22, 2021 at 2:07 AM Rick Lindsley <ricklind@linux.vnet.ibm.com> wrote: > > On 4/21/21 10:30 PM, Lijun Pan wrote: > >> Fixes: ed651a10875f ("ibmvnic: Updated reset handling") > >> Signed-off-by: Dany Madden <drt@linux.ibm.com> > >> Reviewed-by: Rick Lindsley <ricklind@linux.ibm.com> > >> Reviewed-by: Sukadev Bhattiprolu <sukadev@linux.ibm.com> > > > > One thing I would like to point out as already pointed out by Nathan Lynch is > > that those review-by tags given by the same groups of people from the same > > company loses credibility over time if you never critique or ask > > questions on the list. > > > > Well, so far you aren't addressing either my critiques or questions. > > I have been asking questions but all I have from you are the above > attempts to discredit the reputation of myself and other people, and > non-technical statements like > > will make the code very difficult to manage > I think there should be a trade off between optimization and stability. > So I don't think you could even compare the two results > > On the other hand, from the original submission I see some very specific > details: > > If ibmvnic abandons the reset because of this failed set link > down and this is the last reset in the workqueue, then this > adapter will be left in an inoperable state. > > and from a followup discussion: > > We had a FATAL error and when handling it, we failed to > send a link-down message to the VIOS. So what we need > to try next is to reset the connection with the VIOS. For > this we must ... > > These are great technical points that could be argued or discussed. > Problem is, I agree with them. > > I will ask again: can you please supply some technical reasons for > your objections. Otherwise, your objections are meritless and at worst > simply an ad hominem attack. Well, from the beginning of v1, I started to provide technical inputs. Then I was not allowed to post anything in the community about this patch and VNIC via ljp@linux.ibm.com except giving an ack-by/reviewed-by. ^ permalink raw reply [flat|nested] 32+ messages in thread
end of thread, other threads:[~2021-05-04 19:06 UTC | newest] Thread overview: 32+ messages (download: mbox.gz / follow: Atom feed) -- links below jump to the message on this page -- 2021-04-20 21:35 [PATCH V2 net] ibmvnic: Continue with reset if set link down failed Dany Madden 2021-04-20 21:35 ` Dany Madden 2021-04-20 21:42 ` Lijun Pan 2021-04-20 21:42 ` Lijun Pan 2021-04-21 6:45 ` Sukadev Bhattiprolu 2021-04-21 6:45 ` Sukadev Bhattiprolu 2021-04-22 5:06 ` Lijun Pan 2021-04-22 5:06 ` Lijun Pan 2021-04-22 6:58 ` Sukadev Bhattiprolu 2021-04-22 6:58 ` Sukadev Bhattiprolu 2021-04-22 7:05 ` Rick Lindsley 2021-04-22 7:05 ` Rick Lindsley 2021-04-22 17:21 ` Michal Suchánek 2021-04-22 17:21 ` Michal Suchánek 2021-04-22 17:38 ` Lijun Pan 2021-04-22 17:38 ` Lijun Pan 2021-04-23 2:26 ` Rick Lindsley 2021-04-23 2:26 ` Rick Lindsley 2021-05-04 19:05 ` Dany Madden 2021-05-04 19:05 ` Dany Madden 2021-04-21 7:54 ` Rick Lindsley 2021-04-21 7:54 ` Rick Lindsley 2021-04-22 5:12 ` Lijun Pan 2021-04-22 5:12 ` Lijun Pan 2021-04-22 7:05 ` Rick Lindsley 2021-04-22 7:05 ` Rick Lindsley 2021-04-22 5:30 ` Lijun Pan 2021-04-22 5:30 ` Lijun Pan 2021-04-22 7:07 ` Rick Lindsley 2021-04-22 7:07 ` Rick Lindsley 2021-04-22 17:01 ` Lijun Pan 2021-04-22 17:01 ` Lijun Pan
This is an external index of several public inboxes, see mirroring instructions on how to clone and mirror all data and code used by this external index.