[PATCH net] ibmvnic: continue fatal error reset after passive init


* [PATCH net] ibmvnic: continue fatal error reset after passive init
@ 2020-12-19 21:40 Lijun Pan
  2020-12-23  2:46 ` Jakub Kicinski
  0 siblings, 1 reply; 6+ messages in thread
From: Lijun Pan @ 2020-12-19 21:40 UTC (permalink / raw)
  To: netdev; +Cc: Lijun Pan

Commit f9c6cea0b385 ("ibmvnic: Skip fatal error reset after passive init")
says "If the passive CRQ initialization occurs before the FATAL reset task
is processed, the FATAL error reset task would try to access a CRQ message
queue that was freed, causing an oops. The problem may be most likely to
occur during DLPAR add vNIC with a non-default MTU, because the DLPAR
process will automatically issue a change MTU request. Fix this by not
processing fatal error reset if CRQ is passively initialized after
client-driven CRQ initialization fails."

Even with this commit, we still see similar kernel crashes. In order
to completely solve this problem, we'd better continue the fatal error
reset, capture the kernel crash, and try to fix it from that end.

Fixes: f9c6cea0b385 ("ibmvnic: Skip fatal error reset after passive init")
Signed-off-by: Lijun Pan <ljp@linux.ibm.com>
---
 drivers/net/ethernet/ibm/ibmvnic.c | 3 +--
 1 file changed, 1 insertion(+), 2 deletions(-)

diff --git a/drivers/net/ethernet/ibm/ibmvnic.c b/drivers/net/ethernet/ibm/ibmvnic.c
index b370c88a43f1..237a36040689 100644
--- a/drivers/net/ethernet/ibm/ibmvnic.c
+++ b/drivers/net/ethernet/ibm/ibmvnic.c
@@ -2342,8 +2342,7 @@ static void __ibmvnic_reset(struct work_struct *work)
 				set_current_state(TASK_UNINTERRUPTIBLE);
 				schedule_timeout(60 * HZ);
 			}
-		} else if (!(rwi->reset_reason == VNIC_RESET_FATAL &&
-				adapter->from_passive_init)) {
+		} else {
 			rc = do_reset(adapter, rwi, reset_state);
 		}
 		kfree(rwi);
-- 
2.23.0


* Re: [PATCH net] ibmvnic: continue fatal error reset after passive init
  2020-12-19 21:40 [PATCH net] ibmvnic: continue fatal error reset after passive init Lijun Pan
@ 2020-12-23  2:46 ` Jakub Kicinski
  2020-12-23  8:21   ` Lijun Pan
  0 siblings, 1 reply; 6+ messages in thread
From: Jakub Kicinski @ 2020-12-23  2:46 UTC (permalink / raw)
  To: Lijun Pan; +Cc: netdev

On Sat, 19 Dec 2020 15:40:34 -0600 Lijun Pan wrote:
> Commit f9c6cea0b385 ("ibmvnic: Skip fatal error reset after passive init")
> says "If the passive
> CRQ initialization occurs before the FATAL reset task is processed,
> the FATAL error reset task would try to access a CRQ message queue
> that was freed, causing an oops. The problem may be most likely to
> occur during DLPAR add vNIC with a non-default MTU, because the DLPAR
> process will automatically issue a change MTU request.
> Fix this by not processing fatal error reset if CRQ is passively
> initialized after client-driven CRQ initialization fails."
> 
> Even with this commit, we still see similar kernel crashes. In order
> to completely solve this problem, we'd better continue the fatal error
> reset, capture the kernel crash, and try to fix it from that end.

This basically reverts the quoted fix. Does the quoted fix make things
worse? Otherwise we should leave the code be until proper fix is found.


* Re: [PATCH net] ibmvnic: continue fatal error reset after passive init
  2020-12-23  2:46 ` Jakub Kicinski
@ 2020-12-23  8:21   ` Lijun Pan
  2020-12-23 16:50     ` Jakub Kicinski
  0 siblings, 1 reply; 6+ messages in thread
From: Lijun Pan @ 2020-12-23  8:21 UTC (permalink / raw)
  To: Jakub Kicinski; +Cc: Lijun Pan, netdev

On Tue, Dec 22, 2020 at 8:48 PM Jakub Kicinski <kuba@kernel.org> wrote:
>
> On Sat, 19 Dec 2020 15:40:34 -0600 Lijun Pan wrote:
> > Commit f9c6cea0b385 ("ibmvnic: Skip fatal error reset after passive init")
> > says "If the passive
> > CRQ initialization occurs before the FATAL reset task is processed,
> > the FATAL error reset task would try to access a CRQ message queue
> > that was freed, causing an oops. The problem may be most likely to
> > occur during DLPAR add vNIC with a non-default MTU, because the DLPAR
> > process will automatically issue a change MTU request.
> > Fix this by not processing fatal error reset if CRQ is passively
> > initialized after client-driven CRQ initialization fails."
> >
> > Even with this commit, we still see similar kernel crashes. In order
> > to completely solve this problem, we'd better continue the fatal error
> > reset, capture the kernel crash, and try to fix it from that end.
>
> This basically reverts the quoted fix. Does the quoted fix make things
> worse? Otherwise we should leave the code be until proper fix is found.

Yes, I think the quoted commit makes things worse. It skips the specific
reset condition, but that does not fix the problem it claims to fix.
The effective fix is upstream SHA 0e435befaea4 and a0faaa27c716. So I
think reverting it to the original "else" condition is the right thing to do.


* Re: [PATCH net] ibmvnic: continue fatal error reset after passive init
  2020-12-23  8:21   ` Lijun Pan
@ 2020-12-23 16:50     ` Jakub Kicinski
  2020-12-23 20:10       ` Lijun Pan
  0 siblings, 1 reply; 6+ messages in thread
From: Jakub Kicinski @ 2020-12-23 16:50 UTC (permalink / raw)
  To: Lijun Pan; +Cc: Lijun Pan, netdev

On Wed, 23 Dec 2020 02:21:09 -0600 Lijun Pan wrote:
> On Tue, Dec 22, 2020 at 8:48 PM Jakub Kicinski <kuba@kernel.org> wrote:
> > On Sat, 19 Dec 2020 15:40:34 -0600 Lijun Pan wrote:  
> > > Commit f9c6cea0b385 ("ibmvnic: Skip fatal error reset after passive init")
> > > says "If the passive
> > > CRQ initialization occurs before the FATAL reset task is processed,
> > > the FATAL error reset task would try to access a CRQ message queue
> > > that was freed, causing an oops. The problem may be most likely to
> > > occur during DLPAR add vNIC with a non-default MTU, because the DLPAR
> > > process will automatically issue a change MTU request.
> > > Fix this by not processing fatal error reset if CRQ is passively
> > > initialized after client-driven CRQ initialization fails."
> > >
> > > Even with this commit, we still see similar kernel crashes. In order
> > > to completely solve this problem, we'd better continue the fatal error
> > > reset, capture the kernel crash, and try to fix it from that end.  
> >
> > This basically reverts the quoted fix. Does the quoted fix make things
> > worse? Otherwise we should leave the code be until proper fix is found.  
> 
> Yes, I think the quoted commit makes things worse. It skips the specific
> reset condition, but that does not fix the problem it claims to fix.

Okay, let's make sure the commit message explains how it makes things
worse.

> The effective fix is upstream SHA 0e435befaea4 and a0faaa27c716. So I
> think reverting it to the original "else" condition is the right thing to do.

Hm. So the problem is fixed? But the commit message says "we still see
similar kernel crashes", that's present tense suggesting that crashes 
are seen on current net/master. Are you saying that's not the case and
after 0e435befaea4 and a0faaa27c716 there are no more crashes?


* Re: [PATCH net] ibmvnic: continue fatal error reset after passive init
  2020-12-23 16:50     ` Jakub Kicinski
@ 2020-12-23 20:10       ` Lijun Pan
  2020-12-23 20:24         ` Jakub Kicinski
  0 siblings, 1 reply; 6+ messages in thread
From: Lijun Pan @ 2020-12-23 20:10 UTC (permalink / raw)
  To: Jakub Kicinski; +Cc: Lijun Pan, netdev

On Wed, Dec 23, 2020 at 10:50 AM Jakub Kicinski <kuba@kernel.org> wrote:
>
> On Wed, 23 Dec 2020 02:21:09 -0600 Lijun Pan wrote:
> > On Tue, Dec 22, 2020 at 8:48 PM Jakub Kicinski <kuba@kernel.org> wrote:
> > > On Sat, 19 Dec 2020 15:40:34 -0600 Lijun Pan wrote:
> > > > Commit f9c6cea0b385 ("ibmvnic: Skip fatal error reset after passive init")
> > > > says "If the passive
> > > > CRQ initialization occurs before the FATAL reset task is processed,
> > > > the FATAL error reset task would try to access a CRQ message queue
> > > > that was freed, causing an oops. The problem may be most likely to
> > > > occur during DLPAR add vNIC with a non-default MTU, because the DLPAR
> > > > process will automatically issue a change MTU request.
> > > > Fix this by not processing fatal error reset if CRQ is passively
> > > > initialized after client-driven CRQ initialization fails."
> > > >
> > > > Even with this commit, we still see similar kernel crashes. In order
> > > > to completely solve this problem, we'd better continue the fatal error
> > > > reset, capture the kernel crash, and try to fix it from that end.
> > >
> > > This basically reverts the quoted fix. Does the quoted fix make things
> > > worse? Otherwise we should leave the code be until proper fix is found.
> >
> > Yes, I think the quoted commit makes things worse. It skips the specific
> > reset condition, but that does not fix the problem it claims to fix.
>
> Okay, let's make sure the commit message explains how it makes things
> worse.

I will reword the commit message.

>
> > The effective fix is upstream SHA 0e435befaea4 and a0faaa27c716. So I
> > think reverting it to the original "else" condition is the right thing to do.
>
> Hm. So the problem is fixed? But the commit message says "we still see
> similar kernel crashes", that's present tense suggesting that crashes
> are seen on current net/master. Are you saying that's not the case and
> after 0e435befaea4 and a0faaa27c716 there are no more crashes?

This patch was formed before I submitted 0e435befaea4 and a0faaa27c716, so
I used the wording "we still see similar kernel crashes". I will modify
the commit message before I submit v2 of this patch.
After 0e435befaea4 and a0faaa27c716, I no longer see the crashes described
in the quoted commit, even with the quoted commit reverted.
That is why I am sure the quoted commit does not fix the problem it
describes, and why I want to revert it.


* Re: [PATCH net] ibmvnic: continue fatal error reset after passive init
  2020-12-23 20:10       ` Lijun Pan
@ 2020-12-23 20:24         ` Jakub Kicinski
  0 siblings, 0 replies; 6+ messages in thread
From: Jakub Kicinski @ 2020-12-23 20:24 UTC (permalink / raw)
  To: Lijun Pan; +Cc: Lijun Pan, netdev

On Wed, 23 Dec 2020 14:10:32 -0600 Lijun Pan wrote:
> On Wed, Dec 23, 2020 at 10:50 AM Jakub Kicinski <kuba@kernel.org> wrote:
> >
> > On Wed, 23 Dec 2020 02:21:09 -0600 Lijun Pan wrote:  
> > > On Tue, Dec 22, 2020 at 8:48 PM Jakub Kicinski <kuba@kernel.org> wrote:  
> > > > On Sat, 19 Dec 2020 15:40:34 -0600 Lijun Pan wrote:  
> > > > > Commit f9c6cea0b385 ("ibmvnic: Skip fatal error reset after passive init")
> > > > > says "If the passive
> > > > > CRQ initialization occurs before the FATAL reset task is processed,
> > > > > the FATAL error reset task would try to access a CRQ message queue
> > > > > that was freed, causing an oops. The problem may be most likely to
> > > > > occur during DLPAR add vNIC with a non-default MTU, because the DLPAR
> > > > > process will automatically issue a change MTU request.
> > > > > Fix this by not processing fatal error reset if CRQ is passively
> > > > > initialized after client-driven CRQ initialization fails."
> > > > >
> > > > > Even with this commit, we still see similar kernel crashes. In order
> > > > > to completely solve this problem, we'd better continue the fatal error
> > > > > reset, capture the kernel crash, and try to fix it from that end.  
> > > >
> > > > This basically reverts the quoted fix. Does the quoted fix make things
> > > > worse? Otherwise we should leave the code be until proper fix is found.  
> > >
> > > Yes, I think the quoted commit makes things worse. It skips the specific
> > > reset condition, but that does not fix the problem it claims to fix.  
> >
> > Okay, let's make sure the commit message explains how it makes things
> > worse.  
> 
> I will reword the commit message.
> 
> > > The effective fix is upstream SHA 0e435befaea4 and a0faaa27c716. So I
> > > think reverting it to the original "else" condition is the right thing to do.  
> >
> > Hm. So the problem is fixed? But the commit message says "we still see
> > similar kernel crashes", that's present tense suggesting that crashes
> > are seen on current net/master. Are you saying that's not the case and
> > after 0e435befaea4 and a0faaa27c716 there are no more crashes?  
> 
> This patch was formed before I submitted 0e435befaea4 and a0faaa27c716, so
> I used the wording "we still see similar kernel crashes". I will modify
> the commit message before I submit v2 of this patch.
> After 0e435befaea4 and a0faaa27c716, I don't see any crashes as described
> in this quoted commit even without this quoted commit.
> That's why I am sure this quoted commit does not fix the described problem
> and I want to revert it.

I see, that explains it!

