LinuxPPC-Dev Archive on lore.kernel.org
 help / color / Atom feed
From: Sukadev Bhattiprolu <sukadev@linux.ibm.com>
To: Lijun Pan <ljp@linux.vnet.ibm.com>
Cc: netdev@vger.kernel.org, Tom Falcon <tlfalcon@linux.ibm.com>,
	paulus@samba.org, Dany Madden <drt@linux.ibm.com>,
	Jakub Kicinski <kuba@kernel.org>,
	linuxppc-dev@lists.ozlabs.org, David Miller <davem@davemloft.net>
Subject: Re: [PATCH V2 net] ibmvnic: Continue with reset if set link down failed
Date: Tue, 20 Apr 2021 23:45:27 -0700
Message-ID: <20210421064527.GA2648262@us.ibm.com> (raw)
In-Reply-To: <60C99F56-617D-455B-9ACF-8CE1EED64D92@linux.vnet.ibm.com>

Lijun Pan [ljp@linux.vnet.ibm.com] wrote:
> 
> 
> > On Apr 20, 2021, at 4:35 PM, Dany Madden <drt@linux.ibm.com> wrote:
> > 
> > When ibmvnic gets a FATAL error message from the vnicserver, it marks
> > the Command Respond Queue (CRQ) inactive and resets the adapter. If this
> > FATAL reset fails and a transmission timeout reset follows, the CRQ is
> > still inactive, ibmvnic's attempt to set link down will also fail. If
> > ibmvnic abandons the reset because of this failed set link down and this
> > is the last reset in the workqueue, then this adapter will be left in an
> > inoperable state.
> > 
> > Instead, make the driver ignore this link down failure and continue to
> > free and re-register CRQ so that the adapter has an opportunity to
> > recover.
> 
> This v2 does not adddress the concerns mentioned in v1.
> And I think it is better to exit with error from do_reset, and schedule a thorough
> do_hard_reset if the the adapter is already in unstable state.

We had a FATAL error and when handling it, we failed to send a 
link-down message to the VIOS. So what we need to try next is to 
reset the connection with the VIOS. For this we must talk to the 
firmware using the H_FREE_CRQ and H_REG_CRQ hcalls. do_reset()
does just that in ibmvnic_reset_crq().

Now, sure we can attempt a "thorough hard reset" which also does
the same hcalls to reestablish the connection. Is there any
other magic in do_hard_reset()? But in addition, it also frees lot
more Linux kernel buffers and reallocates them for instance.

If we are having a communication problem with the VIOS, what is
the point of freeing and reallocating Linux kernel buffers? Beside
being inefficient, it would expose us to even more errors during
reset under heavy workloads?

From what I understand so far, do_reset() is complicated because
it is attempting some optimizations.  If we are going to fall back
to hard reset for every error we might as well drop the do_reset()
and just do the "thorough hard reset" every time right?

The protocol spec is ambiguous and so far I did not get a clear
answer on whether the link-down is even needed. If it is needed,
then should we add it to do_hard_reset() also? If not, we should
remove it (like you mentioned your earlier) completely but am
waiting for confirmation on that. git history has not been helpful.

While there are other rough edges around do_reset() that we are
working on fixing separately (eg: ignore the error return from 
__ibmvnic_close() right above this change) I see a benefit to
the customer with this patch.

I am not convinced we should perform a hard reset just because
the link down failed.

Sukadev

  reply index

Thread overview: 16+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2021-04-20 21:35 Dany Madden
2021-04-20 21:42 ` Lijun Pan
2021-04-21  6:45   ` Sukadev Bhattiprolu [this message]
2021-04-22  5:06     ` Lijun Pan
2021-04-22  6:58       ` Sukadev Bhattiprolu
2021-04-22  7:05       ` Rick Lindsley
2021-04-22 17:21       ` Michal Suchánek
2021-04-22 17:38         ` Lijun Pan
2021-04-23  2:26         ` Rick Lindsley
2021-05-04 19:05           ` Dany Madden
2021-04-21  7:54   ` Rick Lindsley
2021-04-22  5:12     ` Lijun Pan
2021-04-22  7:05       ` Rick Lindsley
2021-04-22  5:30 ` Lijun Pan
2021-04-22  7:07   ` Rick Lindsley
2021-04-22 17:01     ` Lijun Pan

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20210421064527.GA2648262@us.ibm.com \
    --to=sukadev@linux.ibm.com \
    --cc=davem@davemloft.net \
    --cc=drt@linux.ibm.com \
    --cc=kuba@kernel.org \
    --cc=linuxppc-dev@lists.ozlabs.org \
    --cc=ljp@linux.vnet.ibm.com \
    --cc=netdev@vger.kernel.org \
    --cc=paulus@samba.org \
    --cc=tlfalcon@linux.ibm.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

LinuxPPC-Dev Archive on lore.kernel.org

Archives are clonable:
	git clone --mirror https://lore.kernel.org/linuxppc-dev/0 linuxppc-dev/git/0.git

	# If you have public-inbox 1.1+ installed, you may
	# initialize and index your mirror using the following commands:
	public-inbox-init -V2 linuxppc-dev linuxppc-dev/ https://lore.kernel.org/linuxppc-dev \
		linuxppc-dev@lists.ozlabs.org linuxppc-dev@ozlabs.org
	public-inbox-index linuxppc-dev

Example config snippet for mirrors

Newsgroup available over NNTP:
	nntp://nntp.lore.kernel.org/org.ozlabs.lists.linuxppc-dev


AGPL code for this site: git clone https://public-inbox.org/public-inbox.git