LinuxPPC-Dev Archive on lore.kernel.org
 help / color / Atom feed
From: Lijun Pan <lijunp213@gmail.com>
To: Sukadev Bhattiprolu <sukadev@linux.ibm.com>
Cc: netdev@vger.kernel.org, Lijun Pan <ljp@linux.vnet.ibm.com>,
	Tom Falcon <tlfalcon@linux.ibm.com>,
	Paul Mackerras <paulus@samba.org>,
	Dany Madden <drt@linux.ibm.com>, Jakub Kicinski <kuba@kernel.org>,
	linuxppc-dev@lists.ozlabs.org, David Miller <davem@davemloft.net>
Subject: Re: [PATCH V2 net] ibmvnic: Continue with reset if set link down failed
Date: Thu, 22 Apr 2021 00:06:45 -0500
Message-ID: <CAOhMmr4ckVFTZtSeHFHNgGPUA12xYO8WcUoakx7WdwQfSKBJhA@mail.gmail.com> (raw)
In-Reply-To: <20210421064527.GA2648262@us.ibm.com>

On Wed, Apr 21, 2021 at 2:25 AM Sukadev Bhattiprolu
<sukadev@linux.ibm.com> wrote:
>
> Lijun Pan [ljp@linux.vnet.ibm.com] wrote:
> >
> >
> > > On Apr 20, 2021, at 4:35 PM, Dany Madden <drt@linux.ibm.com> wrote:
> > >
> > > When ibmvnic gets a FATAL error message from the vnicserver, it marks
> > > the Command Respond Queue (CRQ) inactive and resets the adapter. If this
> > > FATAL reset fails and a transmission timeout reset follows, the CRQ is
> > > still inactive, ibmvnic's attempt to set link down will also fail. If
> > > ibmvnic abandons the reset because of this failed set link down and this
> > > is the last reset in the workqueue, then this adapter will be left in an
> > > inoperable state.
> > >
> > > Instead, make the driver ignore this link down failure and continue to
> > > free and re-register CRQ so that the adapter has an opportunity to
> > > recover.
> >
> > This v2 does not adddress the concerns mentioned in v1.
> > And I think it is better to exit with error from do_reset, and schedule a thorough
> > do_hard_reset if the the adapter is already in unstable state.
>
> We had a FATAL error and when handling it, we failed to send a
> link-down message to the VIOS. So what we need to try next is to
> reset the connection with the VIOS. For this we must talk to the
> firmware using the H_FREE_CRQ and H_REG_CRQ hcalls. do_reset()
> does just that in ibmvnic_reset_crq().
>
> Now, sure we can attempt a "thorough hard reset" which also does
> the same hcalls to reestablish the connection. Is there any
> other magic in do_hard_reset()? But in addition, it also frees lot
> more Linux kernel buffers and reallocates them for instance.

Working around everything in do_reset will make the code very difficult
to manage. Ultimately do_reset can do anything I am afraid, and do_hard_reset
can be removed completely or merged into do_reset.

>
> If we are having a communication problem with the VIOS, what is
> the point of freeing and reallocating Linux kernel buffers? Beside
> being inefficient, it would expose us to even more errors during
> reset under heavy workloads?

No real customer runs the system under that heavy load created by
HTX stress test, which can tear down any working system.

>
> From what I understand so far, do_reset() is complicated because
> it is attempting some optimizations.  If we are going to fall back
> to hard reset for every error we might as well drop the do_reset()
> and just do the "thorough hard reset" every time right?

I think such optimizations are catered for passing HTX tests. Whether
the optimization benefits the adapter, say making the adapter more stable,
I doubt it. I think there should be a trade off between optimization
and stability.

  reply index

Thread overview: 16+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2021-04-20 21:35 Dany Madden
2021-04-20 21:42 ` Lijun Pan
2021-04-21  6:45   ` Sukadev Bhattiprolu
2021-04-22  5:06     ` Lijun Pan [this message]
2021-04-22  6:58       ` Sukadev Bhattiprolu
2021-04-22  7:05       ` Rick Lindsley
2021-04-22 17:21       ` Michal Suchánek
2021-04-22 17:38         ` Lijun Pan
2021-04-23  2:26         ` Rick Lindsley
2021-05-04 19:05           ` Dany Madden
2021-04-21  7:54   ` Rick Lindsley
2021-04-22  5:12     ` Lijun Pan
2021-04-22  7:05       ` Rick Lindsley
2021-04-22  5:30 ` Lijun Pan
2021-04-22  7:07   ` Rick Lindsley
2021-04-22 17:01     ` Lijun Pan

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=CAOhMmr4ckVFTZtSeHFHNgGPUA12xYO8WcUoakx7WdwQfSKBJhA@mail.gmail.com \
    --to=lijunp213@gmail.com \
    --cc=davem@davemloft.net \
    --cc=drt@linux.ibm.com \
    --cc=kuba@kernel.org \
    --cc=linuxppc-dev@lists.ozlabs.org \
    --cc=ljp@linux.vnet.ibm.com \
    --cc=netdev@vger.kernel.org \
    --cc=paulus@samba.org \
    --cc=sukadev@linux.ibm.com \
    --cc=tlfalcon@linux.ibm.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

LinuxPPC-Dev Archive on lore.kernel.org

Archives are clonable:
	git clone --mirror https://lore.kernel.org/linuxppc-dev/0 linuxppc-dev/git/0.git

	# If you have public-inbox 1.1+ installed, you may
	# initialize and index your mirror using the following commands:
	public-inbox-init -V2 linuxppc-dev linuxppc-dev/ https://lore.kernel.org/linuxppc-dev \
		linuxppc-dev@lists.ozlabs.org linuxppc-dev@ozlabs.org
	public-inbox-index linuxppc-dev

Example config snippet for mirrors

Newsgroup available over NNTP:
	nntp://nntp.lore.kernel.org/org.ozlabs.lists.linuxppc-dev


AGPL code for this site: git clone https://public-inbox.org/public-inbox.git