Re: [PATCH] platform/chrome: cros_ec_proto: Send command again when timeout occurs

From: Greg KH <greg@kroah.com>
To: Patryk Duda <pdk@semihalf.com>
Cc: Benson Leung <bleung@chromium.org>,
	Guenter Roeck <groeck@chromium.org>,
	linux-kernel@vger.kernel.org, upstream@semihalf.com,
	stable@vger.kernel.org
Subject: Re: [PATCH] platform/chrome: cros_ec_proto: Send command again when timeout occurs
Date: Tue, 18 May 2021 13:34:00 +0200	[thread overview]
Message-ID: <YKOmKHPCPx3FuTV3@kroah.com> (raw)
In-Reply-To: <CAGOBvLq8hyRs7MZoZE0o0h0s9y1bL9eV3ex2A3FsmfjtofxsAg@mail.gmail.com>

On Tue, May 18, 2021 at 01:22:30PM +0200, Patryk Duda wrote:
> wt., 18 maj 2021 o 11:29 Greg KH <greg@kroah.com> napisał(a):
> >
> > On Tue, May 18, 2021 at 11:09:25AM +0200, Patryk Duda wrote:
> > > Sometimes kernel is trying to probe Fingerprint MCU (FPMCU) when it
> > > hasn't initialized SPI yet. This can happen because FPMCU is restarted
> > > during system boot and kernel can send message in short window
> > > eg. between sysjump to RW and SPI initialization.
> > >
> > > Cc: <stable@vger.kernel.org> # 4.4+
> > > Signed-off-by: Patryk Duda <pdk@semihalf.com>
> > > ---
> > > Fingerprint MCU is rebooted during system startup by AP firmware (coreboot).
> > > During cold boot kernel can query FPMCU in a window just after jump to RW
> > > section of firmware but before SPI is initialized. The window was
> > > shortened to <1ms, but it can't be eliminated completly.
> > >
> > > Communication with FPMCU (and all devices based on EC) is bi-directional.
> > > When kernel sends message, EC will send EC_SPI* status codes. When EC is
> > > not able to process command one of bytes will be eg. EC_SPI_NOT_READY.
> > > This mechanism won't work when SPI is not initailized on EC side. In fact,
> > > buffer is filled with 0xFF bytes, so from kernel perspective device is not
> > > responding. To avoid this problem, we can query device once again. We are
> > > already waiting EC_MSG_DEADLINE_MS for response, so we can send command
> > > immediately.
> > >
> > > Best regards,
> > > Patryk
> > >  drivers/platform/chrome/cros_ec_proto.c | 12 ++++++++++++
> > >  1 file changed, 12 insertions(+)
> > >
> > > diff --git a/drivers/platform/chrome/cros_ec_proto.c b/drivers/platform/chrome/cros_ec_proto.c
> > > index aa7f7aa77297..3384631d21e2 100644
> > > --- a/drivers/platform/chrome/cros_ec_proto.c
> > > +++ b/drivers/platform/chrome/cros_ec_proto.c
> > > @@ -279,6 +279,18 @@ static int cros_ec_host_command_proto_query(struct cros_ec_device *ec_dev,
> > >       msg->insize = sizeof(struct ec_response_get_protocol_info);
> > >
> > >       ret = send_command(ec_dev, msg);
> > > +     /*
> > > +      * Send command once again when timeout occurred.
> > > +      * Fingerprint MCU (FPMCU) is restarted during system boot which
> > > +      * introduces small window in which FPMCU won't respond for any
> > > +      * messages sent by kernel. There is no need to wait before next
> > > +      * attempt because we waited at least EC_MSG_DEADLINE_MS.
> > > +      */
> > > +     if (ret == -ETIMEDOUT) {
> > > +             dev_warn(ec_dev->dev,
> > > +                      "Timeout to get response from EC. Retrying.\n");
> >
> > If a user sees this, what can they do?  No need to spam the kernel logs,
> > just retry.
> User can do nothing about it. I will remove this in next version of patch.
> >
> > > +             ret = send_command(ec_dev, msg);
> >
> > But wait, why just retry once?  Why not 10 times?  100?  1000?  How
> > about a simple loop here instead, with a "sane" number of retries as a
> > max.
> EC based devices are designed to respond always or return appropriate
> status code
> when they can't process command. But this assumes that SPI is always
> ready to work.
> It's true for Embedded Controller, but not for Fingerprint MCU. So we
> can retry once,
> in case of sending message, when FPMCU is in narrow window (~1ms) when SPI is
> not initialized.
> 
> Every send_command() call can take about 200ms when device is not responding,
> so next retry will happen after 200ms, at least. If 200ms is not
> enough for FPMCU
> to initialize SPI, it's definitely something wrong with FPMCU

Ok, then just loop for 2 for this, should make it more obvious.

thanks,

greg k-h