Re: fsl_espi errors on v5.7.15

From: Heiner Kallweit <hkallweit1@gmail.com>
To: Chris Packham <Chris.Packham@alliedtelesis.co.nz>
Cc: "linux-kernel@vger.kernel.org" <linux-kernel@vger.kernel.org>,
	Nicholas Piggin <npiggin@gmail.com>,
	"linux-spi@vger.kernel.org" <linux-spi@vger.kernel.org>,
	"broonie@kernel.org" <broonie@kernel.org>,
	"paulus@samba.org" <paulus@samba.org>,
	"linuxppc-dev@lists.ozlabs.org" <linuxppc-dev@lists.ozlabs.org>
Subject: Re: fsl_espi errors on v5.7.15
Date: Fri, 4 Sep 2020 19:23:05 +0200	[thread overview]
Message-ID: <CAFSsGVvRMQoEoBN1hpao_4uM1yF2wwuKPbMkAcwFWyE1X8HDbQ@mail.gmail.com> (raw)
In-Reply-To: <6054f0ec-d994-105b-6399-6cdb65ddd1b6@alliedtelesis.co.nz>

[-- Attachment #1: Type: text/plain, Size: 10949 bytes --]

On Fri 4. Sep 2020 at 01:58, Chris Packham <
Chris.Packham@alliedtelesis.co.nz> wrote:

>
>
> On 1/09/20 6:14 pm, Nicholas Piggin wrote:
>
> > Excerpts from Chris Packham's message of September 1, 2020 11:25 am:
>
> >> On 1/09/20 12:33 am, Heiner Kallweit wrote:
>
> >>> On 30.08.2020 23:59, Chris Packham wrote:
>
> >>>> On 31/08/20 9:41 am, Heiner Kallweit wrote:
>
> >>>>> On 30.08.2020 23:00, Chris Packham wrote:
>
> >>>>>> On 31/08/20 12:30 am, Nicholas Piggin wrote:
>
> >>>>>>> Excerpts from Chris Packham's message of August 28, 2020 8:07 am:
>
> >>>>>> <snip>
>
> >>>>>>
>
> >>>>>>>>>>>> I've also now seen the RX FIFO not empty error on the T2080RDB
>
> >>>>>>>>>>>>
>
> >>>>>>>>>>>> fsl_espi ffe110000.spi: Transfer done but SPIE_DON isn't set!
>
> >>>>>>>>>>>> fsl_espi ffe110000.spi: Transfer done but SPIE_DON isn't set!
>
> >>>>>>>>>>>> fsl_espi ffe110000.spi: Transfer done but SPIE_DON isn't set!
>
> >>>>>>>>>>>> fsl_espi ffe110000.spi: Transfer done but SPIE_DON isn't set!
>
> >>>>>>>>>>>> fsl_espi ffe110000.spi: Transfer done but rx/tx fifo's aren't
> empty!
>
> >>>>>>>>>>>> fsl_espi ffe110000.spi: SPIE_RXCNT = 1, SPIE_TXCNT = 32
>
> >>>>>>>>>>>>
>
> >>>>>>>>>>>> With my current workaround of emptying the RX FIFO. It seems
>
> >>>>>>>>>>>> survivable. Interestingly it only ever seems to be 1 extra
> byte in the
>
> >>>>>>>>>>>> RX FIFO and it seems to be after either a READ_SR or a
> READ_FSR.
>
> >>>>>>>>>>>>
>
> >>>>>>>>>>>> fsl_espi ffe110000.spi: tx 70
>
> >>>>>>>>>>>> fsl_espi ffe110000.spi: rx 03
>
> >>>>>>>>>>>> fsl_espi ffe110000.spi: Extra RX 00
>
> >>>>>>>>>>>> fsl_espi ffe110000.spi: Transfer done but SPIE_DON isn't set!
>
> >>>>>>>>>>>> fsl_espi ffe110000.spi: Transfer done but rx/tx fifo's aren't
> empty!
>
> >>>>>>>>>>>> fsl_espi ffe110000.spi: SPIE_RXCNT = 1, SPIE_TXCNT = 32
>
> >>>>>>>>>>>> fsl_espi ffe110000.spi: tx 05
>
> >>>>>>>>>>>> fsl_espi ffe110000.spi: rx 00
>
> >>>>>>>>>>>> fsl_espi ffe110000.spi: Extra RX 03
>
> >>>>>>>>>>>> fsl_espi ffe110000.spi: Transfer done but SPIE_DON isn't set!
>
> >>>>>>>>>>>> fsl_espi ffe110000.spi: Transfer done but rx/tx fifo's aren't
> empty!
>
> >>>>>>>>>>>> fsl_espi ffe110000.spi: SPIE_RXCNT = 1, SPIE_TXCNT = 32
>
> >>>>>>>>>>>> fsl_espi ffe110000.spi: tx 05
>
> >>>>>>>>>>>> fsl_espi ffe110000.spi: rx 00
>
> >>>>>>>>>>>> fsl_espi ffe110000.spi: Extra RX 03
>
> >>>>>>>>>>>>
>
> >>>>>>>>>>>>      From all the Micron SPI-NOR datasheets I've got access
> to it is
>
> >>>>>>>>>>>> possible to continually read the SR/FSR. But I've no idea why
> it
>
> >>>>>>>>>>>> happens some times and not others.
>
> >>>>>>>>>>> So I think I've got a reproduction and I think I've bisected
> the problem
>
> >>>>>>>>>>> to commit 3282a3da25bd ("powerpc/64: Implement soft interrupt
> replay in
>
> >>>>>>>>>>> C"). My day is just finishing now so I haven't applied too
> much scrutiny
>
> >>>>>>>>>>> to this result. Given the various rabbit holes I've been down
> on this
>
> >>>>>>>>>>> issue already I'd take this information with a good degree of
> skepticism.
>
> >>>>>>>>>>>
>
> >>>>>>>>>> OK, so an easy test should be to re-test with a 5.4 kernel.
>
> >>>>>>>>>> It doesn't have yet the change you're referring to, and the
> fsl-espi driver
>
> >>>>>>>>>> is basically the same as in 5.7 (just two small changes in 5.7).
>
> >>>>>>>>> There's 6cc0c16d82f88 and maybe also other interrupt related
> patches
>
> >>>>>>>>> around this time that could affect book E, so it's good if that
> exact
>
> >>>>>>>>> patch is confirmed.
>
> >>>>>>>> My confirmation is basically that I can induce the issue in a 5.4
> kernel
>
> >>>>>>>> by cherry-picking 3282a3da25bd. I'm also able to "fix" the issue
> in
>
> >>>>>>>> 5.9-rc2 by reverting that one commit.
>
> >>>>>>>>
>
> >>>>>>>> I both cases it's not exactly a clean cherry-pick/revert so I also
>
> >>>>>>>> confirmed the bisection result by building at 3282a3da25bd (which
> sees
>
> >>>>>>>> the issue) and the commit just before (which does not).
>
> >>>>>>> Thanks for testing, that confirms it well.
>
> >>>>>>>
>
> >>>>>>> [snip patch]
>
> >>>>>>>
>
> >>>>>>>> I still saw the issue with this change applied.
> PPC_IRQ_SOFT_MASK_DEBUG
>
> >>>>>>>> didn't report anything (either with or without the change above).
>
> >>>>>>> Okay, it was a bit of a shot in the dark. I still can't see what
>
> >>>>>>> else has changed.
>
> >>>>>>>
>
> >>>>>>> What would cause this, a lost interrupt? A spurious interrupt? Or
>
> >>>>>>> higher interrupt latency?
>
> >>>>>>>
>
> >>>>>>> I don't think the patch should cause significantly worse latency,
>
> >>>>>>> (it's supposed to be a bit better if anything because it doesn't
> set
>
> >>>>>>> up the full interrupt frame). But it's possible.
>
> >>>>>> My working theory is that the SPI_DON indication is all about the TX
>
> >>>>>> direction an now that the interrupts are faster we're hitting an
> error
>
> >>>>>> because there is still RX activity going on. Heiner disagrees with
> my
>
> >>>>>> interpretation of the SPI_DON indication and the fact that it
> doesn't
>
> >>>>>> happen every time does throw doubt on it.
>
> >>>>>>
>
> >>>>> It's right that the eSPI spec can be interpreted that SPI_DON refers
> to
>
> >>>>> TX only. However this wouldn't really make sense, because also for RX
>
> >>>>> we program the frame length, and therefore want to be notified once
> the
>
> >>>>> full frame was received. Also practical experience shows that SPI_DON
>
> >>>>> is set also after RX-only transfers.
>
> >>>>> Typical SPI NOR use case is that you write read command + start
> address,
>
> >>>>> followed by a longer read. If the TX-only interpretation would be
> right,
>
> >>>>> we'd always end up with SPI_DON not being set.
>
> >>>>>
>
> >>>>>> I can't really explain the extra RX byte in the fifo. We know how
> many
>
> >>>>>> bytes to expect and we pull that many from the fifo so it's not as
> if
>
> >>>>>> we're missing an interrupt causing us to skip the last byte. I've
> been
>
> >>>>>> looking for some kind of off-by-one calculation but again if it were
>
> >>>>>> something like that it'd happen all the time.
>
> >>>>>>
>
> >>>>> Maybe it helps to know what value this extra byte in the FIFO has.
> Is it:
>
> >>>>> - a duplicate of the last read byte
>
> >>>>> - or the next byte (at <end address> + 1)
>
> >>>>> - or a fixed value, e.g. always 0x00 or 0xff
>
> >>>> The values were up thread a bit but I'll repeat them here
>
> >>>>
>
> >>>> fsl_espi ffe110000.spi: tx 70
>
> >>>> fsl_espi ffe110000.spi: rx 03
>
> >>>> fsl_espi ffe110000.spi: Extra RX 00
>
> >>>> fsl_espi ffe110000.spi: Transfer done but SPIE_DON isn't set!
>
> >>>> fsl_espi ffe110000.spi: Transfer done but rx/tx fifo's aren't empty!
>
> >>>> fsl_espi ffe110000.spi: SPIE_RXCNT = 1, SPIE_TXCNT = 32
>
> >>>> fsl_espi ffe110000.spi: tx 05
>
> >>>> fsl_espi ffe110000.spi: rx 00
>
> >>>> fsl_espi ffe110000.spi: Extra RX 03
>
> >>>> fsl_espi ffe110000.spi: Transfer done but SPIE_DON isn't set!
>
> >>>> fsl_espi ffe110000.spi: Transfer done but rx/tx fifo's aren't empty!
>
> >>>> fsl_espi ffe110000.spi: SPIE_RXCNT = 1, SPIE_TXCNT = 32
>
> >>>> fsl_espi ffe110000.spi: tx 05
>
> >>>> fsl_espi ffe110000.spi: rx 00
>
> >>>> fsl_espi ffe110000.spi: Extra RX 03
>
> >>>>
>
> >>>>
>
> >>>> The rx 00 Extra RX 03 is a bit concerning. I've only ever seen them
> with
>
> >>>> either a READ_SR or a READ_FSR. Never a data read.
>
> >>>>
>
> >>> Just remembered something about SPIE_DON:
>
> >>> Transfers are always full duplex, therefore in case of a read the chip
>
> >>> sends dummy zero's. Having said that in case of a read SPIE_DON means
>
> >>> that the last dummy zero was shifted out.
>
> >>>
>
> >>> READ_SR and READ_FSR are the shortest transfers, 1 byte out and 1 byte
> in.
>
> >>> So the issue may have a dependency on the length of the transfer.
>
> >>> However I see no good explanation so far. You can try adding a delay of
>
> >>> a few miroseconds between the following to commands in fsl_espi_bufs().
>
> >>>
>
> >>>     fsl_espi_write_reg(espi, ESPI_SPIM, mask);
>
> >>>
>
> >>>     /* Prevent filling the fifo from getting interrupted */
>
> >>>     spin_lock_irq(&espi->lock);
>
> >>>
>
> >>> Maybe enabling interrupts and seeing the SPIE_DON interrupt are too
> close.
>
> >> I think this might be heading in the right direction. Playing about with
>
> >> a delay does seem to make the two symptoms less likely. Although I have
>
> >> to set it quite high (i.e. msleep(100)) to completely avoid any
>
> >> possibility of seeing either message.
>
> > The patch might replay the interrupt a little bit faster, but it would
>
> > be a few microseconds at most I think (just from improved code).
>
> >
>
> > Would you be able to ftrace the interrupt handler function and see if you
>
> > can see a difference in number or timing of interrupts? I'm at a bit of
>
> > a loss.
>
>
>
> I tried ftrace but I really wasn't sure what I was looking for.
>
> Capturing a "bad" case was pretty tricky. But I think I've identified a
>
> fix (I'll send it as a proper patch shortly). The gist is
>
>
>
> diff --git a/drivers/spi/spi-fsl-espi.c b/drivers/spi/spi-fsl-espi.c
>
> index 7e7c92cafdbb..cb120b68c0e2 100644
>
> --- a/drivers/spi/spi-fsl-espi.c
>
> +++ b/drivers/spi/spi-fsl-espi.c
>
> @@ -574,13 +574,14 @@ static void fsl_espi_cpu_irq(struct fsl_espi
>
> *espi, u32 events)
>
>   static irqreturn_t fsl_espi_irq(s32 irq, void *context_data)
>
>   {
>
>          struct fsl_espi *espi = context_data;
>
> -       u32 events;
>
> +       u32 events, mask;
>
>
>
>          spin_lock(&espi->lock);
>
>
>
>          /* Get interrupt events(tx/rx) */
>
>          events = fsl_espi_read_reg(espi, ESPI_SPIE);
>
> -       if (!events) {
>
> +       mask = fsl_espi_read_reg(espi, ESPI_SPIM);
>
> +       if (!(events & mask)) {
>
>                  spin_unlock(&espi->lock);
>
>                  return IRQ_NONE;
>
>          }
>
>
>
> The SPIE register contains the TXCNT so events is pretty much always
>
> going to have something set. By checking events against what we've
>
> actually requested interrupts for we don't see any spurious events.
>
>
Usually we shouldn’t receive interrupts we’re not interested in, except the
interrupt is shared. This leads to the question: is the SPI interrupt
shared with another device on your system? Do you see spurious interrupts
with the patch under /proc/irq/(irq)/spurious?


>
> I've tested this on the T2080RDB and on our custom hardware and it seems
>
> to resolve the problem.
>
>
>
>

[-- Attachment #2: Type: text/html, Size: 14916 bytes --]