Re: fsl_espi errors on v5.7.15

From: Chris Packham <Chris.Packham@alliedtelesis.co.nz>
To: Heiner Kallweit <hkallweit1@gmail.com>,
	"broonie@kernel.org" <broonie@kernel.org>,
	"mpe@ellerman.id.au" <mpe@ellerman.id.au>,
	"benh@kernel.crashing.org" <benh@kernel.crashing.org>,
	"paulus@samba.org" <paulus@samba.org>
Cc: "linuxppc-dev@lists.ozlabs.org" <linuxppc-dev@lists.ozlabs.org>,
	"linux-kernel@vger.kernel.org" <linux-kernel@vger.kernel.org>,
	"linux-spi@vger.kernel.org" <linux-spi@vger.kernel.org>
Subject: Re: fsl_espi errors on v5.7.15
Date: Wed, 26 Aug 2020 01:48:06 +0000	[thread overview]
Message-ID: <519c3068-6c73-c17a-2016-1afe2a1d12f7@alliedtelesis.co.nz> (raw)
In-Reply-To: <31e91237-4f87-66da-dac3-2304779e5749@alliedtelesis.co.nz>

On 26/08/20 10:22 am, Chris Packham wrote:
> On 25/08/20 7:22 pm, Heiner Kallweit wrote:
>
> <snip>
>> I've been staring at spi-fsl-espi.c for while now and I think I've
>>> identified a couple of deficiencies that may or may not be related 
>>> to my
>>> issue.
>>>
>>> First I think the 'Transfer done but SPIE_DON isn't set' message can be
>>> generated spuriously. In fsl_espi_irq() we read the ESPI_SPIE register.
>>> We also write back to it to clear the current events. We re-read it in
>>> fsl_espi_cpu_irq() and complain when SPIE_DON is not set. But we can
>>> naturally end up in that situation if we're doing a large read. 
>>> Consider
>>> the messages for reading a block of data from a spi-nor chip
>>>
>>>    tx = READ_OP + ADDR
>>>    rx = data
>>>
>>> We setup the transfer and pump out the tx_buf. The first interrupt goes
>>> off and ESPI_SPIE has SPIM_DON and SPIM_RXT set. We empty the rx fifo,
>>> clear ESPI_SPIE and wait for the next interrupt. The next interrupt
>>> fires and this time we have ESPI_SPIE with just SPIM_RXT set. This
>>> continues until we've received all the data and we finish with 
>>> ESPI_SPIE
>>> having only SPIM_RXT set. When we re-read it we complain that SPIE_DON
>>> isn't set.
>>>
>>> The other deficiency is that we only get an interrupt when the 
>>> amount of
>>> data in the rx fifo is above FSL_ESPI_RXTHR. If there are fewer than
>>> FSL_ESPI_RXTHR left to be received we will never pull them out of 
>>> the fifo.
>>>
>> SPIM_DON will trigger an interrupt once the last characters have been
>> transferred, and read the remaining characters from the FIFO.
>
> The T2080RM that I have says the following about the DON bit
>
> "Last character was transmitted. The last character was transmitted 
> and a new command can be written for the next frame."
>
> That does at least seem to fit with my assertion that it's all about 
> the TX direction. But the fact that it doesn't happen all the time 
> throws some doubt on it.
>
>> I think the reason I'm seeing some variability is because of how fast
>>> (or slow) the interrupts get processed and how fast the spi-nor chip 
>>> can
>>> fill the CPUs rx fifo.
>>>
>> To rule out timing issues at high bus frequencies I initially asked
>> for re-testing at lower frequencies. If you e.g. limit the bus to 1 MHz
>> or even less, then timing shouldn't be an issue.
> Yes I've currently got spi-max-frequency = <1000000>; in my dts. I 
> would also expect a slower frequency would fit my "DON is for TX" 
> narrative.
>> Last relevant functional changes have been done almost 4 years ago.
>> And yours is the first such report I see. So question is what could 
>> be so
>> special with your setup that it seems you're the only one being 
>> affected.
>> The scenarios you describe are standard, therefore much more people
>> should be affected in case of a driver bug.
> Agreed. But even on my hardware (which may have a latent issue despite 
> being in the field for going on 5 years) the issue only triggers under 
> some fairly specific circumstances.
>> You said that kernel config impacts how frequently the issue happens.
>> Therefore question is what's the diff in kernel config, and how could
>> the differences be related to SPI.
>
> It did seem to be somewhat random. Things like CONFIG_PREEMPT have an 
> impact but every time I found something that seemed to be having an 
> impact I've been able to disprove it. I actually think its about how 
> busy the system is which may or may not affect when we get round to 
> processing the interrupts.
>
> I have managed to get the 'Transfer done but SPIE_DON isn't set!' to 
> occur on the T2080RDB.
>
> I've had to add the following to expose the environment as a mtd 
> partition
>
> diff --git a/arch/powerpc/boot/dts/fsl/t208xrdb.dtsi 
> b/arch/powerpc/boot/dts/fsl/t208xrdb.dtsi
> index ff87e67c70da..fbf95fc1fd68 100644
> --- a/arch/powerpc/boot/dts/fsl/t208xrdb.dtsi
> +++ b/arch/powerpc/boot/dts/fsl/t208xrdb.dtsi
> @@ -116,6 +116,15 @@ flash@0 {
>                                 compatible = "micron,n25q512ax3", 
> "jedec,spi-nor";
>                                 reg = <0>;
>                                 spi-max-frequency = <10000000>; /* 
> input clock */
> +
> +                               partition@u-boot {
> +                                        reg = <0x00000000 0x00100000>;
> +                                        label = "u-boot";
> +                                };
> +                                partition@u-boot-env {
> +                                        reg = <0x00100000 0x00010000>;
> +                                        label = "u-boot-env";
> +                                };
>                         };
>                 };
>
> And I'm using the following script to poke at the environment (warning 
> if anyone does try this and the bug hits it can render your u-boot 
> environment invalid).
>
> cat flash/fw_env_test.sh
> #!/bin/sh
>
> generate_fw_env_config()
> {
>   cat /proc/mtd | sed 's/[:"]//g' | while read dev size erasesize name 
> ; do
>      echo "$dev $size $erasesize $name"
>      [ "$name" = "u-boot-env" ] && echo "/dev/$dev 0x0000 0x2000 
> $erasesize" >/flash/fw_env.config
>   done
> }
>
> cycles=10
> [ $# -ge 1 ] && cycles=$1
>
> generate_fw_env_config
>
> fw_printenv -c /flash/fw_env.config
>
> dmesg -c >/dev/null
> x=0
> while [ $x -lt $cycles ]; do
>     fw_printenv -c /flash/fw_env.config >/dev/null || break
>     fw_setenv -c /flash/fw_env.config foo $RANDOM || break;
>     dmesg -c | grep -q fsl_espi && break;
>     let x=x+1
> done
>
> echo "Ran $x cycles"

I've also now seen the RX FIFO not empty error on the T2080RDB

fsl_espi ffe110000.spi: Transfer done but SPIE_DON isn't set!
fsl_espi ffe110000.spi: Transfer done but SPIE_DON isn't set!
fsl_espi ffe110000.spi: Transfer done but SPIE_DON isn't set!
fsl_espi ffe110000.spi: Transfer done but SPIE_DON isn't set!
fsl_espi ffe110000.spi: Transfer done but rx/tx fifo's aren't empty!
fsl_espi ffe110000.spi: SPIE_RXCNT = 1, SPIE_TXCNT = 32

With my current workaround of emptying the RX FIFO. It seems survivable. 
Interestingly it only ever seems to be 1 extra byte in the RX FIFO and 
it seems to be after either a READ_SR or a READ_FSR.

fsl_espi ffe110000.spi: tx 70
fsl_espi ffe110000.spi: rx 03
fsl_espi ffe110000.spi: Extra RX 00
fsl_espi ffe110000.spi: Transfer done but SPIE_DON isn't set!
fsl_espi ffe110000.spi: Transfer done but rx/tx fifo's aren't empty!
fsl_espi ffe110000.spi: SPIE_RXCNT = 1, SPIE_TXCNT = 32
fsl_espi ffe110000.spi: tx 05
fsl_espi ffe110000.spi: rx 00
fsl_espi ffe110000.spi: Extra RX 03
fsl_espi ffe110000.spi: Transfer done but SPIE_DON isn't set!
fsl_espi ffe110000.spi: Transfer done but rx/tx fifo's aren't empty!
fsl_espi ffe110000.spi: SPIE_RXCNT = 1, SPIE_TXCNT = 32
fsl_espi ffe110000.spi: tx 05
fsl_espi ffe110000.spi: rx 00
fsl_espi ffe110000.spi: Extra RX 03

 From all the Micron SPI-NOR datasheets I've got access to it is 
possible to continually read the SR/FSR. But I've no idea why it happens 
some times and not others.